About
The open source project Strus provides a collection of libraries and command line tools written in C++ for building a competitive, scalable full-text search engine.
The Strus search engine can be built on any key/value store database that provides an upper bound seek function for the stored key/value pairs. Currently, there is an implementation based on the LevelDB library.
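As a rough illustration of what such a store has to offer, the following Python sketch models an ordered key/value store with a seek operation. It assumes that "upper bound seek" means positioning at the smallest stored key that is greater than or equal to the search key, which is how LevelDB's iterator Seek behaves. The class SortedKeyValueStore and its method names are hypothetical and are not part of the Strus or LevelDB API.

```python
import bisect


class SortedKeyValueStore:
    """Toy in-memory stand-in for an ordered key/value store (hypothetical, not the Strus API)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        # Insert the key at its sorted position only if it is new.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def seek_upper_bound(self, key: bytes):
        """Return the first (key, value) pair whose key is >= the given key, or None."""
        pos = bisect.bisect_left(self._keys, key)
        if pos == len(self._keys):
            return None
        found = self._keys[pos]
        return found, self._values[found]


store = SortedKeyValueStore()
for k, v in [(b"term:apple", b"1"), (b"term:cherry", b"3"), (b"term:banana", b"2")]:
    store.put(k, v)

hit = store.seek_upper_bound(b"term:b")
miss = store.seek_upper_bound(b"term:z")
```

Here hit is the pair stored under b"term:banana", the first key at or past the search key, and miss is None because no stored key is at or past b"term:z".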
Demo system
There is a demo system online for Strus:
A search on the complete Wikipedia collection (English), with a description of how the system can be installed and the index built.
(ALERT: The Intel NUC machine hosting the demo search broke down. The demo search is currently NOT available.)
Tutorials
A tutorial for building a simple search engine with PHP is available here.
Another tutorial, for building a nontrivial search engine that also covers the insert case, using Python and the Tornado web framework, is available here.
An introduction that shows how to write dynamically loadable extensions of the Strus core in C++ can be found here.
An article that shows how scalable search engines can be built with Strus (distributing the search index) can be found here.
Installation
For installation, see the description files INSTALL.<platform> in the top level directory of each project. The descriptions of the project installations are intended to be self-contained. For example, for Ubuntu: strusBase, strus, strusAnalyzer, strusTrace, strusModule, strusRpc, strusVector, strusPattern, strusUtilities, strusBindings and strusAll.
Regular Builds
The software is built regularly on Travis (on commit to the master branch) and on the OpenSUSE build cluster (currently triggered manually).
Documentation
The documentation is a work in progress. What is available can be found here.
The inverted case of pattern matching
The documentation for high-performance pattern matching with many thousands of patterns can be found here. StrusPattern allows you to define tokens with regular expressions on text and to define patterns with these tokens as the alphabet. With the help of Hyperscan, which builds one DFA for all regular expressions, StrusPattern achieves competitive performance. Approximate matching on Unicode strings with edit distance is also supported.
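The idea of patterns over a token alphabet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Python's re module instead of Hyperscan, the token names NUMBER and WORD and the helper functions tokenize and find_pattern are hypothetical, and a pattern is simplified to a fixed sequence of token types. It does not reflect the StrusPattern API or its pattern language.

```python
import re

# Token definitions: each token type is described by a regular expression on text.
TOKEN_DEFS = [
    ("NUMBER", r"\d+"),
    ("WORD", r"[^\W\d_]+"),  # one or more Unicode letters
]

# All token expressions are combined into one alternation of named groups.
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_DEFS))


def tokenize(text):
    """Return the list of (token type, token value) pairs found in the text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def find_pattern(tokens, pattern):
    """Yield the token values of every window whose token types equal the pattern."""
    n = len(pattern)
    types = [token_type for token_type, _ in tokens]
    for i in range(len(tokens) - n + 1):
        if types[i:i + n] == pattern:
            yield [value for _, value in tokens[i:i + n]]


tokens = tokenize("Room 12 is next to room 7b")
matches = list(find_pattern(tokens, ["WORD", "NUMBER"]))
```

The pattern ["WORD", "NUMBER"] is expressed purely in terms of token types, so it matches both "Room 12" and "room 7" without referring to the characters themselves.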
Story
Why build yet another search engine? Here I describe my motivation and the story of Strus. I also try to explain what distinguishes Strus from other search engine software.
News
After a long pause due to illness and recovery, I am trying to bring up the Wikipedia search again. During the recovery I did something completely different: Mewa, a compiler-compiler for prototyping compiler front-ends, including a proof-of-concept language.
The Intel NUC machine hosting the demo search broke down and I have not yet been able to bring it up again. The demo search is currently NOT available.
Continuing work on the web service and the replacement of the strus demo search on Wikipedia.
Due to private issues, the resources for further development of strus are not available from now until mid-June 2020.
The strus demo search is currently not reachable due to a NAT problem. I changed my provider. The strus demo search on Wikipedia will be replaced soon.
The Wikipedia demo search will soon be replaced by a version run by the web service, which operates as close to REST as possible.
The project strusWebService is now built as part of strusAll if enabled (WITH_STRUS_WEBSERVICE=YES). The JSON and XML schemas for the web requests are now generated in the build of the strusWebService. Currently the generated schemas are only informal (not tested yet). The next step is to rebuild the demo search on Wikipedia with the web service, using the recently implemented pieces such as the new query sentence analysis.
Version 0.17 of strus is out.
The vector storage interface for the representation of word embeddings (word2vec and friends) has been rewritten. Things that have proven to be useless, like the categorization of vectors, have been dropped. Some issues have been solved in the Travis build. There are still some problems with the OSX build of the strusVector project.
Though it's obviously not perfect yet, POS tagging is getting closer to being usable. Here you find an example document.
Documentation about the POS tagging is available here. It is a work in progress and therefore not complete yet.
I gave up on the idea of using Tensorflow/Syntaxnet for POS tagging of the Wikipedia collection and switched back to NLTK. With the hardware I have, I was not able to get the processing rate I needed using Tensorflow/Syntaxnet.
The Wikimedia to XML conversion is working and decent. Structures for retrieval have been implemented. POS (part of speech) tagging and the dedicated machine for Tensorflow are also in place. Development will be paused until mid-September, as I will be on a longer vacation. The next step will be the POS tagging of the collection.
The Wikimedia to XML conversion has been rewritten and a better XML schema has been defined. Documentation can be found here. The program takes the Wikipedia XML dump and creates one XML file for every document. The XML schema used tries to map the Wikipedia content in a way appropriate for information retrieval. It is intended to be usable for other projects outside the strus context too. Currently, the project strusWikipediaSearch is in an intermediate state and needs cleanup, but it should be possible to build it.
Some new bugs appeared in the Wikipedia demo search with the latest updates of the Python bindings. Query terms with non-ASCII characters like umlauts are not normalized correctly anymore. Summarization also seems to be broken: some characters at the end of words are swallowed. We will fix these issues soon.
The Wikipedia demo search has been updated and is now running with the latest software. The development of the strusWebService is ongoing.
We reactivated the OpenSUSE builds and solved the issues related to the Travis build. There are still some open issues left in the OpenSUSE builds, especially with the language bindings, but we intend to solve them soon.
Improved language bindings based on Papuga. Data structures can now be passed as classes with data members, also for the PHP7 and Python3 bindings. Return value structures can be declared as structures with data members instead of being returned as dictionaries.
The documentation of the language bindings for Lua, PHP7 and Python3 based on Papuga is available now. Click on the language logos in the sidebar.
All Travis builds for the Strus GitHub projects are visible now. They provide automated builds on commits to the master branch. Most of the builds show errors, but mainly for OSX. The Linux builds should work.
We opened a Travis build for the project strusAll (containing all of the Strus projects) with automated builds on commits to the master branch.
We have temporarily given up packaging of Strus. We have to decide what type of deployment and infrastructure we will actively support in the future. Probably a reduced set of platforms and images.
We are temporarily giving up support for Java bindings. Strus will only support language bindings for value typed scripting languages like PHP, Python, Lua and Node.js. Java will be supported via a proxy calling Strus running as a web service.
We are currently working on a replacement of the language bindings. The Lua bindings are finished, and PHP 7, Python 3.0 and other value typed scripting languages will follow soon. The previous version of the language bindings was a big obstacle and risk for further development.
Updated documentation of the domain specific languages used by the command line utilities to describe document and query analysis, query evaluation and the query language.
New GitHub project strusAll to simplify out-of-source builds of Strus.
Changed status from pre-Alpha to Alpha as most of the features are available and stable. Strus is already successfully used in projects. But the interface may still undergo changes and the versioning does not follow strict rules yet.
Improved the Wikipedia demo search. The search takes references to documents close to query terms into account besides weighting the appearance of query terms in documents. We also added a search for the closest vectors (cosine similarity) to the entities appearing in the query. The vectors (about 10 million vectors with dimension 300) were created with word2vec on the Wikipedia collection.
We started a new project StrusPattern for deeper document analysis.
The work on the strusWebservice is ongoing. A slide show that helps to figure out where we are heading in this project can be found here.
There is a Travis build available now for strus at travis-ci.org
We have implemented an NGRAM normalizer and a tokenizer and normalizer for terms defined by regular expressions on text. You find all functions implemented so far here.
A document segmenter for JSON (based on the cJSON library) has been implemented for the Strus analyzer project. Selection expressions are also formulated in the abbreviated syntax of XPath as for XML.
A new article has been published that shows how to create call traces for Strus for debugging, statistical analysis and deeper understanding of the software.
Strus can be built on OS X. Unfortunately we cannot provide packages yet. But at least you can build the software on your own.
The project is now sponsored by Eurospider. The feedback from sophisticated retrieval projects and the development of new components such as the Strus webservice project will move Strus forward.
Providing some query performance numbers for the Wikipedia demo system running on an Intel NUC.
License of Strus changed from GPLv3 to MPLv2 (Mozilla Public License, Version 2.0). We were looking for a license that on the one hand protects the work done for Strus and on the other hand allows users of Strus to attribute their own work in their own way, even as closed source. We think that the Mozilla Public License, Version 2.0 meets these requirements best. Fortunately, Strus never included or linked against any pervasively licensed code, so this change of the software license is possible with the agreement of all contributors.
The Wikipedia demo search engine is now hosted on an Intel NUC. Read more...
New article published on CodeProject about writing Strus extension modules in C++.
Article that shows how scalable search engines can be built with Strus.
A tutorial for building a search engine with Strus based on Python and the Tornado web framework has been published on CodeProject.
The NBLNK weighting scheme for the Wikipedia demo is now online. This weighting scheme does not match documents against the query, but ranks the links in documents by weighting the sentences the links appear in against the query. It is a good example of the information extraction capabilities of Strus.
Packages of the latest build are available now.
Python bindings are available now.
Language bindings for Java are now available.
A Docker image and a tutorial are available for Strus.
Started advertising Strus to get some feedback and maybe even some support.
The demo project of a search engine for the Wikipedia collection (English) is online.
The demo project of a search engine for the Wikipedia collection (English) is close to being finished.
© 2015 Patrick Frey