Strus

About

The open source project Strus provides a collection of libraries and command line tools written in C++ for building a competitive, scalable full-text search engine.
The Strus search engine can be built using any key value store database that provides an upper bound seek function for the stored key/value pairs. Currently, there exists an implementation based on the LevelDB library.
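To illustrate what such a seek function provides, here is a minimal sketch in Python. It assumes that "upper bound seek" means positioning on the smallest stored key that is greater than or equal to a search key. The class `SortedKV`, its methods, and the key layout are hypothetical and are not part of Strus or LevelDB.

```python
import bisect


class SortedKV:
    """Hypothetical in-memory ordered key/value store (not a Strus or LevelDB API)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        # Insert the key at its sorted position only if it is new.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def seek_upper_bound(self, key):
        """Return the (key, value) pair with the smallest stored key >= key, or None."""
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys):
            return None
        found = self._keys[i]
        return found, self._values[found]


store = SortedKV()
# Assumed key layout for illustration: (term, document number) -> term frequency.
store.put(("search", 3), 2)
store.put(("search", 17), 1)
store.put(("strus", 5), 4)

# Skip to the first entry for "search" at or after document number 10.
hit = store.seek_upper_bound(("search", 10))
```

With a seek of this kind, an iterator over the entries of a term can jump forward to a candidate document instead of scanning every entry, which is presumably why it is the one operation required from the underlying store.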

Demo system

There is a demo system online for Strus:
A search on the complete Wikipedia collection (English), with a description of how the system can be installed and the index built. (ALERT: The Intel NUC machine hosting the demo search broke down. The demo search is currently NOT available.)

Tutorials

A tutorial for building a simple search engine with PHP is available here.
Another tutorial, for building a nontrivial search engine with Python and the Tornado web framework that also covers the insert case, is available here.
An introduction that shows how to write dynamically loadable extensions of the Strus core in C++ can be found here.
An article that shows how scalable search engines can be built with Strus (distributing the search index) can be found here.

Installation

For installation, see the description files INSTALL.<platform> in the top level directory of each project. The descriptions of the project installations are intended to be self-contained. For example, for Ubuntu: strusBase, strus, strusAnalyzer, strusTrace, strusModule, strusRpc, strusVector, strusPattern, strusUtilities, strusBindings and strusAll.

Regular Builds

The software is built regularly on Travis (on commit of the master branch) and on the OpenSUSE build cluster (currently triggered manually).

Documentation

The documentation is a work in progress. What is available can be found here.

The inverted case of pattern matching

The documentation for doing high performance pattern matching with many thousands of patterns can be found here. It allows you to define tokens with regular expressions in text and to define patterns with these tokens as the alphabet. With the help of Hyperscan, which builds one DFA for all regular expressions, StrusPattern achieves competitive performance. Approximate matching on Unicode strings with edit distance is also supported.
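The idea of tokens defined by regular expressions and patterns defined over those tokens can be sketched in Python. This is only an illustration using the standard `re` module. It does not use Hyperscan or the StrusPattern API, and the token names, the rules, and the helpers `tokenize` and `find_patterns` are all made up for this example.

```python
import re

# Hypothetical token definitions: name -> regular expression on text.
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("CAPWORD", r"[A-Z][a-z]+"),
    ("WORD", r"[a-z]+"),
]
# One combined expression with a named group per token type.
LEXER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))

# Hypothetical patterns: name -> sequence of token types (the tokens are the alphabet).
PATTERNS = {
    "full_name": ["CAPWORD", "CAPWORD"],
    "date_like": ["NUMBER", "CAPWORD", "NUMBER"],
}


def tokenize(text):
    """Return a list of (token type, matched text) in order of appearance."""
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(text)]


def find_patterns(text):
    """Return (pattern name, matched text) for every pattern occurrence."""
    tokens = tokenize(text)
    types = [t for t, _ in tokens]
    results = []
    for start in range(len(tokens)):
        for name, seq in PATTERNS.items():
            # A pattern matches where the token types equal its sequence.
            if types[start:start + len(seq)] == seq:
                words = [w for _, w in tokens[start:start + len(seq)]]
                results.append((name, " ".join(words)))
    return results


matches = find_patterns("on 14 March 1879 Albert Einstein was born")
```

The sketch tries every pattern at every token position, which is fine for two patterns but would not scale to many thousands. Handling that case efficiently is the problem the text above says StrusPattern addresses.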

Story

Why build yet another search engine? Here I describe my motivation and the story of Strus. I also try to explain what distinguishes Strus from other search engine software.

News

Jul 20th, 2021	After a long pause due to illness and recovery, I am trying to bring up the Wikipedia Search again. During the recovery I did something completely different: Mewa, a compiler compiler for prototyping compiler front-ends including a proof of concept language.
Jan 23rd, 2021	The Intel NUC machine hosting the demo search broke down and I was not able to bring it up again yet. The demo search is currently NOT available.
Jul 2nd, 2020	Continuing work on the web service and the replacement of the strus demo search on Wikipedia.
Apr 24th, 2020	Due to private issues, the resources for further development of strus are not available from now until mid-June 2020.
Apr 2nd, 2020	The strus demo search is currently not reachable due to a NAT problem. I changed my provider. The strus demo search on Wikipedia will be replaced soon.
Nov 6th, 2019	The Wikipedia demo search will soon be replaced by a version run by the web service that operates as close to REST as possible.
Mar 22nd, 2019	The project strusWebService is now built as part of strusAll if enabled (WITH_STRUS_WEBSERVICE=YES). The JSON and XML schemas for the web requests are now generated in the build of the strusWebService. Currently the generated schemas are only informal (not tested yet). The next step is to rebuild the demo search on Wikipedia with the web service, using the pieces recently implemented, such as the new query sentence analysis.
Jan 24th, 2019	Version 0.17 of strus is out.
Jan 11th, 2019	The vector storage interface for the representation of word embeddings (word2vec and friends) has been rewritten. Things that proved to be useless, like the categorization of vectors, have been dropped. Some issues have been solved in the Travis build. There are still some problems with the OSX build of the strusVector project.
Dec 7th, 2018	Though it's obviously not perfect yet, POS tagging is getting closer to being usable. Here you find an example document.
Nov 25th, 2018	Documentation about the POS tagging is available here. It is work in progress and therefore not complete yet.
Oct 24th, 2018	I gave up the idea of using Tensorflow/Syntaxnet for POS tagging of the Wikipedia collection and switched back to NLTK. With the hardware I have, I was not able to get the processing rate I needed using Tensorflow/Syntaxnet.
Aug 16th, 2018	The Wikimedia to XML conversion is working and decent. Structures for retrieval have been implemented. Also POS (part of speech) tagging and the dedicated machine for Tensorflow. Development will be shut down until mid-September. I will be on a longer vacation. The next step will be the POS tagging of the collection.
July 12th, 2018	The Wikimedia to XML conversion has been rewritten and a better XML schema has been defined. Documentation can be found here. The program takes the Wikipedia XML dump and creates one XML file for every document. The XML schema used tries to map the Wikipedia content in a way appropriate for information retrieval. It is intended to be usable for other projects outside the strus context too. Currently, the project strusWikipediaSearch is in an intermediate state and needs cleanup, but it should be possible to build it.
June 13th, 2018	Some new bugs appeared in the Wikipedia demo search with the latest updates of the Python bindings. Query terms with non-ASCII characters like umlauts are not normalized correctly anymore. Summarization also seems to be broken: some characters at the end of words are swallowed. We will fix these issues soon.
May 15th, 2018	The Wikipedia demo search has been updated and is now running with the latest software. The development of the strusWebService is ongoing.
Jan 3rd, 2018	We reactivated the OpenSUSE builds and solved the issues related to the Travis build. There are still some open issues in the OpenSUSE builds left, especially with the language bindings, but we intend to solve them soon.
Nov 12th, 2017	Improved language bindings based on Papuga. Data structures can now be passed as classes with data members, also in the PHP7 and Python3 bindings. Return value structures can be declared as structures with data members instead of returning them as dictionaries.
Sep 26th, 2017	The documentation of the language bindings for Lua, PHP7 and Python3 based on Papuga is available now. Click on the language logos in the sidebar.
Sep 22nd, 2017	All Travis builds for the Strus GitHub projects are visible now. They provide automated builds on commits on the master branch. Most of the builds show errors, but mainly for OSX. The Linux builds should work.
Sep 21st, 2017	We opened a Travis build for the project strusAll (containing all of the Strus projects) with automated builds on commits on the master branch.
Aug 20th, 2017	We have temporarily given up packaging of Strus. We have to decide what type of deployment and infrastructure we will actively support in the future. Probably a reduced set of platforms and images.
Aug 20th, 2017	We have temporarily given up support for Java bindings. Strus will only support language bindings for value typed scripting languages like PHP, Python, Lua and Node.js. Java will be supported as a proxy calling Strus running as a web service.
July 19th, 2017	We are currently working on a replacement of the language bindings. The Lua bindings are finished, and PHP 7, Python 3.0 and other value typed scripting languages will follow soon. The previous version of the language bindings was a big obstacle and risk for further development.
March 17th, 2017	Updated documentation of the domain specific languages used by the command line utilities to describe document and query analysis, query evaluation and the query language.
Feb 22nd, 2017	New GitHub project strusAll to simplify out-of-source builds of Strus.
Feb 17th, 2017	Changed status from pre-Alpha to Alpha as most of the features are available and stable. Strus is already successfully used in projects. But the interface may still undergo changes and the versioning does not follow strict rules yet.
Feb 16th, 2017	Improved the Wikipedia demo search. The search takes references to documents close to query terms into account besides weighting the appearance of query terms in documents. We also added a search for the closest vectors (cosine similarity) to the entities appearing in the query. The vectors (about 10 million vectors with dimension 300) were created with word2vec on the Wikipedia collection.
Jul 10th, 2016	We started a new project StrusPattern for deeper document analysis.
Jul 10th, 2016	The work on the strusWebService is ongoing. A slide show that helps to figure out where we are heading in this project can be found here.
Jul 10th, 2016	There is a Travis build available now for strus at travis-ci.org.
Jul 10th, 2016	We have implemented an NGRAM normalizer and a tokenizer and normalizer for terms defined by regular expressions on text. You can find all functions implemented so far here.
May 10th, 2016	A document segmenter for JSON (based on the cJSON library) has been implemented for the Strus analyzer project. Selection expressions are also formulated in the abbreviated syntax of XPath as for XML.
May 4th, 2016	A new article has been published that shows how to create call traces for Strus for debugging, statistical analysis and deeper understanding of the software.
Apr 4th, 2016	Strus can be built on OS X. Unfortunately we cannot provide packages yet. But at least you can build the software on your own.
Mar 31st, 2016	The project is now sponsored by Eurospider. The feedback from sophisticated retrieval projects and the development of new components such as the Strus web service project will bring Strus forward.
Mar 22nd, 2016	Providing some query performance numbers for the Wikipedia demo system running on an Intel NUC.
Mar 21st, 2016	License of Strus changed from GPLv3 to MPLv2 (Mozilla Public License, Version 2.0). We were looking for a license that on the one hand protects the work done for Strus and on the other hand allows users of Strus to attribute their own work in their own way, even as closed source. We think that the Mozilla Public License, Version 2.0 meets these requirements best. Fortunately, Strus never included or linked against any pervasively licensed code, so this change of the software license is possible with the agreement of all contributors.
Mar 14th, 2016	The Wikipedia demo search engine is now hosted on an Intel NUC. Read more....
Feb 9th, 2016	New article published on codeproject about writing Strus extension modules in C++.
Dec 25th, 2015	Article that shows how scalable search engines can be built with Strus.
Nov 28th, 2015	A tutorial for building a search engine with Strus based on Python and the Tornado web framework has been published on codeproject.
Nov 16th, 2015	The NBLNK weighting scheme for the Wikipedia demo is now online. This weighting scheme does not match documents against the query, but ranks the links in documents by weighting the sentences the links appear in against the query. It is a good example for the information extraction capabilities of Strus.
Oct 29th, 2015	Packages of the latest build are available now.
Oct 21st, 2015	Python bindings are available now.
Aug 17th, 2015	Language bindings for Java are now available.
Jul 1st, 2015	A Docker image and a tutorial are available for Strus.
June 15th, 2015	Started advertising Strus to get some feedback and maybe even some support.
May 27th, 2015	The demo project of a search engine for the Wikipedia collection (English) is online.
April 2nd, 2015	The demo project of a search engine for the Wikipedia collection (English) is close to being finished.