Strus background
Why build yet another search engine
Why?
Why not?
Personal motivation
The Strus project started out of sheer frustration. For two years I had been working on a project that was doomed from the beginning. Once I realized that it had to fail, I had to focus on finding a new job. Unfortunately, this proved more difficult than I had expected. I had to do something to keep myself going, and I wanted to work on a topic I knew by heart. I had worked for ten years at a company that provided information retrieval services, where I also implemented the core of the search engine behind those services. So I knew what it was about and which problems I would have to face. In September 2014 I started the Strus project.
What is Strus?
Strus is a set of components (libraries, programs, and language bindings) for building the core of a scalable full-text search engine. It aims to cover classical IR as well as structured search with arbitrarily complex expressions in the Boolean algebra of sets of term occurrences (d,p), where d references a document and p is a discrete position number. Besides matching expressions, Strus also provides a mechanism to attach variables to subexpression matches that can be referenced in the presentation of the query result.
What distinguishes Strus from other full-text search engines
Outsourcing the data storage
Strus can implement its data storage on top of any modern NoSQL key/value database that supports an upper bound seek. This reduces the complexity of the problem (the Strus core, with storage and query evaluation, has about 32000 lines of code). Alternative implementations of the database can be provided by experts on the topic.
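To make the storage requirement more concrete, here is a minimal sketch in Python. It assumes that "upper bound seek" means positioning on the smallest stored key that is greater than or equal to a given key. The class SortedKeyValueStore, its methods, the helper next_doc, and the (term, document number) key layout are all hypothetical and chosen for illustration only. They are not the Strus storage interface or its real key format.

```python
from bisect import bisect_left


class SortedKeyValueStore:
    """Toy in-memory stand-in for an ordered key/value database (hypothetical API)."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def seek_upper_bound(self, key):
        """Return the first (key, value) whose key is >= `key`, or None."""
        i = bisect_left(self._keys, key)
        if i == len(self._keys):
            return None
        found = self._keys[i]
        return found, self._values[found]


def next_doc(store, term, docno):
    """Find the first document >= docno that contains `term` (assumed key layout)."""
    hit = store.seek_upper_bound((term, docno))
    if hit is None or hit[0][0] != term:
        return None          # ran past the last key of this term
    (_, found_docno), positions = hit
    return found_docno, positions


store = SortedKeyValueStore()
store.put(("fox", 2), [4, 17])   # term "fox" occurs in document 2 at positions 4 and 17
store.put(("fox", 9), [1])
store.put(("quick", 3), [8])

print(next_doc(store, "fox", 5))
print(next_doc(store, "fox", 10))
```

With this single primitive, a posting list can be walked forward and skipped through without the database knowing anything about documents or terms, which is why the storage can be swapped out.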
Modeling of structured queries
The modeling of structured queries is more rigid than in other search engines such as Lucene. This means that you cannot do everything you can do with Lucene. Strus describes every structure as an N-ary operator on sets of pairs (document number, position). The operators are implemented as iterators and can be assembled into arbitrarily complex expression trees. The intention of this reduction is simplicity. But even with this simple approach, you can do most of the things you would expect. For example, a search for a phrase that is related by a nearness operator to another phrase, with the condition that both appear in the same sentence, can easily be expressed.
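The idea of operators as iterators on sets of (document number, position) pairs can be sketched in a few lines of Python. This is an illustration under assumptions, not the Strus implementation: the class names TermIterator and SequenceIterator, the skip method, and the helper all_matches are hypothetical, and the posting lists are plain in-memory lists.

```python
from bisect import bisect_left


class TermIterator:
    """Leaf iterator over the sorted (doc, pos) occurrences of one term."""

    def __init__(self, postings):
        self.postings = sorted(postings)

    def skip(self, doc, pos):
        """Return the smallest occurrence >= (doc, pos), or None."""
        i = bisect_left(self.postings, (doc, pos))
        return self.postings[i] if i < len(self.postings) else None


class SequenceIterator:
    """Binary operator: matches (d, p) where `first` matches at (d, p)
    and `second` matches at (d, p + 1)."""

    def __init__(self, first, second):
        self.first, self.second = first, second

    def skip(self, doc, pos):
        cand = self.first.skip(doc, pos)
        while cand is not None:
            d, p = cand
            nxt = self.second.skip(d, p + 1)
            if nxt is None:
                return None              # no further occurrence of the second operand
            if nxt == (d, p + 1):
                return cand              # adjacent: this is a match
            # Jump to the earliest candidate that could directly precede nxt.
            cand = self.first.skip(nxt[0], nxt[1] - 1)
        return None


def all_matches(iterator):
    """Enumerate every match of an iterator, in (doc, pos) order."""
    matches = []
    m = iterator.skip(0, 0)
    while m is not None:
        matches.append(m)
        m = iterator.skip(m[0], m[1] + 1)
    return matches


quick = TermIterator([(1, 2), (2, 5)])
brown = TermIterator([(1, 3), (2, 9)])
fox = TermIterator([(1, 4)])

# Operators share the interface of the leaves, so they nest into expression trees.
phrase = SequenceIterator(quick, SequenceIterator(brown, fox))
print(all_matches(phrase))
```

Because a SequenceIterator offers the same skip interface as a TermIterator, the phrase "quick brown fox" is simply a tree of two binary operators. Other operators (union, intersection, within a range) would follow the same pattern.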
Information extraction
In Strus the result of the set operations needed to model expressions is implemented as an iterator. The advantage of this model is not only that you do not need memory for storing intermediate results. You also do not need to decide what information to collect for an intermediate result. As a consequence, all information about a match is available at the moment you inspect the match. A weighting function or a summarizer can access the positions of all subexpression matches without expensive introspection. This enables summarization to collect information that is part of a match or close to a match. Hence you can implement powerful information extraction for feature selection or feature extraction.
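The following Python sketch illustrates why no intermediate results need to be stored. Each iterator remembers only its current position, and a summarizer reads the positions of named subexpressions at the moment a match is inspected. All names here (Term, Within, summarize, the current attribute, the variable names "who" and "where") are hypothetical and serve only as an illustration of the principle, not as the Strus API.

```python
from bisect import bisect_left


class Term:
    """Leaf iterator that remembers its current (doc, pos) occurrence."""

    def __init__(self, postings):
        self.postings = sorted(postings)
        self.current = None   # readable by a summarizer after a match

    def skip(self, doc, pos):
        i = bisect_left(self.postings, (doc, pos))
        self.current = self.postings[i] if i < len(self.postings) else None
        return self.current


class Within:
    """Matches where `second` follows `first` by at most `dist` positions
    inside the same document."""

    def __init__(self, first, second, dist):
        self.first, self.second, self.dist = first, second, dist

    def skip(self, doc, pos):
        a = self.first.skip(doc, pos)
        while a is not None:
            b = self.second.skip(a[0], a[1] + 1)
            if b is None:
                return None
            if b[0] == a[0] and b[1] - a[1] <= self.dist:
                return a      # both children now rest on the matching positions
            # Move `first` to the earliest position that can still reach b.
            a = self.first.skip(b[0], b[1] - self.dist)
        return None


def summarize(expression, variables):
    """For every match, record the current position of each named subexpression."""
    results = []
    m = expression.skip(0, 0)
    while m is not None:
        results.append({name: node.current for name, node in variables.items()})
        m = expression.skip(m[0], m[1] + 1)
    return results


person = Term([(1, 3), (1, 20), (2, 1)])
city = Term([(1, 5), (2, 10)])
expression = Within(person, city, dist=3)

print(summarize(expression, {"who": person, "where": city}))
```

The summarizer never asks the expression to explain how it matched. It only looks at the state the subexpression iterators are already in, which is the information a weighting function or an extraction step needs.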