This document is the output of strusHelp --html -m analyzer_pattern -m storage_vector_std
Strus built-in functions
Query Processor
List of functions and operators predefined in the storage query processor
Posting join operator
List of posting join operators
- chain Get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| <= |rj| for i < j
- chain_struct Get the set of postings (d,p) that exist in the second argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| <= |rj| for i < j, i,j >= 2. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings
- contains Get the set of postings (d,1) for documents d that contain all of the argument features
- diff Get the set of postings (d,p) that are in the first argument set but not in the second
- inrange Get the set of postings (d,p) that exist in any argument set and (d,p+r) exist in all other argument sets with |r| <= |range|
- inrange_struct Get the set of postings (d,p) that exist in any argument set and (d,p+r) exist in all other argument sets with |r| <= |range|. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
- intersect Get the set of postings (d,p) that are occurring in all argument sets
- pred Get the set of postings (d,p-1) for all (d,p) with p>1 in the argument set
- sequence Get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| < |rj| for i < j
- sequence_imm Get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri|+1 == |rj| for j == i+1
- sequence_struct Get the set of postings (d,p) that exist in the second argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| < |rj| for i < j, i,j >= 2. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings
- succ Get the set of postings (d,p+1) for all (d,p) in the argument set
- union Get the set of postings that are occurring in any argument set
- within Get the set of postings (d,p) that exist in any argument set and distinct (d,p+r) exist in all other argument sets with |r| <= |range|
- within_struct Get the set of postings (d,p) that exist in any argument set and distinct (d,p+r) exist in all other argument sets with |r| <= |range|. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
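As an informal illustration of these set semantics (a Python sketch with hypothetical names and signatures, not the strus API), a posting set can be modeled as a set of (document, position) pairs:

```python
# Illustrative model of posting join semantics: a posting set is a set of
# (document, position) pairs. Names and signatures are hypothetical,
# not the strus API.

def intersect(*sets):
    """Postings (d,p) occurring in all argument sets."""
    result = set(sets[0])
    for s in sets[1:]:
        result &= set(s)
    return result

def union(*sets):
    """Postings occurring in any argument set."""
    result = set()
    for s in sets:
        result |= set(s)
    return result

def diff(first, second):
    """Postings in the first argument set but not in the second."""
    return set(first) - set(second)

def sequence(range_, first, *rest):
    """Postings (d,p) in `first` such that each further argument set has a
    match at (d, p+ri) with strictly increasing offsets 0 < r1 < ... <= range_."""
    result = set()
    for d, p in first:
        r_prev, ok = 0, True
        for s in rest:
            offsets = [p2 - p for d2, p2 in s
                       if d2 == d and r_prev < p2 - p <= range_]
            if not offsets:
                ok = False
                break
            r_prev = min(offsets)  # greedily take the nearest admissible match
        if ok:
            result.add((d, p))
    return result
```

For example, sequence(5, {(1, 1)}, {(1, 3)}, {(1, 4)}) keeps (1, 1), because the offsets 2 and 3 are strictly increasing and within the range of 5.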
Weighting function
List of query evaluation weighting functions
- bm25 Calculate the document weight with the weighting scheme "BM25"
List of parameters
- match [Feature] defines the query features to weight
- k1 [Numeric] (1:1000) parameter of the BM25 weighting scheme
- b [Numeric] (0.0001:1000) parameter of the BM25 weighting scheme
- avgdoclen [Numeric] (0:) the average document length
- metadata_doclen [Metadata] the meta data element name referencing the document length for each document weighted
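For orientation, the textbook BM25 term weight that these parameters configure can be sketched as follows (a sketch only; strus's exact idf smoothing may differ, and the default k1/b values below are common textbook choices, not strus defaults):

```python
import math

def bm25_term_weight(ff, df, N, doclen, avgdoclen, k1=1.5, b=0.75):
    """Textbook BM25 weight of one query term in one document.
    ff: feature frequency in the document, df: document frequency,
    N: collection size, doclen/avgdoclen: document length normalization."""
    idf = math.log(1.0 + (N - df + 0.5) / (df + 0.5))
    tf_norm = ff * (k1 + 1.0) / (ff + k1 * (1.0 - b + b * doclen / avgdoclen))
    return idf * tf_norm
```

The document weight is the sum of this term weight over all 'match' features.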
- bm25pff Calculate the document weight with the weighting scheme "BM25pff". This is "BM25" where the feature frequency is counted by 1.0 per feature only for features with the maximum proximity score. The proximity score is a measure that takes the proximity of other query features into account
List of parameters
- match [Feature] defines the query features to weight
- struct [Feature] defines the delimiter for structures
- para [Feature] defines the delimiter for paragraphs (windows used for proximity weighting must not overlap paragraph borders)
- title [Feature] defines the title field (used for weighting increment of features appearing in title)
- k1 [Numeric] (1:1000) parameter of the BM25pff weighting scheme
- b [Numeric] (0.0001:1000) parameter of the BM25pff weighting scheme
- titleinc [Numeric] (0.0:) ff increment for title features
- cprop [Numeric] (0.0:1.0) constant part of idf proportional feature weight
- paragraphsize [Numeric] the estimated size of a paragraph
- sentencesize [Numeric] the estimated size of a sentence
- windowsize [Numeric] the size of the window used for finding features to increment proximity scores
- cardinality [Numeric] the number of query features a proximity score window must contain to be considered (optional, default is all features, percentage of input features specified with '%' suffix)
- ffbase [Numeric] (0.0:1.0) value in the range from 0.0 to 1.0 specifying the percentage of the constant score on the proximity ff for every feature occurrence. (with 1.0 the scheme is plain BM25)
- avgdoclen [Numeric] (0:) the average document length
- maxdf [Numeric] (0:) the maximum df as fraction of the collection size
- metadata_doclen [Metadata] the meta data element name referencing the document length for each document weighted
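The interplay of 'windowsize' and 'cardinality' can be illustrated with a small sketch (a hypothetical helper, not the strus implementation): find the positions where at least `cardinality` distinct query features occur within a span of `windowsize` ordinal positions.

```python
def proximity_windows(positions_by_feature, windowsize, cardinality):
    """Return the start positions of windows of size `windowsize` that
    contain at least `cardinality` distinct query features.
    positions_by_feature: dict mapping feature name -> list of positions."""
    events = sorted((p, f) for f, ps in positions_by_feature.items() for p in ps)
    starts = []
    for p, _f in events:
        feats = {f for q, f in events if p <= q < p + windowsize}
        if len(feats) >= cardinality:
            starts.append(p)
    return starts
```

Only feature occurrences inside such windows would get the full proximity-based ff increment; all others contribute just the constant 'ffbase' part.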
- constant Calculate the weight of a document as sum of the feature weights of the occurring features
List of parameters
- match [Feature] defines the query features to weight
- metadata Calculate the weight of a document as value of a meta data element.
List of parameters
- name [Metadata] name of the meta data element to use as weight
- scalar Calculate the document weight with a weighting scheme defined by a scalar function on metadata elements, constants and variables as arguments.
List of parameters
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides N (collection size). The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
- smart Calculate the document weight with a weighting scheme given by a scalar function defined as expression with ff (feature frequency), df (document frequency), N (total number of documents in the collection) and some specified metadata elements as arguments. The name of this method has been inspired by the traditional SMART weighting schemes in IR
List of parameters
- match [Feature] defines the query features to weight
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides ff, df, qf and N. The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
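As one concrete instance of such an expression, the classic tf-idf scheme ff * log(N / df) could be used; evaluated in Python for illustration (the actual expression is whatever string is passed as the 'function' parameter, and its syntax is defined by strus, not by this sketch):

```python
import math

def smart_like_weight(ff, df, N):
    """Evaluate the example expression ff * log(N / df) for one feature.
    Illustrative only; strus parses the expression from the 'function'
    parameter string and sums the result over all 'match' features."""
    return ff * math.log(N / df)
```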
- tf Calculate the weight of a document as sum of the feature frequency of a feature multiplied with the feature weight
List of parameters
- match [Feature] defines the query features to weight
- weight [Numeric] (0:) defines the query feature weight factor
Summarizer
List of summarization functions for the presentation of a query evaluation result
- accunear Extract and weight all elements in the forward index of a given type that are within a window containing the specified features.
List of parameters
- match [Feature] defines the query features to inspect for near matches
- struct [Feature] defines a structural delimiter for interaction of features on the same result
- type [String] the forward index feature type for the content to extract
- result [String] the name of the result if not equal to type
- cofactor [Numeric] multiplication factor for features pointing to the same result
- norm [Numeric] normalization factor for end result weights
- nofranks [Numeric] maximum number of ranks per document
- cardinality [Numeric] minimum number of features per weighted item
- range [Numeric] maximum distance (ordinal position) of the weighted features (window size)
- cprop [Numeric] (0.0:1.0) constant part of idf proportional feature weight
- accuvar Accumulate the weights of all contents of a variable in matching expressions. Weights with the same position are grouped and multiplied; the group results are added up to form the total weight assigned to the variable content.
List of parameters
- match [Feature] defines the query features to inspect for variable matches
- type [String] the forward index feature type for the content to extract
- var [String] the name of the variable referencing the content to weight
- nof [Numeric] (1:) the maximum number of the best weighted elements to return (default 10)
- norm [Numeric] (0.0:1.0) the normalization factor of the calculated weights (default 1.0)
- cofactor [Numeric] (0.0:) additional multiplier for coincident matches (default 1.0)
- attribute Get the value of a document attribute.
List of parameters
- name [Attribute] the name of the attribute to get
- forwardindex Get the complete forward index
List of parameters
- type [String] the forward index type to fetch the summary elements
- name [String] the name of the result attribute (default is the value of 'type')
- N [Numeric] (1:) the maximum number of matches to return
- matchphrase Get best matching phrases delimited by the structure postings
List of parameters
- match [Feature] defines the features to weight
- struct [Feature] defines the delimiter for structures
- para [Feature] defines the delimiter for paragraphs (summaries must not overlap paragraph borders)
- title [Feature] defines the title field of documents
- type [String] the forward index type of the result phrase elements
- paragraphsize [Numeric] (1:) estimated size of a paragraph
- sentencesize [Numeric] (1:) estimated size of a sentence, also a restriction for the maximum length of sentences in summaries
- windowsize [Numeric] (1:) maximum size of window used for identifying matches
- cardinality [Numeric] (1:) minimum number of features in a window
- maxdf [Numeric] (1:) the maximum df (fraction of collection size) of features considered for same sentence proximity weighting
- matchmark [String] specifies the markers for highlighting matches in the resulting phrases (the first character of the value is the separator, followed by the two parts it separates)
- floatingmark [String] specifies the markers for marking floating phrases where no start or end of sentence was found (the first character of the value is the separator, followed by the two parts it separates)
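The marker value format shared by 'matchmark' and 'floatingmark' (first character is the separator, the remainder holds the two parts) can be parsed like this (a sketch; the example value is hypothetical, not taken from the source):

```python
def parse_marker(spec):
    """Split a marker spec: the first character is the separator and the
    remainder contains the two marker parts joined by it."""
    sep, rest = spec[0], spec[1:]
    open_mark, close_mark = rest.split(sep, 1)
    return open_mark, close_mark
```

For example, a value like "#&lt;b&gt;#&lt;/b&gt;" would yield the pair of an opening and a closing highlight tag.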
- matchpos Get the feature occurrences printed
List of parameters
- match [Feature] defines the query features
- N [Numeric] (1:) the maximum number of matches to return
- matchvar Extract all variables assigned to subexpressions of the specified features.
List of parameters
- match [Feature] defines the query features to inspect for variable matches
- type [String] the forward index feature type for the content to extract
- metadata Get the value of a document meta data element.
List of parameters
- name [Metadata] the name of the meta data element to get
- scalar summarizer derived from the weighting function scalar: Calculate the document weight with a weighting scheme defined by a scalar function on metadata elements, constants and variables as arguments.
List of parameters
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides N (collection size). The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
- smart summarizer derived from the weighting function smart: Calculate the document weight with a weighting scheme given by a scalar function defined as expression with ff (feature frequency), df (document frequency), N (total number of documents in the collection) and some specified metadata elements as arguments. The name of this method has been inspired by the traditional SMART weighting schemes in IR
List of parameters
- match [Feature] defines the query features to weight
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides ff, df, qf and N. The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
Analyzer
List of functions and operators predefined in the analyzer text processor
Segmenter
List of segmenters
- cjson Segmenter for JSON (application/json) based on the cjson library for parsing json and textwolf for the xpath automaton
- plain Segmenter for plain text (in one segment)
- textwolf Segmenter for XML (application/xml) based on the textwolf library
- tsv Segmenter for TSV (text/tab-separated-values)
Tokenizer
List of functions for tokenization
- content Tokenizer producing one token for each input chunk (identity)
- punctuation Tokenizer producing punctuation elements (end of sentence recognition). The language is specified as parameter (currently only german 'de' and english 'en' supported)
- regex Tokenizer selecting tokens from source that are matching a regular expression.
- split Tokenizer splitting tokens separated by whitespace characters
- textcat Tokenizer splitting tokens by recognized language
- word Tokenizer splitting tokens by word boundaries
Normalizer
List of functions for token normalization
- charselect Normalizer mapping all alpha characters to identity and all other characters to nothing. The language set is passed as the first argument (currently only european 'eu' and ASCII 'ascii' supported).
- const Normalizer mapping input tokens to a constant string
- convdia Normalizer mapping all diacritical characters to ascii. The language is passed as the first argument (currently only german 'de' and english 'en' supported).
- date2int Normalizer mapping a date to an integer: returns the difference between a date time value and a constant base date time value (e.g. '1970-01-01') as an integer. The first parameter specifies the unit of the result and the constant base date time value. The unit is specified as a string with the granularity (one of 'us'=microseconds, 'ms'=milliseconds, 's'=seconds, 'm'=minutes, 'h'=hours, 'd'=days), optionally followed by the base date time value. If the base date time value is not specified, "1970-01-01" is assumed. Alternative date formats can be passed as the following arguments.
- dictmap Normalizer mapping the elements with a dictionary. For elements found in the dictionary, the mapped value is returned. The dictionary file name is passed as an argument
- empty Normalizer mapping input tokens to an empty string
- lc Normalizer mapping all characters to lowercase.
- ngram Normalizer producing ngrams.
- orig Normalizer mapping the identity of the input tokens
- regex Normalizer that does a regular expression match with the first argument and a replace with the format string defined in the second argument.
- stem Normalizer doing stemming based on snowball. The language is passed as parameter
- text Normalizer mapping the identity of the input tokens
- uc Normalizer mapping all characters to uppercase.
- wordjoin Normalizer producing joined words.
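The date2int mapping described above can be sketched in Python (an illustrative function, not the strus implementation; only the date part of the format is handled here):

```python
from datetime import datetime

# seconds per granularity unit, per the codes listed for date2int above
UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1, "m": 60, "h": 3600, "d": 86400}

def date2int(value, unit="d", base="1970-01-01", fmt="%Y-%m-%d"):
    """Difference between `value` and the base date, expressed in `unit`."""
    delta = datetime.strptime(value, fmt) - datetime.strptime(base, fmt)
    return int(delta.total_seconds() / UNIT_SECONDS[unit])
```

For example, date2int("1970-01-02") gives 1 with the default granularity 'd', and 24 with granularity 'h'.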
Aggregator
List of functions for aggregating values after document analysis, e.g. counting of words
- count Aggregator counting the input elements
- maxpos Aggregator getting the maximum position of the input elements
- minpos Aggregator getting the minimum position of the input elements
- nextpos Aggregator getting the position following the maximum position of the input elements (maximum position + 1)
- sumsquaretf Aggregator calculating the sum of the squares of the tf of all selected elements
- typeset Aggregator building a set of the feature types that exist in the document (represented as a bit-field)
- valueset Aggregator building a set of the feature values that exist in the document (represented as a bit-field)
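The bit-field representation used by typeset and valueset can be illustrated with a small sketch (an illustrative encoding, not necessarily the one strus uses):

```python
def typeset_bits(doc_terms, known_types):
    """Encode the set of feature types present in a document as a bit-field:
    bit i is set when the i-th known type occurs in the document.
    doc_terms: list of (type, value) pairs from document analysis."""
    bits = 0
    for i, t in enumerate(known_types):
        if any(term_type == t for term_type, _value in doc_terms):
            bits |= 1 << i
    return bits
```

Such a bit-field can be stored in a metadata element and matched cheaply with bitwise restrictions at query time.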
PatternLexer
List of lexers for pattern matching
- std Pattern lexer based on the Intel hyperscan library
PatternMatcher
List of modules for pattern matching
- std Pattern matcher based on an event-driven automaton