This document is the output of strusHelp --html -m analyzer_pattern -m storage_vector_std
Strus built-in functions
Query Processor
List of functions and operators predefined in the storage query processor
Posting join operator
List of posting join operators
- chain Get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| <= |rj| for i < j
- chain_struct Get the set of postings (d,p) that exist in the second argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| <= |rj| for i < j, i,j >= 2. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings
- contains Get the set of postings (d,1) for documents d that contain all of the argument features
- diff Get the set of postings (d,p) that are in the first argument set but not in the second
- inrange Get the set of postings (d,p) that exist in any argument set and (d,p+r) exist in all other argument sets with |r| <= |range|
- inrange_struct Get the set of postings (d,p) that exist in any argument set and (d,p+r) exist in all other argument sets with |r| <= |range|. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
- intersect Get the set of postings (d,p) that are occurring in all argument sets
- pred Get the set of postings (d,p-1) for all (d,p) with p>1 in the argument set
- sequence Get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| < |rj| for i < j
- sequence_imm Get the set of postings (d,p) that exist in the first argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri|+1 == |rj| for j == i+1
- sequence_struct Get the set of postings (d,p) that exist in the second argument set and (d,p+ri) exist in the argument set i with |ri| <= |range| and |ri| < |rj| for i < j, i,j >= 2. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings
- succ Get the set of postings (d,p+1) for all (d,p) in the argument set
- union Get the set of postings that are occurring in any argument set
- within Get the set of postings (d,p) that exist in any argument set and distinct (d,p+r) exist in all other argument sets with |r| <= |range|
- within_struct Get the set of postings (d,p) that exist in any argument set and distinct (d,p+r) exist in all other argument sets with |r| <= |range|. Additionally there must not exist a posting in the first argument set that is overlapped by the interval formed by the other argument postings.
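As an informal illustration of these set semantics (a Python sketch with hypothetical names and signatures, not the strus API), a posting set can be modeled as a set of (document, position) pairs:

```python
# Illustrative model of posting join semantics: a posting set is a set of
# (document, position) pairs. Names and signatures are hypothetical,
# not the strus API.

def intersect(*sets):
    """Postings (d,p) occurring in all argument sets."""
    result = set(sets[0])
    for s in sets[1:]:
        result &= set(s)
    return result

def union(*sets):
    """Postings occurring in any argument set."""
    result = set()
    for s in sets:
        result |= set(s)
    return result

def diff(first, second):
    """Postings in the first argument set but not in the second."""
    return set(first) - set(second)

def sequence(range_, first, *rest):
    """Postings (d,p) in `first` such that each further argument set has a
    match at (d, p+ri) with strictly increasing offsets 0 < r1 < ... <= range_."""
    result = set()
    for d, p in first:
        r_prev, ok = 0, True
        for s in rest:
            offsets = [p2 - p for d2, p2 in s
                       if d2 == d and r_prev < p2 - p <= range_]
            if not offsets:
                ok = False
                break
            r_prev = min(offsets)  # greedily take the nearest admissible match
        if ok:
            result.add((d, p))
    return result
```

For example, sequence(5, {(1, 1)}, {(1, 3)}, {(1, 4)}) keeps (1, 1), because the offsets 2 and 3 are strictly increasing and within the range of 5.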
Weighting function
List of query evaluation weighting functions
- bm25 Calculate the document weight with the weighting scheme "BM25"
List of parameters
- match [Feature] defines the query features to weight
- k1 [Numeric] (1:1000) parameter of the BM25 weighting scheme
- b [Numeric] (0.0001:1000) parameter of the BM25 weighting scheme
- avgdoclen [Numeric] (0:) the average document length
- metadata_doclen [Metadata] the meta data element name referencing the document length for each document weighted
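For orientation, the textbook BM25 term weight that these parameters configure can be sketched as follows (a sketch only; strus's exact idf smoothing may differ, and the default k1/b values below are common textbook choices, not strus defaults):

```python
import math

def bm25_term_weight(ff, df, N, doclen, avgdoclen, k1=1.5, b=0.75):
    """Textbook BM25 weight of one query term in one document.
    ff: feature frequency in the document, df: document frequency,
    N: collection size, doclen/avgdoclen: document length normalization."""
    idf = math.log(1.0 + (N - df + 0.5) / (df + 0.5))
    tf_norm = ff * (k1 + 1.0) / (ff + k1 * (1.0 - b + b * doclen / avgdoclen))
    return idf * tf_norm
```

The document weight is the sum of this term weight over all 'match' features.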
- bm25pff Calculate the document weight with the weighting scheme "BM25pff". This is "BM25" where the feature frequency is counted by 1.0 per feature only for features with the maximum proximity score. The proximity score is a measure that takes the proximity of other query features into account
List of parameters
- match [Feature] defines the query features to weight
- struct [Feature] defines the delimiter for structures
- para [Feature] defines the delimiter for paragraphs (windows used for proximity weighting must not overlap paragraph borders)
- title [Feature] defines the title field (used for weighting increment of features appearing in title)
- k1 [Numeric] (1:1000) parameter of the BM25pff weighting scheme
- b [Numeric] (0.0001:1000) parameter of the BM25pff weighting scheme
- titleinc [Numeric] (0.0:) ff increment for title features
- cprop [Numeric] (0.0:1.0) constant part of idf proportional feature weight
- paragraphsize [Numeric] the estimated size of a paragraph
- sentencesize [Numeric] the estimated size of a sentence
- windowsize [Numeric] the size of the window used for finding features to increment proximity scores
- cardinality [Numeric] the number of query features a proximity score window must contain to be considered (optional, default is all features, percentage of input features specified with '%' suffix)
- ffbase [Numeric] (0.0:1.0) value in the range from 0.0 to 1.0 specifying the percentage of the constant score on the proximity ff for every feature occurrence. (with 1.0 the scheme is plain BM25)
- avgdoclen [Numeric] (0:) the average document length
- maxdf [Numeric] (0:) the maximum df as fraction of the collection size
- metadata_doclen [Metadata] the meta data element name referencing the document length for each document weighted
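The interplay of 'windowsize' and 'cardinality' can be illustrated with a small sketch (a hypothetical helper, not the strus implementation): find the positions where at least `cardinality` distinct query features occur within a span of `windowsize` ordinal positions.

```python
def proximity_windows(positions_by_feature, windowsize, cardinality):
    """Return the start positions of windows of size `windowsize` that
    contain at least `cardinality` distinct query features.
    positions_by_feature: dict mapping feature name -> list of positions."""
    events = sorted((p, f) for f, ps in positions_by_feature.items() for p in ps)
    starts = []
    for p, _f in events:
        feats = {f for q, f in events if p <= q < p + windowsize}
        if len(feats) >= cardinality:
            starts.append(p)
    return starts
```

Only feature occurrences inside such windows would get the full proximity-based ff increment; all others contribute just the constant 'ffbase' part.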
- constant Calculate the weight of a document as sum of the feature weights of the occurring features
List of parameters
- match [Feature] defines the query features to weight
- metadata Calculate the weight of a document as value of a meta data element.
List of parameters
- name [Metadata] name of the meta data element to use as weight
- scalar Calculate the document weight with a weighting scheme defined by a scalar function on metadata elements, constants and variables as arguments.
List of parameters
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides N (collection size). The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
- smart Calculate the document weight with a weighting scheme given by a scalar function defined as expression with ff (feature frequency), df (document frequency), N (total number of documents in the collection) and some specified metadata elements as arguments. The name of this method has been inspired by the traditional SMART weighting schemes in IR
List of parameters
- match [Feature] defines the query features to weight
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides ff, df, qf and N. The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
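As one concrete instance of such an expression, the classic tf-idf scheme ff * log(N / df) could be used; evaluated in Python for illustration (the actual expression is whatever string is passed as the 'function' parameter, and its syntax is defined by strus, not by this sketch):

```python
import math

def smart_like_weight(ff, df, N):
    """Evaluate the example expression ff * log(N / df) for one feature.
    Illustrative only; strus parses the expression from the 'function'
    parameter string and sums the result over all 'match' features."""
    return ff * math.log(N / df)
```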
- tf Calculate the weight of a document as sum of the feature frequency of a feature multiplied with the feature weight
List of parameters
- match [Feature] defines the query features to weight
- weight [Numeric] (0:) defines the query feature weight factor
Summarizer
List of summarization functions for the presentation of a query evaluation result
- accunear Extract and weight all elements in the forward index of a given type that are within a window containing the specified features.
List of parameters
- match [Feature] defines the query features to inspect for near matches
- struct [Feature] defines a structural delimiter for interaction of features on the same result
- type [String] the forward index feature type for the content to extract
- result [String] the name of the result if not equal to type
- cofactor [Numeric] multiplication factor for features pointing to the same result
- norm [Numeric] normalization factor for end result weights
- nofranks [Numeric] maximum number of ranks per document
- cardinality [Numeric] minimum number of features per weighted item
- range [Numeric] maximum distance (ordinal position) of the weighted features (window size)
- cprop [Numeric] (0.0:1.0) constant part of idf proportional feature weight
- accuvar Accumulate the weights of all contents of a variable in matching expressions. Weights with the same position are grouped and multiplied; the group results are added up to form the total weight assigned to the variable content.
List of parameters
- match [Feature] defines the query features to inspect for variable matches
- type [String] the forward index feature type for the content to extract
- var [String] the name of the variable referencing the content to weight
- nof [Numeric] (1:) the maximum number of the best weighted elements to return (default 10)
- norm [Numeric] (0.0:1.0) the normalization factor of the calculated weights (default 1.0)
- cofactor [Numeric] (0.0:) additional multiplier for coincident matches (default 1.0)
- attribute Get the value of a document attribute.
List of parameters
- name [Attribute] the name of the attribute to get
- forwardindex Get the complete forward index
List of parameters
- type [String] the forward index type to fetch the summary elements
- name [String] the name of the result attribute (default is the value of 'type')
- N [Numeric] (1:) the maximum number of matches to return
- matchphrase Get best matching phrases delimited by the structure postings
List of parameters
- match [Feature] defines the features to weight
- struct [Feature] defines the delimiter for structures
- para [Feature] defines the delimiter for paragraphs (summaries must not overlap paragraph borders)
- title [Feature] defines the title field of documents
- type [String] the forward index type of the result phrase elements
- paragraphsize [Numeric] (1:) estimated size of a paragraph
- sentencesize [Numeric] (1:) estimated size of a sentence, also a restriction for the maximum length of sentences in summaries
- windowsize [Numeric] (1:) maximum size of window used for identifying matches
- cardinality [Numeric] (1:) minimum number of features in a window
- maxdf [Numeric] (1:) the maximum df (fraction of collection size) of features considered for same sentence proximity weighting
- matchmark [String] specifies the markers for highlighting matches in the resulting phrases (the first character of the value is the separator, followed by the two parts it separates)
- floatingmark [String] specifies the markers for marking floating phrases where no start or end of sentence was found (the first character of the value is the separator, followed by the two parts it separates)
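The marker value format shared by 'matchmark' and 'floatingmark' (first character is the separator, the remainder holds the two parts) can be parsed like this (a sketch; the example value is hypothetical, not taken from the source):

```python
def parse_marker(spec):
    """Split a marker spec: the first character is the separator and the
    remainder contains the two marker parts joined by it."""
    sep, rest = spec[0], spec[1:]
    open_mark, close_mark = rest.split(sep, 1)
    return open_mark, close_mark
```

For example, a value like "#&lt;b&gt;#&lt;/b&gt;" would yield the pair of an opening and a closing highlight tag.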
- matchpos Get the feature occurrences printed
List of parameters
- match [Feature] defines the query features
- N [Numeric] (1:) the maximum number of matches to return
- matchvar Extract all variables assigned to subexpressions of the specified features.
List of parameters
- match [Feature] defines the query features to inspect for variable matches
- type [String] the forward index feature type for the content to extract
- metadata Get the value of a document meta data element.
List of parameters
- name [Metadata] the name of the meta data element to get
- scalar summarizer derived from the weighting function scalar: Calculate the document weight with a weighting scheme defined by a scalar function on metadata elements, constants and variables as arguments.
List of parameters
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides N (collection size). The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
- smart summarizer derived from the weighting function smart: Calculate the document weight with a weighting scheme given by a scalar function defined as expression with ff (feature frequency), df (document frequency), N (total number of documents in the collection) and some specified metadata elements as arguments. The name of this method has been inspired by the traditional SMART weighting schemes in IR
List of parameters
- match [Feature] defines the query features to weight
- function [String] defines the expression of the scalar function to execute
- metadata [Metadata] defines a meta data element as additional parameter of the function besides ff, df, qf and N. The parameter is addressed by the name of the metadata element in the expression
- [a-z]+ [Numeric] defines a variable value to be substituted in the scalar function expression
Analyzer
List of functions and operators predefined in the analyzer text processor
Segmenter
List of segmenters
- cjson Segmenter for JSON (application/json) based on the cjson library for parsing json and textwolf for the xpath automaton
- plain Segmenter for plain text (in one segment)
- textwolf Segmenter for XML (application/xml) based on the textwolf library
- tsv Segmenter for TSV (text/tab-separated-values)
Tokenizer
List of functions for tokenization
- content Tokenizer producing one token for each input chunk (identity)
- punctuation Tokenizer producing punctuation elements (end of sentence recognition). The language is specified as parameter (currently only german 'de' and english 'en' supported)
- regex Tokenizer selecting tokens from source that are matching a regular expression.
- split Tokenizer splitting tokens separated by whitespace characters
- textcat Tokenizer splitting tokens by recognized language
- word Tokenizer splitting tokens by word boundaries
Normalizer
List of functions for token normalization
- charselect Normalizer mapping all alpha characters to identity and all other characters to nothing. The language set is passed as the first argument (currently only european 'eu' and ASCII 'ascii' supported).
- const Normalizer mapping input tokens to a constant string
- convdia Normalizer mapping all diacritical characters to ascii. The language is passed as the first argument (currently only german 'de' and english 'en' supported).
- date2int Normalizer mapping a date to an integer: returns the difference between a date time value and a constant base date time value (e.g. '1970-01-01') as an integer. The first parameter specifies the unit of the result and the constant base date time value. The unit is specified as a string with the granularity (one of 'us'=microseconds, 'ms'=milliseconds, 's'=seconds, 'm'=minutes, 'h'=hours, 'd'=days), optionally followed by the base date time value. If the base date time value is not specified, "1970-01-01" is assumed. Alternative date formats can be passed as the following arguments.
- dictmap Normalizer mapping the elements with a dictionary. For elements found in the dictionary, the mapped value is returned. The dictionary file name is passed as an argument
- empty Normalizer mapping input tokens to an empty string
- lc Normalizer mapping all characters to lowercase.
- ngram Normalizer producing ngrams.
- orig Normalizer mapping the identity of the input tokens
- regex Normalizer that does a regular expression match with the first argument and a replace with the format string defined in the second argument.
- stem Normalizer doing stemming based on snowball. The language is passed as parameter
- text Normalizer mapping the identity of the input tokens
- uc Normalizer mapping all characters to uppercase.
- wordjoin Normalizer producing joined words.
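The date2int mapping described above can be sketched in Python (an illustrative function, not the strus implementation; only the date part of the format is handled here):

```python
from datetime import datetime

# seconds per granularity unit, per the codes listed for date2int above
UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1, "m": 60, "h": 3600, "d": 86400}

def date2int(value, unit="d", base="1970-01-01", fmt="%Y-%m-%d"):
    """Difference between `value` and the base date, expressed in `unit`."""
    delta = datetime.strptime(value, fmt) - datetime.strptime(base, fmt)
    return int(delta.total_seconds() / UNIT_SECONDS[unit])
```

For example, date2int("1970-01-02") gives 1 with the default granularity 'd', and 24 with granularity 'h'.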
Aggregator
List of functions for aggregating values after document analysis, e.g. counting of words
- count Aggregator counting the input elements
- maxpos Aggregator getting the maximum position of the input elements
- minpos Aggregator getting the minimum position of the input elements
- nextpos Aggregator getting the position following the maximum position of the input elements (maximum position + 1)
- sumsquaretf Aggregator calculating the sum of the squares of the tf of all selected elements
- typeset Aggregator building a set of the feature types that exist in the document (represented as a bit-field)
- valueset Aggregator building a set of the feature values that exist in the document (represented as a bit-field)
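The bit-field representation used by typeset and valueset can be illustrated with a small sketch (an illustrative encoding, not necessarily the one strus uses):

```python
def typeset_bits(doc_terms, known_types):
    """Encode the set of feature types present in a document as a bit-field:
    bit i is set when the i-th known type occurs in the document.
    doc_terms: list of (type, value) pairs from document analysis."""
    bits = 0
    for i, t in enumerate(known_types):
        if any(term_type == t for term_type, _value in doc_terms):
            bits |= 1 << i
    return bits
```

Such a bit-field can be stored in a metadata element and matched cheaply with bitwise restrictions at query time.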
PatternLexer
List of lexers for pattern matching
- std Pattern lexer based on the Intel hyperscan library
PatternMatcher
List of modules for pattern matching
- std Pattern matcher based on an event-driven automaton