Rowles.LeanCorpus.Analysis.Filters
Classes
AccentFoldingFilter
Normalises accented characters to their ASCII base form (e.g., é→e, ñ→n, ü→u) for language-neutral matching. Applies Unicode canonical decomposition, then strips the combining marks.
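The decompose-then-strip technique can be sketched in a few lines of Python. This illustrates the general approach only and is not this library's API; the name `fold_accents` is hypothetical, and treating every character in general category M as a combining mark is an assumption.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Hypothetical helper: fold accented characters to their base form."""
    # Step 1: canonical decomposition (NFD) splits "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every mark character (categories Mn, Mc, Me).
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
```

Characters with no canonical decomposition (for example ø or ß) pass through this sketch unchanged; whether the filter maps them separately is not stated here.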
DecimalDigitFilter
Normalises Unicode decimal digits to ASCII digits.
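As a rough Python illustration of digit normalisation (not this library's implementation; `normalise_digits` is a hypothetical name), each character that carries a Unicode decimal value is replaced by the matching ASCII digit:

```python
import unicodedata


def normalise_digits(text: str) -> str:
    """Hypothetical helper: map Unicode decimal digits to ASCII 0-9."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the digit value, or the default
        # (None here) when the character is not a decimal digit.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```

For instance, the Arabic-Indic digits in `"\u0663\u0664\u0665"` come out as `"345"`, while letters and punctuation are left alone.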
ElisionFilter
Removes configured elided articles before straight or curly apostrophes.
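A minimal Python sketch of elision removal follows. It is an illustration under assumptions, not this library's behaviour: the article set, the case-insensitive comparison, and the name `strip_elision` are all hypothetical.

```python
# Straight and curly apostrophes, as named in the description above.
APOSTROPHES = ("'", "\u2019")

# Hypothetical default article set (French-style elisions).
DEFAULT_ARTICLES = frozenset({"l", "d", "j", "qu"})


def strip_elision(token: str, articles=DEFAULT_ARTICLES) -> str:
    """Drop a configured article and its apostrophe from the token start."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            prefix = token[:i].lower()  # assumption: match ignores case
            if prefix in articles and i + 1 < len(token):
                return token[i + 1:]
            return token  # only the first apostrophe is considered
    return token
```

With these assumptions, `l'avion` becomes `avion`, while `aujourd'hui` is untouched because `aujourd` is not a configured article.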
HtmlStripCharFilter
Strips HTML/XML tags from input text, leaving only the text content.
KeywordMarkerFilter
Identifies tokens that should be treated as keywords by compatible analysers.
LengthFilter
Removes tokens whose text length falls outside an inclusive range.
LowercaseFilter
Performs an in-place lowercase transformation on tokens or a character buffer.
MappingCharFilter
Maps specific characters or strings to replacements using a lookup table. Useful for normalising special characters (e.g., smart quotes → straight quotes).
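The lookup-table idea can be sketched in Python as below. This is an illustration only, not this library's API: `apply_mapping` is a hypothetical name, and preferring the longest matching key at each position is an assumed design choice.

```python
def apply_mapping(text: str, mapping: dict) -> str:
    """Hypothetical helper: replace every mapped key found in text."""
    # Try longer keys first so "ab" wins over "a" at the same position.
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Smart quotes to straight quotes, as in the example above.
SMART_QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
```

The linear scan over keys keeps the sketch short; a trie or dictionary keyed by first character would scale better for large tables.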
PatternReplaceCharFilter
Replaces text matching a regex pattern with a replacement string.
PorterStemmerFilter
Implements the Porter stemming algorithm (Porter, 1980) for English as an ITokenFilter. Operates on tokens in place, replacing each token's text with its stemmed form.
ReverseStringFilter
Reverses the characters in each token.
ShingleFilter
Emits contiguous token shingles for phrase-oriented analysis.
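To make "contiguous token shingles" concrete, here is a small Python sketch. It is not this library's API: the name `shingles`, the size parameters, the separator, and the choice not to emit single tokens are all assumptions.

```python
def shingles(tokens, min_size=2, max_size=2, separator=" "):
    """Hypothetical helper: join each run of adjacent tokens into a shingle."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # not enough tokens left for this size or larger
            out.append(separator.join(tokens[start:end]))
    return out
```

For `["the", "quick", "brown", "fox"]` with the default size of 2, this yields `the quick`, `quick brown`, and `brown fox`.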
StopWordFilter
Removes common English stop words from a token list using a frozen set for fast, allocation-free lookups.
SynonymGraphFilter
Token filter that supports multi-token synonym expansion using a trie-based SynonymMap. Uses longest-match lookahead for multi-word synonyms and inserts replacement tokens at the same position offsets.
SynonymMap
Trie-based synonym map supporting multi-token source phrases. Used by SynonymGraphFilter for longest-match multi-token synonym expansion.
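The interplay between a trie-based map and longest-match lookahead can be sketched in Python as follows. This is an illustration under assumptions, not the actual SynonymMap or SynonymGraphFilter API: the names `SynonymTrie` and `expand` are hypothetical, and keeping the original tokens alongside the synonyms is an assumed policy.

```python
class SynonymTrie:
    """Hypothetical trie keyed by token; None marks the end of a phrase."""

    def __init__(self):
        self.root = {}

    def add(self, source_tokens, synonyms):
        node = self.root
        for token in source_tokens:
            node = node.setdefault(token, {})
        node.setdefault(None, []).extend(synonyms)

    def longest_match(self, tokens, start):
        """Return (matched length, synonyms) for the longest phrase at start."""
        node = self.root
        best = (0, [])
        for i in range(start, len(tokens)):
            node = node.get(tokens[i])
            if node is None:
                break
            if None in node:
                best = (i - start + 1, node[None])
        return best


def expand(tokens, trie):
    """Emit (position, text) pairs; synonyms share the phrase's start position."""
    out = []
    i = 0
    while i < len(tokens):
        length, synonyms = trie.longest_match(tokens, i)
        if length == 0:
            out.append((i, tokens[i]))
            i += 1
            continue
        for offset in range(length):  # assumption: originals are kept
            out.append((i + offset, tokens[i + offset]))
        for synonym in synonyms:
            out.append((i, synonym))
        i += length
    return out
```

If both `new york` and `new york city` are registered, the lookahead walks the trie as far as it can and picks the three-token phrase, so only that phrase's synonyms are emitted.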
TruncateTokenFilter
Truncates token text to a maximum character length.
UniqueTokenFilter
Removes duplicate tokens while preserving the first occurrence.
WordDelimiterFilter
Splits compound tokens on delimiters, case changes, and letter-number boundaries.
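The three split rules can be approximated with a single regular expression, shown here as a Python sketch. It is an illustration only and not this library's implementation: `split_word` is a hypothetical name, the pattern covers ASCII letters and digits only, and treating a run of capitals as one part is an assumption.

```python
import re

# Alternatives, tried in order at each position:
#   1. a run of capitals not followed by a lowercase letter ("XML" in "XMLParser")
#   2. an optional capital followed by lowercase letters ("Power", "shot")
#   3. a run of digits ("500")
# Anything else (hyphens, underscores, and so on) acts as a delimiter.
_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_word(token: str) -> list:
    """Hypothetical helper: split on delimiters, case changes, digit boundaries."""
    return _PARTS.findall(token)
```

Under these assumptions, `PowerShot500` splits into `Power`, `Shot`, and `500`, and `wi-fi` into `wi` and `fi`.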
Interfaces
ICharFilter
Interface for character-level filters that transform raw text before tokenisation. Char filters run before the tokeniser, operating on the entire input string.
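The ordering described above (char filters first, on the whole string, then the tokeniser) can be sketched in Python. This is not the ICharFilter contract itself, whose members are not shown on this page: `analyse` and `strip_tags` are hypothetical names, and the regex-based tag stripper is a deliberately naive stand-in.

```python
import re
from typing import Callable, List, Sequence

# Assumption: a char filter is modelled as a plain str -> str function.
CharFilter = Callable[[str], str]


def strip_tags(text: str) -> str:
    """Naive stand-in for HTML stripping: replace each tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def analyse(
    text: str,
    char_filters: Sequence[CharFilter],
    tokenise: Callable[[str], List[str]],
) -> List[str]:
    """Run every char filter over the full input, then tokenise the result."""
    for char_filter in char_filters:
        text = char_filter(text)
    return tokenise(text)
```

For example, `analyse("<p>Hello <b>World</b></p>", [strip_tags, str.lower], str.split)` gives `["hello", "world"]`: the tags are gone before the tokeniser ever sees the text.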