Rowles.LeanCorpus.Analysis.Filters

Classes

AccentFoldingFilter: Normalises accented/diacritic characters to their ASCII base form (e.g., é→e, ñ→n, ü→u) for language-neutral matching. Uses Unicode canonical decomposition followed by stripping combining marks.
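
The decompose-then-strip technique described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the library's implementation; the function name `fold_accents` is hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Fold accented characters to their base form (illustrative sketch)."""
    # NFD is Unicode canonical decomposition: "é" becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero canonical combining class; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("café señor über"))  # cafe senor uber
```

Characters with no canonical decomposition (for example ø or ß) pass through this sketch unchanged; whether AccentFoldingFilter handles them separately is not stated here.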

DecimalDigitFilter: Normalises Unicode decimal digits to ASCII digits.
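
As a rough illustration of digit normalisation, the sketch below maps every character in the Unicode decimal-digit category (Nd) to its ASCII equivalent. It is written in Python for brevity and is an assumption about the general technique, not the filter's actual code; `normalise_digits` is a hypothetical name.

```python
import unicodedata


def normalise_digits(text: str) -> str:
    """Replace any Unicode decimal digit with its ASCII digit (sketch)."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # unicodedata.decimal returns the digit's integer value, 0 to 9.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


# Arabic-Indic digits three, four, five
print(normalise_digits("\u0663\u0664\u0665"))  # 345
```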

ElisionFilter: Removes configured elided articles before straight or curly apostrophes.
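
A minimal sketch of elision removal follows, assuming the filter drops both the article and the apostrophe when the text before the first apostrophe is a configured article (so l'avion becomes avion). The article set and the name `strip_elision` are hypothetical, and the Python here only illustrates the idea.

```python
# Straight apostrophe and right single quotation mark (curly apostrophe).
APOSTROPHES = ("'", "\u2019")

# Hypothetical configuration: a few French elided articles.
FRENCH_ARTICLES = frozenset({"l", "d", "j", "m", "t", "s", "n", "c", "qu"})


def strip_elision(token: str, articles: frozenset = FRENCH_ARTICLES) -> str:
    """Remove a leading elided article and its apostrophe, if present."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            # Only the first apostrophe is considered.
            if token[:i].lower() in articles:
                return token[i + 1:]
            return token
    return token


print(strip_elision("l'avion"))       # avion
print(strip_elision("L\u2019avion"))  # avion
print(strip_elision("o'clock"))       # o'clock
```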

HtmlStripCharFilter: Strips HTML/XML tags from input text, leaving only the text content.
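
For illustration only, tag stripping can be approximated with Python's standard `html.parser` module by collecting text nodes and discarding everything else. How HtmlStripCharFilter treats entities, comments, or script content is not specified here, so this sketch makes no claim about matching it; `strip_tags` is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    """Return only the text content of an HTML/XML fragment (sketch)."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)


print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```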

KeywordMarkerFilter: Identifies tokens that should be treated as keywords by compatible analysers.

LengthFilter: Removes tokens whose text length falls outside an inclusive range.

LowercaseFilter: Performs an in-place lowercase transformation on tokens or a character buffer.

MappingCharFilter: Maps specific characters or strings to replacements using a lookup table. Useful for normalising special characters (e.g., smart quotes → straight quotes).
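
The lookup-table approach might look like the following sketch. It assumes longest-match-first semantics when several keys could match at the same position, which is a common choice but is not confirmed for MappingCharFilter; `apply_mappings` and the sample table are hypothetical.

```python
def apply_mappings(text: str, mappings: dict) -> str:
    """Replace each occurrence of a key with its mapped value (sketch)."""
    # Try longer keys first so "..." wins over "." when both are mapped.
    # Empty keys are ignored because they would never advance the cursor.
    keys = sorted((k for k in mappings if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matched at this position: copy the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


SMART_QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
print(apply_mappings("\u201cit\u2019s\u201d", SMART_QUOTES))  # "it's"
```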

PatternReplaceCharFilter: Replaces text matching a regex pattern with a replacement string.

PorterStemmerFilter: Implements the Porter stemming algorithm as an ITokenFilter, following Porter's 1980 specification for English. Operates on tokens in place, replacing each token's text with its stemmed form.

ReverseStringFilter: Reverses the characters in each token.

ShingleFilter: Emits contiguous token shingles for phrase-oriented analysis.
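
To make "contiguous token shingles" concrete, the sketch below joins every run of adjacent tokens whose length falls within a size range. The parameter names `min_size` and `max_size`, the space separator, and the function name are assumptions for illustration, not ShingleFilter's documented options.

```python
def shingles(tokens, min_size=2, max_size=2, separator=" "):
    """Return word n-grams (shingles) built from adjacent tokens (sketch)."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # larger sizes will not fit either
            out.append(separator.join(tokens[start:end]))
    return out


print(shingles(["quick", "brown", "fox"]))
# ['quick brown', 'brown fox']
print(shingles(["quick", "brown", "fox"], min_size=2, max_size=3))
# ['quick brown', 'quick brown fox', 'brown fox']
```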

StopWordFilter: Removes common English stop words from a token list using a frozen set for fast, allocation-free lookups.

SynonymGraphFilter: Token filter that supports multi-token synonym expansion using a trie-based SynonymMap. Uses longest-match lookahead for multi-word synonyms and inserts replacement tokens at the same position offsets.

SynonymMap: Trie-based synonym map supporting multi-token source phrases. Used by SynonymGraphFilter for longest-match multi-token synonym expansion.
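
The trie and longest-match lookahead described for SynonymGraphFilter and SynonymMap can be sketched as below. This is a simplified illustration: it replaces the matched phrase in the output rather than tracking position offsets, and the class and method names (`SynonymTrie`, `add`, `longest_match`, `expand`) are hypothetical, not the library's API.

```python
class _Node:
    def __init__(self):
        self.children = {}       # next word -> _Node
        self.replacement = None  # tokens to emit if a phrase ends here


class SynonymTrie:
    """Word-level trie keyed by multi-token source phrases (sketch)."""

    def __init__(self):
        self._root = _Node()

    def add(self, source, replacement):
        node = self._root
        for word in source:
            node = node.children.setdefault(word, _Node())
        node.replacement = list(replacement)

    def longest_match(self, tokens, start):
        """Return (length, replacement) for the longest phrase at start."""
        node = self._root
        best = (0, None)
        for offset in range(start, len(tokens)):
            node = node.children.get(tokens[offset])
            if node is None:
                break
            if node.replacement is not None:
                # Keep looking ahead: a longer phrase may still match.
                best = (offset - start + 1, node.replacement)
        return best


def expand(tokens, trie):
    """Rewrite tokens, preferring the longest synonym match at each step."""
    out = []
    i = 0
    while i < len(tokens):
        length, replacement = trie.longest_match(tokens, i)
        if replacement is None:
            out.append(tokens[i])
            i += 1
        else:
            out.extend(replacement)
            i += length
    return out


trie = SynonymTrie()
trie.add(["new", "york"], ["ny"])
trie.add(["new", "york", "city"], ["nyc"])
print(expand(["i", "love", "new", "york", "city"], trie))
# ['i', 'love', 'nyc']
```

The lookahead keeps walking the trie after the first complete phrase, which is why "new york city" wins over the shorter "new york" in the example.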

TruncateTokenFilter: Truncates token text to a maximum character length.

UniqueTokenFilter: Removes duplicate tokens while preserving the first occurrence.

WordDelimiterFilter: Splits compound tokens on delimiters, case changes, and letter-number boundaries.
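
The three split rules (delimiters, case changes, letter-number boundaries) can be illustrated with a single regular expression. This sketch is ASCII-only and treats any non-alphanumeric character as a delimiter; both are simplifying assumptions, and `split_word` is a hypothetical name rather than part of the library.

```python
import re

# Alternatives, tried in order at each position:
#   1. a run of capitals not followed by lowercase (acronyms such as "XML")
#   2. an optional capital followed by lowercase letters ("Power", "shot")
#   3. a run of digits ("500")
# Anything else (hyphens, underscores, ...) is skipped as a delimiter.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_word(token: str) -> list:
    """Split a compound token into sub-words (ASCII-only sketch)."""
    return _PART.findall(token)


print(split_word("PowerShot500"))  # ['Power', 'Shot', '500']
print(split_word("XMLParser"))     # ['XML', 'Parser']
print(split_word("wi-fi"))         # ['wi', 'fi']
```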

Interfaces

ICharFilter: Interface for character-level filters that transform raw text before tokenisation. Char filters run before the tokeniser, operating on the entire input string.