Public namespace Rowles.LeanCorpus.Analysis.Filters

Classes

Public class AccentFoldingFilter

Normalises accented/diacritic characters to their ASCII base form (e.g., é→e, ñ→n, ü→u) for language-neutral matching. Uses Unicode canonical decomposition followed by stripping combining marks.
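The decompose-then-strip approach described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the library's .NET implementation; the function name fold_accents is hypothetical, and treating only nonspacing marks (category Mn) as combining marks is an assumption.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Hypothetical helper: canonical decomposition, then drop combining marks."""
    # NFD splits a precomposed letter such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: only nonspacing marks (category Mn) are stripped.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Characters with no canonical decomposition (for example ø or ß) pass through this sketch unchanged; whether AccentFoldingFilter maps them is not stated here.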

Public class DecimalDigitFilter

Normalises Unicode decimal digits to ASCII digits.
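As an illustration of what digit normalisation means in practice, the following Python sketch maps any character with a Unicode decimal value to its ASCII digit. The function name fold_digits is hypothetical, and the sketch is not the library's implementation.

```python
import unicodedata

def fold_digits(text: str) -> str:
    """Hypothetical helper: replace Unicode decimal digits with ASCII 0-9."""
    out = []
    for ch in text:
        # decimal() returns the digit value, or the default for non-digits.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```

Characters that are digit-like but not decimal digits, such as the superscript ², are left untouched by this sketch.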

Public class ElisionFilter

Removes configured elided articles before straight or curly apostrophes.
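A minimal sketch of elision removal, assuming a French-style article set. The names ARTICLES and strip_elision are hypothetical, the article list is an example rather than the filter's default, and only the first apostrophe in a token is considered.

```python
# Assumed example configuration; the real filter's article set is configurable.
ARTICLES = frozenset({"l", "d", "j", "qu", "n", "m", "t", "s", "c"})
APOSTROPHES = ("'", "\u2019")  # straight and curly apostrophes

def strip_elision(token: str, articles=ARTICLES) -> str:
    """Hypothetical helper: drop a leading elided article and its apostrophe."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            # Strip only when the prefix is a configured article.
            if token[:i].lower() in articles:
                return token[i + 1:]
            return token
    return token
```

With this configuration, "l'avion" becomes "avion", while "aujourd'hui" is left alone because "aujourd" is not a configured article.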

Public class HtmlStripCharFilter

Strips HTML/XML tags from input text, leaving only the text content.

Public class KeywordMarkerFilter

Identifies tokens that should be treated as keywords by compatible analysers.

Public class LengthFilter

Removes tokens whose text length falls outside an inclusive range.

Public class LowercaseFilter

Performs an in-place lowercase transformation on tokens or a character buffer.

Public class MappingCharFilter

Maps specific characters or strings to replacements using a lookup table. Useful for normalising special characters (e.g., smart quotes → straight quotes).

Public class PatternReplaceCharFilter

Replaces text matching a regex pattern with a replacement string.

Public class PorterStemmerFilter

Implements the Porter stemming algorithm as an ITokenFilter, based on Porter's 1980 specification for English stemming. Operates on tokens in place, replacing each token's text with its stemmed form.

Public class ReverseStringFilter

Reverses the characters in each token.

Public class ShingleFilter

Emits contiguous token shingles for phrase-oriented analysis.
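For readers unfamiliar with the term, a shingle is a run of adjacent tokens joined into one token. The sketch below shows fixed-size shingles in Python; the function name, the size parameter, and the space separator are assumptions, and the real filter's options (for example minimum and maximum sizes) are not documented on this page.

```python
def shingles(tokens, size=2, sep=" "):
    """Hypothetical helper: join every run of `size` adjacent tokens."""
    # A list shorter than `size` yields no shingles.
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
```

For the tokens quick, brown, fox and a size of 2, this yields "quick brown" and "brown fox".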

Public class StopWordFilter

Removes common English stop words from a token list using a frozen set for fast, allocation-free lookups.

Public class SynonymGraphFilter

Token filter that supports multi-token synonym expansion using a trie-based SynonymMap. Uses longest-match lookahead for multi-word synonyms and inserts replacement tokens at the same position offsets.

Public class SynonymMap

Trie-based synonym map supporting multi-token source phrases. Used by SynonymGraphFilter for longest-match multi-token synonym expansion.
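The trie plus longest-match idea can be sketched as follows. This is a simplified Python illustration, not the SynonymMap or SynonymGraphFilter API: build_trie and expand are hypothetical names, the trie is a plain nested dict, and the sketch replaces the matched phrase in a flat list rather than tracking position offsets as the real filter is described as doing.

```python
def build_trie(rules):
    """rules maps a source phrase (tuple of tokens) to its replacement tokens."""
    root = {}
    for phrase, replacement in rules.items():
        node = root
        for tok in phrase:
            node = node.setdefault(tok, {})
        node[None] = list(replacement)  # the None key marks the end of a phrase
    return root

def expand(tokens, trie):
    """Replace the longest matching source phrase at each position."""
    out, i = [], 0
    while i < len(tokens):
        node, j, best = trie, i, None
        # Look ahead while the trie still has a matching branch.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                best = (j, node[None])  # remember the longest match so far
        if best:
            i, replacement = best
            out.extend(replacement)
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Given rules for both "new york" and "new york city", the input "new york city" matches the longer phrase, which is the point of the lookahead.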

Public class TruncateTokenFilter

Truncates token text to a maximum character length.

Public class UniqueTokenFilter

Removes duplicate tokens while preserving the first occurrence.

Public class WordDelimiterFilter

Splits compound tokens on delimiters, case changes, and letter-number boundaries.
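The three kinds of boundary named above can be illustrated with a single regular expression. This Python sketch is ASCII-only and is not the filter's actual rule set; split_word and the pattern are assumptions made for illustration.

```python
import re

# Alternatives, in order: a run of capitals not followed by a lowercase letter
# (an acronym), an optionally capitalised lowercase word, or a run of digits.
# Anything else (hyphens, underscores, punctuation) acts as a delimiter.
_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def split_word(token: str):
    """Hypothetical helper: split on delimiters, case changes and letter-digit boundaries."""
    return _PARTS.findall(token)
```

Under these assumptions, "PowerShot" splits into Power and Shot, "wi-fi" into wi and fi, and "SD500" into SD and 500.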

Interfaces

Public interface ICharFilter

Interface for character-level filters that transform raw text before tokenisation. Char filters run before the tokeniser, operating on the entire input string.