Codecs

LeanCorpus stores each segment as a small set of purpose-built codec files. The files are sidecars that share the segment ID prefix, for example seg_0.dic and seg_0.pos. Every binary codec starts with the LeanCorpus magic header and a format version from CodecConstants.
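As a rough illustration of the header check described above, the sketch below parses a magic value and a format version from the start of a codec file. The magic bytes, the version number, and the 4-byte-plus-uint32 layout are all assumptions made for the example; the real values live in CodecConstants and are not given here. The helper names `sidecar_name` and `read_header` are likewise hypothetical.

```python
import struct

# Hypothetical values for illustration only. The real magic bytes and
# format version are defined in CodecConstants and are not shown here.
MAGIC = b"LCRP"
SUPPORTED_VERSION = 1
# Assumed layout: 4 magic bytes followed by a little-endian uint32 version.
HEADER = struct.Struct("<4sI")


def sidecar_name(segment_id: str, extension: str) -> str:
    """Build a sidecar file name from the shared segment ID prefix."""
    return f"{segment_id}{extension}"


def read_header(data: bytes) -> int:
    """Return the format version, or raise ValueError on a bad header."""
    if len(data) < HEADER.size:
        raise ValueError("file too short for codec header")
    magic, version = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    if version > SUPPORTED_VERSION:
        raise ValueError(f"unsupported format version: {version}")
    return version
```

Checking the magic before the version means a file from an unrelated format is rejected early, instead of being misread as a newer LeanCorpus version.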

| Extension | Codec | Used for |
| --- | --- | --- |
| .seg | Segment metadata | JSON metadata for document counts, field names, index sort metadata, delete generation, and vector field descriptors. |
| segments_N | Commit file | Atomic commit manifest listing live segment IDs, commit generation, content token, and CRC32 trailer. |
| .dic | Term dictionary | Maps sorted field\0term keys to postings offsets for term, phrase, prefix, wildcard, fuzzy, and regexp queries. |
| .pos | Postings | Block-packed document IDs, frequencies, positions, and optional payloads for inverted-index queries. |
| .nrm | Norms | Per-document field-length norms used by scoring. |
| .fln | Field lengths | Exact per-field token counts used by BM25 and segment statistics. |
| .fdt | Stored fields data | Stored field payload blocks, optionally compressed. |
| .fdx | Stored fields index | Stored field block offsets and compression metadata for random document lookup. |
| .num | Sparse numeric index | Per-field numeric values keyed by document ID for range queries and compatibility fallback. |
| .bkd | BKD tree | Point index for fast numeric range queries. |
| .dvn | Numeric DocValues | Single-valued numeric column data for sorting and aggregations. |
| .dvs | Sorted DocValues | Single-valued string ordinal columns for sorting, faceting, and collapse. |
| .dss | Sorted-set DocValues | Multi-valued string ordinal columns for repeated StringField values, used by facets and deterministic sort/collapse fallback. |
| .dsn | Sorted-numeric DocValues | Multi-valued numeric columns for repeated NumericField values, used by aggregations and deterministic numeric sort fallback. |
| .dvb | Binary DocValues | Multi-valued UTF-8 byte columns derived from stored-field payloads, used before stored-field scans for string facets and grouping fallback. |
| .vec | Vectors | Per-field dense vector payloads used by vector search. |
| .hnsw | HNSW graph | Approximate nearest-neighbour graph for vector search. |
| .tvd | Term vectors data | Optional per-document term vector payloads. |
| .tvx | Term vectors index | Term vector document offsets. |
| .pbs | Parent bitset | Parent-document markers for block-join queries. |
| .del / _gen_N.del | Live docs | Deleted-document bitsets, either legacy or generation-specific. |
| .stats.json | Segment stats | Per-segment field-length totals and document counts. |
| stats_N.json | Index stats | Commit-level corpus statistics used by searchers. |
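
To see why sorted field\0term keys suit prefix-style queries, the following sketch models the term dictionary as an in-memory sorted list of keys. It is a toy under stated assumptions, not the on-disk .dic format: the UTF-8 byte ordering, the postings offsets, and the helper names `make_key` and `prefix_range` are all invented for illustration.

```python
from bisect import bisect_left


def make_key(field: str, term: str) -> bytes:
    """Join field and term with a NUL byte (encoding is an assumption)."""
    return field.encode("utf-8") + b"\x00" + term.encode("utf-8")


def prefix_range(sorted_keys: list, field: str, prefix: str) -> list:
    """Return all keys in `field` whose term starts with `prefix`."""
    low = make_key(field, prefix)
    start = bisect_left(sorted_keys, low)  # first key >= field\0prefix
    end = start
    # Matching keys are contiguous because the list is sorted.
    while end < len(sorted_keys) and sorted_keys[end].startswith(low):
        end += 1
    return sorted_keys[start:end]


# Toy dictionary: key -> made-up postings offset.
entries = sorted({
    make_key("title", "lean"): 0,
    make_key("title", "leap"): 64,
    make_key("title", "corpus"): 128,
    make_key("body", "lean"): 192,
}.items())
keys = [key for key, _ in entries]

matches = prefix_range(keys, "title", "lea")
```

Because the NUL separator sorts below every other byte, all terms of one field sit in a single contiguous run, so a prefix lookup is one binary search followed by a short forward scan.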

Optional and generated files

write.lock prevents multiple writers from mutating an index at the same time. Temporary *.tmp files can appear during atomic writes and are cleaned up during writer-side recovery.
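
The temporary-file pattern mentioned above can be sketched in general terms as write, flush, then rename. This is a common way to get atomic file replacement and is shown only as an illustration; the function names `atomic_write` and `remove_stale_tmp` are hypothetical, write.lock handling is not modelled, and nothing here is taken from the LeanCorpus implementation.

```python
import os
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a sibling .tmp file, then rename it over the target."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # push bytes to disk before the rename
    os.replace(tmp, path)  # atomic when both paths are on one filesystem


def remove_stale_tmp(index_dir: Path) -> list[str]:
    """Delete leftover *.tmp files, as a recovering writer might."""
    removed = []
    for tmp in sorted(index_dir.glob("*.tmp")):
        tmp.unlink()
        removed.append(tmp.name)
    return removed
```

A reader that opens the target path sees either the old file or the complete new one, never a partial write. A crash before the rename leaves only a *.tmp file behind, which is why cleanup can be deferred to recovery.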

Stored field compression is configured through FieldCompressionPolicy, and the codec records the chosen policy in .fdx. Separately, compression implementations can publish native sidecar binaries when an application is published as Native AOT.

IndexValidator.Check and leancorpus-cli.exe check validate the codec headers for the segment files above, including DocValues, vector, HNSW, term-vector, and live-doc sidecars when they are present. Deep validation opens the reader paths for postings, stored fields, DocValues, vectors, HNSW graphs, and live docs.

See Reliable commits for commit and recovery details.