Glossary
BM25
A classical lexical ranking function for information retrieval, based on term frequency and inverse document frequency with saturation. It remains a strong lexical baseline for hybrid search.
A scoring function from 1990s information retrieval that scores a document against a query by summing term-frequency contributions, each weighted by inverse document frequency and saturated so very high counts do not dominate. The math is simple, the implementation is cheap, and the quality on keyword-style queries is hard to beat.
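The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes pre-tokenized input, uses the common Okapi parameters `k1` and `b` (the latter controls document-length normalization, which the standard formula also includes), and uses the non-negative IDF variant found in Lucene-style engines. The function name `bm25_scores` and the default values are illustrative choices.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    k1 controls term-frequency saturation; b controls length normalization.
    Both defaults are common illustrative values, not tuned settings.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization factor for this document.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption: Lucene-style smoothing).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating term-frequency contribution, bounded by idf * (k1 + 1).
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

With `b=0` (length normalization switched off), a document that repeats a term 100 times scores higher than one that contains it once, but by less than a factor of `k1 + 1`. That bound is the saturation the definition refers to.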
BM25 stayed relevant after dense retrieval took over because it covers
the failure modes of semantic search directly. Exact identifiers,
proper nouns, code function names, technical jargon: lexical search
finds them; embedding search often does not. Hybrid search runs both
and fuses the scores.
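One simple way to fuse the two score lists is a weighted sum after min-max normalization, sketched below. The entry does not say which fusion method a given stack uses, so treat this as one illustrative option among several (rank-based methods such as reciprocal rank fusion are another). The names `fuse_scores` and `alpha` are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking signal, so map everything to 0.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_scores(lexical, dense, alpha=0.5):
    """Return doc ids ranked by a weighted sum of normalized scores.

    lexical, dense: {doc_id: raw score} from each retriever.
    alpha: weight on the lexical side (assumed value, to be tuned).
    A document missing from one retriever contributes 0 from that side.
    """
    lex, den = min_max(lexical), min_max(dense)
    fused = {
        doc: alpha * lex.get(doc, 0.0) + (1 - alpha) * den.get(doc, 0.0)
        for doc in set(lex) | set(den)
    }
    return sorted(fused, key=fused.get, reverse=True)
```

Normalizing first matters because BM25 scores are unbounded while similarity scores from embedding search typically fall in a narrow range; summing them raw would let one retriever dominate.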
Modern retrieval stacks (LlamaIndex, LangChain, Haystack) make BM25 a default option alongside vector search; Elasticsearch, OpenSearch, and Lucene-based engines have BM25 built in. The work since has been on BM25-plus-rerankers and on hybrid fusion methods, not on replacing BM25 itself.