Glossary

BM25

A classical lexical ranking function for information retrieval, based on term frequency and inverse document frequency with saturation. It remains the standard lexical baseline for hybrid search.

Category: Retrieval and Memory. Also known as Okapi BM25.

A scoring function from 1990s information retrieval that scores a document against a query by summing term-frequency contributions, each weighted by inverse document frequency and saturated so very high counts do not dominate. The math is simple, the implementation is cheap, and the quality on keyword-style queries is hard to beat.
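The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `bm25_scores` is hypothetical, documents are assumed to be pre-tokenized lists of strings, `k1=1.5` and `b=0.75` are commonly used values chosen here as assumptions, and the IDF uses a smoothed, non-negative variant (one of several in circulation).

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; assumes `docs` is non-empty and
    contains at least one token overall.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant (an assumption of this sketch).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Saturating term-frequency factor: bounded above by k1 + 1.
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    ["parse_config", "reads", "yaml"],
    ["config", "files", "are", "yaml"],
    ["parse_config", "parse_config", "parse_config", "parse_config"],
]
print(bm25_scores(["parse_config"], docs))
```

The saturation shows up in the term-frequency factor: however often a term repeats, its contribution never exceeds `idf * (k1 + 1)`, so a document that repeats a keyword many times gains only a bounded advantage.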

BM25 stayed relevant after dense retrieval took over because it covers the failure modes of semantic search directly. Exact identifiers, proper nouns, code function names, technical jargon: lexical search finds them; embedding search often does not. Hybrid search runs both and fuses the scores.
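One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. Note that RRF fuses ranks rather than raw scores, which sidesteps the problem that BM25 scores and embedding similarities live on different scales; it is shown here as one option, not as the method any particular stack uses. The function name, the document IDs, and the constant `k=60` (a conventional choice) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Hypothetical helper for illustration.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


lexical = ["d3", "d1", "d2"]  # e.g. a BM25 ranking (made-up IDs)
dense = ["d1", "d2", "d3"]    # e.g. an embedding-search ranking (made-up IDs)
print(reciprocal_rank_fusion([lexical, dense]))
```

A document that ranks reasonably well in both lists tends to beat one that tops a single list and sits low in the other, which is the behavior hybrid search is after.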

Modern retrieval stacks (LlamaIndex, LangChain, Haystack) make BM25 a default option alongside vector search; Elasticsearch, OpenSearch, and other Lucene-based engines have BM25 built in. More recent work has focused on pairing BM25 with rerankers and on hybrid fusion methods, not on replacing BM25 itself.

Sources

The Probabilistic Relevance Framework: BM25 and Beyond (Robertson and Zaragoza, 2009)

Mentioned in

semantic search
