Glossary
ColBERT
A retrieval model that produces per-token embeddings for queries and documents and ranks documents by summing, over the query tokens, each token's maximum similarity to any document token; it is generally more accurate than single-vector retrieval.
A retrieval architecture that produces a vector per token rather than a single vector per document. At query time it computes, for each query token, the maximum similarity against any document token, then sums those maxima. The “late interaction” name distinguishes it from cross-encoders (early interaction, more accurate but slow) and from single-vector retrievers (early aggregation, fast but coarse).
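The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not ColBERT's own implementation. It assumes the per-token embeddings have already been produced by some encoder (not shown) and uses cosine similarity via L2 normalization. The helper names (`normalize`, `dot`, `maxsim_score`) and the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale v to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: each query token keeps only its best-matching
    document token, and those per-token maxima are summed over the query."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy 2-D "token embeddings" (hypothetical values chosen by hand).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has an exact match for each query token
doc_b = [[1.0, 1.0]]                          # one token halfway between them

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # about 1.414 (2 * cos 45°)
```

Note that document length does not enter the sum directly: the score always has exactly one term per query token, and extra document tokens only matter if they become some query token's best match.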
ColBERT's quality sits between bi-encoders and cross-encoder rerankers, with cost much closer to that of bi-encoders when a specialized index (PLAID or similar) is used. It handles compositional queries and proper-noun retrieval better than single-vector dense retrieval because the per-token granularity preserves more lexical signal.
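The cost argument rests on not running MaxSim against every document. A simplified two-stage version, loosely following the end-to-end retrieval described in the original ColBERT paper, first finds the nearest document tokens for each query token and then computes exact MaxSim only for the documents those tokens belong to. The brute-force token search below stands in for an approximate nearest-neighbor index; PLAID's centroid-based pruning is more involved and is not reproduced. All names (`build_token_index`, `search`, `k_per_token`, `top_n`) and the toy corpus are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def unit(v: Vector) -> List[float]:
    """L2-normalize so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(q: List[List[float]], d: List[List[float]]) -> float:
    """Sum over query tokens of the best similarity to any document token."""
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


def build_token_index(corpus: Dict[str, List[Vector]]) -> List[Tuple[str, List[float]]]:
    """Flatten every document's token vectors into (doc_id, unit_vector) pairs.
    A production system would load these into an ANN index instead of a list."""
    return [(doc_id, unit(v)) for doc_id, toks in corpus.items() for v in toks]


def search(query_tokens: Sequence[Vector],
           corpus: Dict[str, List[Vector]],
           index: List[Tuple[str, List[float]]],
           k_per_token: int = 2,
           top_n: int = 3) -> List[Tuple[str, float]]:
    q = [unit(t) for t in query_tokens]

    # Stage 1: candidate generation. Each query token pulls in the documents
    # owning its k nearest token vectors (brute force here, ANN in practice).
    candidates = set()
    for qt in q:
        nearest = sorted(index, key=lambda entry: dot(qt, entry[1]), reverse=True)
        candidates.update(doc_id for doc_id, _ in nearest[:k_per_token])

    # Stage 2: exact MaxSim, computed only for the candidate documents.
    scored = [(doc_id, maxsim(q, [unit(v) for v in corpus[doc_id]]))
              for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


corpus = {
    "d1": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "d2": [[0.0, 0.0, 1.0]],
    "d3": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
}
index = build_token_index(corpus)
print(search([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], corpus, index))
# [('d1', 2.0), ('d3', 1.0)]; d2 is never scored
```

This is the usual recall/cost trade: a larger `k_per_token` admits more candidates (higher recall, more MaxSim work), and any document that no query token reaches in stage 1 is never scored at all.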
It has remained more of a research-grade tool than a production default, partly because it needs a multi-vector database, custom infrastructure that most teams skip in favor of single-vector retrieval plus a reranker. The tradeoff is worth knowing when single-vector retrieval underperforms on a corpus.
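A rough storage estimate shows why the multi-vector index is the sticking point. The figures below are assumptions for illustration only, not measurements: a hypothetical corpus of 1 million passages averaging 100 tokens, one 768-dimensional float32 vector per passage for the single-vector case, and one 128-dimensional float16 vector per token for the multi-vector case. Compression schemes (such as the residual compression used by ColBERTv2) shrink the multi-vector figure substantially and are not modeled here; `raw_index_bytes` is a hypothetical helper.

```python
def raw_index_bytes(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int) -> int:
    """Uncompressed embedding storage; ignores IDs and ANN graph/centroid overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Hypothetical corpus and model sizes (assumptions, not measurements).
NUM_DOCS = 1_000_000
AVG_TOKENS = 100

single = raw_index_bytes(NUM_DOCS, 1, 768, 4)          # one float32 vector per passage
multi = raw_index_bytes(NUM_DOCS, AVG_TOKENS, 128, 2)  # one float16 vector per token

print(f"single-vector: {single / 1e9:.2f} GB")  # single-vector: 3.07 GB
print(f"multi-vector:  {multi / 1e9:.2f} GB")   # multi-vector:  25.60 GB
print(f"ratio: {multi / single:.1f}x")          # ratio: 8.3x
```

Under this model the multiplier grows linearly with average passage length, so corpora of long documents feel the storage cost most.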