Glossary

reranking

A second-pass scoring step that takes the top-k candidates from initial retrieval and rescores them with a more expensive but more accurate cross-encoder model.

Category: Retrieval and Memory. Also known as: rerank, cross-encoder reranking.

A two-stage retrieval pattern. Stage one (the “retriever”) uses a cheap method such as BM25 or bi-encoder embedding similarity to fetch 50 to 200 candidates from a large corpus. Stage two (the “reranker”) uses a cross-encoder model that takes a query and a candidate together as a single input and predicts a relevance score. This is more expensive per candidate but much more accurate.
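The two-stage pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a production retriever: `cheap_score` is a token-overlap stand-in for BM25 or embedding similarity, and `toy_pair_score` is a hypothetical stand-in for a cross-encoder (it only checks adjacent word pairs). All function names and the sample corpus are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def cheap_score(query: str, doc: str) -> float:
    """Stage-one stand-in: number of distinct query tokens found in the doc."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_pair_score(query: str, doc: str) -> float:
    """Stage-two stand-in: fraction of adjacent query word pairs that are also
    adjacent in the doc. A real reranker would call a cross-encoder here."""
    q = query.lower().split()
    d = doc.lower().split()
    q_pairs = set(zip(q, q[1:]))
    d_pairs = set(zip(d, d[1:]))
    return len(q_pairs & d_pairs) / len(q_pairs) if q_pairs else 0.0


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    pair_score: Callable[[str, str], float] = toy_pair_score,
    k_retrieve: int = 100,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    # Stage one: score every document cheaply and keep the top k_retrieve.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:k_retrieve]
    # Stage two: the expensive scorer sees query and candidate together,
    # but only for the short candidate list.
    scored = [(doc, pair_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k_final]


corpus = [
    "how to reset a router password",
    "password reset for email accounts",
    "router firmware update guide",
    "baking bread at home",
]
top = retrieve_then_rerank("reset router password", corpus, k_retrieve=3, k_final=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

In a real system the `pair_score` argument would wrap a cross-encoder model call; the control flow stays the same.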

Reranking exists because the retrieval-quality versus latency trade-off is asymmetric. Embedding similarity is fast over millions of documents but imprecise. Cross-encoder rerankers are precise but too slow to apply at corpus scale. Pairing them gets near-cross-encoder quality at near-bi-encoder cost.
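The asymmetry is easiest to see as back-of-envelope arithmetic. The sketch below compares scoring a whole corpus with a cross-encoder against a two-stage pipeline. The per-call latencies and corpus size are illustrative assumptions, not measurements of any particular model or index.

```python
def full_cross_encoder_ms(corpus_size: int, cross_ms: float) -> float:
    """Cost of scoring every document with the cross-encoder, per query."""
    return corpus_size * cross_ms


def two_stage_ms(k: int, cross_ms: float, first_stage_ms: float) -> float:
    """Cost of one cheap first-stage lookup plus k cross-encoder calls."""
    return first_stage_ms + k * cross_ms


# Assumed, illustrative numbers: 1M documents, 5 ms per cross-encoder call,
# 20 ms for the first-stage lookup, 100 candidates passed to the reranker.
corpus_size = 1_000_000
cross_ms = 5.0
first_stage_ms = 20.0
k = 100

full = full_cross_encoder_ms(corpus_size, cross_ms)
staged = two_stage_ms(k, cross_ms, first_stage_ms)
print(f"cross-encoder over corpus: {full:,.0f} ms per query")
print(f"retrieve then rerank:      {staged:,.0f} ms per query")
```

Under these assumptions the cross-encoder cost depends on k rather than on corpus size, which is why the candidate count (50 to 200 in the description above) is the main latency knob.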

Commonly used reranker models include the open BGE-Reranker, MixedBread mxbai-rerank, and Jina Reranker, plus Cohere Rerank (a closed API). The MTEB benchmark includes a reranker leaderboard; quality has improved steadily since 2023.

