Glossary
prefix caching
A serving optimization that stores the KV cache for shared prompt prefixes (system prompts, few-shot examples) so subsequent requests reusing them skip the prefill compute.
Most LLM traffic in production reuses long prompts. A chat product
sends the same system message on every turn. An agent sends the same
tool descriptions. A RAG pipeline sends the same instruction template.
Prefix caching recognizes that prefix in the KV cache and reuses it,
turning what would be a full prefill into a near-instant attention
lookup.
The cache is content-addressed by a hash of the token sequence. A request’s tokens are matched against the cache block by block until they diverge; matched blocks skip prefill compute, and the divergent suffix is computed normally. Eviction is usually LRU.
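The matching and eviction described above can be sketched in a few lines of Python. This is a toy illustration, not any engine's implementation: `PrefixCache`, `block_hashes`, the block size of 4, and the string standing in for KV tensors are all hypothetical names and values chosen for the example.

```python
import hashlib
from collections import OrderedDict


def block_hashes(tokens, block_size):
    """Return one chained hash per full block of tokens.

    Each hash covers the block's tokens plus the previous block's hash,
    so a hash identifies the entire prefix up to the end of that block.
    """
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ",".join(str(t) for t in block).encode()
        prev = hashlib.sha256(payload).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Toy content-addressed block cache with LRU eviction (assumed design)."""

    def __init__(self, capacity_blocks, block_size=4):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # hash -> placeholder for that block's KV tensors

    def match(self, tokens):
        """Return how many leading tokens are covered by cached blocks."""
        matched = 0
        for h in block_hashes(tokens, self.block_size):
            if h not in self.blocks:
                break  # divergence: the remaining tokens need normal prefill
            self.blocks.move_to_end(h)  # mark the block as recently used
            matched += self.block_size
        return matched

    def insert(self, tokens):
        """Store every full block, evicting the least recently used on overflow."""
        for h in block_hashes(tokens, self.block_size):
            self.blocks[h] = "kv-placeholder"
            self.blocks.move_to_end(h)
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


# Usage: two requests share a 10-token prefix and then diverge.
cache = PrefixCache(capacity_blocks=8, block_size=4)
system = list(range(10))
cache.insert(system + [100, 101])            # first request fills the cache
hit = cache.match(system + [200, 201, 202])  # second request reuses 2 full blocks
print(hit)  # 8
```

Only whole blocks are reusable here, so the second request reuses 8 of its 10 shared tokens and recomputes the rest. The plain LRU in this sketch can evict an early block while later blocks that depend on it remain cached; real engines typically account for that dependency, which this example does not attempt.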
vLLM has shipped automatic prefix caching since 0.4; SGLang builds it into
RadixAttention; TensorRT-LLM and TGI both support it as of 2026. The
practical effect for prefix-heavy traffic is a 2x to 10x reduction in
time-to-first-token latency, depending on prefix length and cache
hit rate.
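To see why the gain depends on prefix length and hit rate, here is a rough back-of-the-envelope estimate in Python. It assumes prefill cost grows linearly with the number of uncached tokens plus a fixed per-request overhead, which is a simplification; the function name `ttft_speedup` and the default cost figures are made up for illustration and are not measurements from any engine.

```python
def ttft_speedup(prefix_tokens, suffix_tokens, hit_rate,
                 ms_per_prefill_token=0.1, fixed_overhead_ms=5.0):
    """Estimate the ratio of mean TTFT without caching to mean TTFT with it.

    Assumes a linear prefill-cost model; the default costs are hypothetical.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Without caching, every request prefills the prefix and the suffix.
    cold = fixed_overhead_ms + ms_per_prefill_token * (prefix_tokens + suffix_tokens)
    # With caching, the prefix is only prefilled on a miss.
    expected_uncached = suffix_tokens + (1.0 - hit_rate) * prefix_tokens
    warm = fixed_overhead_ms + ms_per_prefill_token * expected_uncached
    return cold / warm


# Usage: same 90% hit rate, short versus long shared prefix.
short_prefix = ttft_speedup(prefix_tokens=500, suffix_tokens=200, hit_rate=0.9)
long_prefix = ttft_speedup(prefix_tokens=4000, suffix_tokens=200, hit_rate=0.9)
print(round(short_prefix, 2), round(long_prefix, 2))
```

Under these assumed costs, the longer prefix yields a noticeably larger speedup at the same hit rate, because a larger share of each request's prefill work is skipped on a hit.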