Glossary

PagedAttention

An attention implementation that manages the KV cache in fixed-size blocks, much as an operating system pages virtual memory, which nearly eliminates fragmentation and lets many concurrent requests share GPU memory efficiently.

Category: Runtime. Also known as: paged attention.

The core technical contribution of vLLM. Conventional inference engines allocate one contiguous slab of GPU memory per request to hold the KV cache, sized for the maximum possible sequence length. Most requests never use all of it, so the unused tail is wasted, and bringing in the next request can require defragmentation. The paged design splits the KV cache into fixed-size blocks (typically 16 tokens each) tracked by a page table per request, the way virtual memory works in an OS.
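The per-request page table described above can be illustrated with a small sketch. This is a toy model under stated assumptions, not vLLM's implementation: `BlockAllocator`, `Request`, `append_token`, and `locate` are hypothetical names, and physical blocks are represented by integer ids rather than GPU memory.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block, the typical value cited above


class BlockAllocator:
    """Hypothetical pool of physical KV-cache blocks, identified by integer id."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() hands out ids 0, 1, 2, ... in order.
        self.free = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()


@dataclass
class Request:
    """One sequence plus its page table: logical block index -> physical block id."""

    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block id, offset in block).
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


# Two requests grow in interleaved fashion and draw from the same pool.
allocator = BlockAllocator(num_blocks=8)
a, b = Request(), Request()
for _ in range(20):
    a.append_token(allocator)
for _ in range(5):
    b.append_token(allocator)
for _ in range(15):
    a.append_token(allocator)

print(a.block_table)  # physical blocks of request a, not contiguous
print(a.locate(33))   # where logical token 33 of request a lives
```

Request `a` ends up with non-contiguous physical blocks because `b` claimed one in between, yet each logical token position still resolves through the table. No memory is reserved for tokens that have not been generated yet.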

The result: near-zero memory fragmentation, dynamic growth of each request’s cache as it generates more tokens, and the ability to share identical prefix blocks across requests (which is the basis for prefix caching).
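Sharing prefix blocks across requests can be sketched with reference counts on physical blocks. This is a minimal illustration under assumptions, not how any particular engine does it: `RefCountedPool`, `blocks_needed`, and `fork_with_shared_prefix` are hypothetical names, only completely filled blocks are shared, and how identical prefixes get detected is outside the sketch.

```python
from __future__ import annotations

BLOCK_SIZE = 16  # tokens per block


class RefCountedPool:
    """Hypothetical block pool where a block returns to the free list at zero refs."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks - 1, -1, -1))
        self.refs: dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refs[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block]
            self.free.append(block)


def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def fork_with_shared_prefix(
    pool: RefCountedPool, parent_table: list[int], prefix_tokens: int, total_tokens: int
) -> list[int]:
    """Build a page table that reuses the parent's full prefix blocks.

    Assumes prefix_tokens <= the parent's token count and total_tokens >= prefix_tokens.
    """
    shared = prefix_tokens // BLOCK_SIZE  # only completely filled blocks are shared
    table = [pool.share(b) for b in parent_table[:shared]]
    table += [pool.allocate() for _ in range(blocks_needed(total_tokens) - shared)]
    return table


pool = RefCountedPool(num_blocks=8)
parent = [pool.allocate() for _ in range(blocks_needed(40))]  # 40 tokens
child = fork_with_shared_prefix(pool, parent, prefix_tokens=32, total_tokens=50)

# Internal waste is confined to the last block of each request.
wasted_slots = blocks_needed(40) * BLOCK_SIZE - 40

# Finishing the parent frees only the block the child does not reference.
for block in parent:
    pool.release(block)
```

In this sketch the child stores its 50 tokens in four blocks but allocates only two new ones, and the shared blocks survive the parent's release because the child still holds a reference to them.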

PagedAttention enables continuous batching at high concurrency. Other runtimes adopted the pattern after the original vLLM release: SGLang extends it with RadixAttention for tree-structured prefix sharing, and TensorRT-LLM ships its own block-based KV cache manager.

Sources

Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., SOSP 2023)
