Glossary
vLLM
An open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load.
A Python inferenceruntimeRunning a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training. Open full entry engine for LLMs whose core technical contribution is PagedAttentionruntimeAn attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently. Open full entry , a virtual-memory-style layout for the KV cacheruntimeThe stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix. Open full entry that lets a server pack many concurrent requests into the same GPUsiliconA massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks. Open full entry memory without fragmentation. The engine handles continuous batchingruntimeA request-scheduling pattern where the inference engine adds new requests to the running batch as soon as one finishes a token, instead of waiting for the whole batch to complete. Open full entry , prefix cachingruntimeA serving optimization that stores the KV cache for shared prompt prefixes (system prompts, few-shot examples) so subsequent requests reusing them skip the prefill compute. Open full entry , speculative decodingruntimeAn inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss. Open full entry , and tensor / pipeline parallelism out of the box. It exposes an OpenAI-compatible HTTP API so existing clients work unchanged.
How it differs from sibling runtimes: SGLangruntimeAn open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads. Open full entry adds RadixAttentionruntimeA KV cache management scheme used by SGLang that organizes shared prompt prefixes as a radix tree, letting many requests with overlapping prefixes reuse cached attention state. Open full entry for structured generation patterns; TensorRT-LLMruntimeNVIDIA's closed-source inference engine for NVIDIA GPUs, the fastest runtime on Hopper and Blackwell but tied to NVIDIA's proprietary kernel stack and CUDA. Open full entry ships closed-source NVIDIA kernels optimized for Hopper and Blackwell; llama.cppruntimeGeorgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products. Open full entry targets CPU and small-GPUsiliconA massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks. Open full entry local-firstsovereignty-decentralizationAn architecture stance where inference (and increasingly memory and agent state) runs on the user's own device rather than a remote API, prioritizing privacy, latency, and offline operation. Open full entry deployments. vLLM sits in the middle, optimized for production serving on datacenter GPUs with broad model support across both dense and mixture of expertsweightsA model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute. Open full entry families.
In open-source AI it is the default reference engine for serving open weights at scale. Most published LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry , QwenweightsAlibaba's open-weight model family, leading the multilingual and Chinese-language open-weight space, released under Apache 2.0 with sizes from 0.6B to 235B parameters. Open full entry , MistralweightsA French open-weight model family from Mistral AI, released mostly under Apache 2.0 with strong performance per parameter and notable MoE variants (Mixtral, Mixtral 8x22B). Open full entry , and DeepSeekweightsA Chinese open-weight family known for the V3 MoE base model and the R1 reasoning model, both released under permissive licenses and unusually transparent in their training-cost reporting. Open full entry benchmarks list vLLM throughputcomputeThe rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics. Open full entry numbers; most agent platforms that need self-hosted inferenceruntimeRunning a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training. Open full entry target the vLLM HTTP API. The project is permissively licensed (Apache 2.0governanceA permissive open-source license used by most open-weight model releases (Llama from 4 onward partial, Qwen, Mistral, DeepSeek, Falcon), allowing commercial use without acceptable-use restrictions. Open full entry ) and joined the PyTorch Foundation as a hosted project in May 2025.
Adjacent concepts in the runtime layer: PagedAttention (its core
mechanism), KV cache (the memory it manages), continuous batching
(its scheduling pattern), speculative decoding (an optional
acceleration), and SGLang / TensorRT-LLM (sibling engines).