vLLM is an open source inference engine for large language models, originally built at UC Berkeley by Woosuk Kwon and collaborators and released under the Apache 2.0 license. It is now stewarded by an independent vLLM project team and hosted under the LF AI and Data foundation.

The central technical contribution is PagedAttention, a KV-cache memory manager that treats the GPU KV cache like paged virtual memory: each sequence's cache lives in fixed-size blocks that need not be contiguous, which keeps memory utilization high when request lengths vary.

vLLM matters because the runtime is the layer that decides how cheaply weights become tokens. The same Llama 3 70B model can cost an order of magnitude more to serve on a poorly tuned runtime than on a well-tuned one. vLLM has repeatedly demonstrated parity with or improvement over NVIDIA's closed TensorRT-LLM, and it runs on AMD ROCm, Intel Gaudi, and increasingly TPU, whereas TensorRT-LLM is NVIDIA-only. Within the open set, SGLang is the leading alternative (a different attention scheme, strong on structured generation), and llama.cpp is the local-first sibling (CPU and Apple Silicon rather than server-grade GPUs).

vLLM is production-ready and widely deployed. It is the default open production inference engine in 2026: it ships behind multiple hosted LLM services, is used by self-hosters running 7B-70B models on commodity GPUs, and is the reference engine that other open work compares itself against. The v1 architectural refresh landed in early 2025 and reset the performance baseline.
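The paging idea behind PagedAttention can be illustrated with a toy block-table allocator. This is a minimal sketch for intuition only, not vLLM's implementation: the class `PagedKVCache`, its methods, and the block size of 16 tokens are all hypothetical names and values chosen for the example, and no real tensors are stored.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class OutOfBlocks(Exception):
    """Raised when the pool has no free physical block left."""


class PagedKVCache:
    """Toy block-table allocator (hypothetical; tracks bookkeeping only)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the sequence is new or its
        # last block is completely full.
        if length % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise OutOfBlocks(seq_id)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def utilization(self):
        """Fraction of allocated token slots that actually hold a token."""
        used = self.num_blocks - len(self.free_blocks)
        if used == 0:
            return 0.0
        return sum(self.seq_lens.values()) / (used * BLOCK_SIZE)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4)
    for _ in range(17):          # a 17-token sequence spills into a second block
        cache.append_token("a")
    for _ in range(3):           # a short sequence takes a single block
        cache.append_token("b")
    print(cache.block_tables, cache.utilization())
```

Because blocks are allocated one at a time as a sequence grows, wasted space is bounded by the unfilled tail of each sequence's last block, rather than by a worst-case reservation for the maximum sequence length.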
The Stack · Runtime · Open source
vLLM
Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.
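Continuous batching, named in the summary above, can be sketched with a toy scheduler that admits waiting requests as soon as a slot frees up instead of waiting for the whole batch to finish. This is an illustrative sketch, not vLLM's scheduler: the function names are hypothetical, each request is assumed to need at least one token, and prefill cost, memory limits, and preemption are all ignored.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate), each count >= 1.
    Returns a list of steps; each step lists the request ids decoded in it.
    """
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    steps = []
    while waiting or running:
        # Fill any free slots before every decode iteration.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        steps.append(list(running))
        # One decode iteration: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # the slot is released immediately
    return steps


def static_batching_steps(requests, max_batch):
    """Step count when each fixed batch runs until its longest request ends."""
    total = 0
    for i in range(0, len(requests), max_batch):
        total += max(n for _, n in requests[i:i + max_batch])
    return total


if __name__ == "__main__":
    reqs = [("a", 1), ("b", 3), ("c", 2)]
    print(len(continuous_batching(reqs, max_batch=2)))  # continuous
    print(static_batching_steps(reqs, max_batch=2))     # static
```

In this toy workload the short request finishes after one step and its slot is reused right away, so the continuous schedule needs fewer decode iterations than the static one.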
Sources
- vLLM Documentation https://docs.vllm.ai/
- Efficient Memory Management for LLM Serving with PagedAttention (Kwon et al., 2023) https://arxiv.org/abs/2309.06180
- vLLM v1 Architecture Announcement https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- vLLM on GitHub https://github.com/vllm-project/vllm
Other projects at the Runtime layer
6 siblings · ordered open first
- SGLang Open source
RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.
- llama.cpp Open source
Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.
- Ollama Open source
Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.
- Text Generation Inference (TGI) Open source
Hugging Face's production inference server; in maintenance mode in 2026 as vLLM became the standard.
- MLC-LLM Open source
Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.
- TensorRT-LLM Source available
NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.