The Open-Source AI Stack


vLLM

Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.

Apache 2.0 · stable

vLLM is an open inference engine for large language models, originally built at UC Berkeley by Woosuk Kwon and collaborators and released under Apache 2.0. It is now stewarded by an independent vLLM project team and hosted under the LF AI & Data Foundation. Its central technical contribution is PagedAttention, a KV-cache memory manager that treats the GPU's attention cache like paged virtual memory: the cache is split into fixed-size blocks that are allocated on demand, which keeps utilization high when request lengths vary.

vLLM matters because the runtime is the layer that decides how cheaply weights become tokens. The same Llama 3 70B model can cost an order of magnitude more to serve on a poorly tuned runtime than on a well-tuned one. vLLM has repeatedly demonstrated parity with or improvement over NVIDIA's TensorRT-LLM, its closed counterpart, which runs only on NVIDIA hardware. vLLM also runs on AMD ROCm, Intel Gaudi, and increasingly TPU.

Within the open set, SGLang is the leading alternative (a different attention scheme, strong on structured generation), and llama.cpp is the local-first sibling (CPU and Apple Silicon, not server-grade GPU).

vLLM is production-ready and widely deployed. It is the default open production inference engine in 2026: shipped behind multiple hosted LLM services, used by self-hosters running 7B-70B models on commodity GPUs, and treated as the reference engine that other open work compares itself against. The v1 architectural refresh landed in early 2025 and reset the performance baseline.
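The paged KV-cache idea described above can be illustrated with a toy allocator. This is a minimal sketch and not vLLM's actual implementation, which manages GPU memory and uses custom attention kernels. The class `PagedKVCache`, its methods, and the `BLOCK_SIZE` value are hypothetical names chosen for illustration. The sketch only shows the bookkeeping: each sequence gets a block table that maps its logical positions to physical blocks, and a new block is taken from the free pool only when the previous one fills up.

```python
BLOCK_SIZE = 16  # tokens per block; illustrative value, not a vLLM default claim


class OutOfBlocks(Exception):
    """Raised when the pool has no free physical blocks left."""


class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache (hypothetical API)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        needs_new_block = n % BLOCK_SIZE == 0  # first token, or last block is full
        if needs_new_block and not self.free_blocks:
            raise OutOfBlocks(f"no free block for sequence {seq_id!r}")
        table = self.block_tables.setdefault(seq_id, [])
        if needs_new_block:
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def utilization(self):
        """Fraction of physical blocks currently assigned to some sequence."""
        return (self.num_blocks - len(self.free_blocks)) / self.num_blocks


# Two requests of very different lengths share one small pool.
cache = PagedKVCache(num_blocks=4)
for _ in range(17):
    cache.append_token("long-request")   # spills into a second block
for _ in range(3):
    cache.append_token("short-request")  # fits in a single block
print(cache.block_tables, cache.utilization())
cache.free("long-request")               # blocks become reusable immediately
print(cache.utilization())
```

Because memory is reserved one block at a time, a short request never holds space sized for the longest possible output, and at most one partially filled block per sequence is wasted. That is the property that keeps utilization high under varied-length requests.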


