vLLM

vLLM is an open-source inference engine for large language models, originally built at UC Berkeley by Woosuk Kwon and collaborators and released under the Apache 2.0 license. It is now stewarded by an independent vLLM project team and hosted under the LF AI and Data foundation.

The central technical contribution is PagedAttention, a KV-cache memory manager that treats the GPU KV cache the way an operating system treats virtual memory: the cache is split into fixed-size pages that are allocated on demand, which keeps utilization high when requests vary in length.

vLLM matters because the runtime is the layer that decides how cheaply weights become tokens: the same Llama 3 70B model can cost an order of magnitude more to serve on a poorly tuned runtime than on a well-tuned one. vLLM has repeatedly shown performance on par with or better than NVIDIA's closed TensorRT-LLM, and it runs on AMD ROCm, Intel Gaudi, and increasingly TPU, whereas TensorRT-LLM is NVIDIA-only. Within the open set, SGLang is the leading alternative (a different attention scheme, strong on structured generation), and llama.cpp is the local-first sibling (CPU and Apple Silicon rather than server-grade GPUs).

vLLM is production-ready and widely deployed. It is the default open production inference engine in 2026: it ships behind multiple hosted LLM services, is used by self-hosters running 7B-70B models on commodity GPUs, and is the reference engine that other open work compares itself against. The v1 architectural refresh landed in early 2025 and reset the performance baseline.
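The paging idea behind PagedAttention can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class `PagedKVCache`, its method names, and the block size of 16 tokens are all hypothetical, and no real tensors are stored. It only shows how a per-request block table maps growing sequences onto a shared pool of fixed-size blocks, so a request holds memory in proportion to the tokens it has actually produced.

```python
class PagedKVCache:
    """Toy block allocator (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size          # tokens per block (illustrative value)
        self.free = list(range(num_blocks))   # ids of unused physical blocks
        self.tables = {}                      # request id -> list of physical block ids
        self.lengths = {}                     # request id -> number of tokens stored

    def append_token(self, req_id: str):
        """Reserve a slot for one more token; return (physical block, offset)."""
        n = self.lengths.get(req_id, 0)
        table = self.tables.setdefault(req_id, [])
        if n % self.block_size == 0:          # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free.pop())     # allocate a new block on demand
        self.lengths[req_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens need two 16-token blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens need one block
print(len(cache.tables["a"]), len(cache.tables["b"]), len(cache.free))
cache.release("a")            # both of request "a"'s blocks become reusable
print(len(cache.free))
```

Because blocks are claimed only when a sequence crosses a block boundary, a short request never reserves space for a maximum-length one, which is the utilization benefit described above.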

Other projects at the Runtime layer

12 siblings · ordered open first

SGLang Open source

RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.

llama.cpp Open source

Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.

Ollama Open source

Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.

Text Generation Inference (TGI) Open source

Hugging Face's production inference server; in maintenance mode in 2026 as vLLM became the standard.

MLC-LLM Open source

Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.

Outlines Open source

Structured-output library using finite-state-machine guided decoding to force model output to match a regex, JSON Schema, or grammar.

XGrammar Open source

Fast grammar-based constrained-decoding engine, used as a structured-output backend by several open serving engines.

llguidance Open source

Low-level Rust constrained-decoding engine that enforces grammars and JSON Schema at high throughput; powers Guidance.

Guidance Open source

A programming model for constrained generation that interleaves control flow with model calls and enforces regex, grammars, and JSON Schema.

Instructor Open source

Library for getting Pydantic-typed structured output from LLMs across many providers, built on function calling and JSON mode with automatic retries.

LM Format Enforcer Open source

Token-filtering library that enforces a JSON Schema or regex during generation; integrates with Transformers, vLLM, and llama.cpp.

TensorRT-LLM Source available

NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.
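Several of the siblings above (Outlines, XGrammar, llguidance, LM Format Enforcer) share one mechanism: at each decoding step, tokens that would take the output outside the target pattern are masked out before one is chosen. The sketch below illustrates that mechanism only and is not any of those libraries' APIs. It assumes a hand-written automaton for the pattern `-?[0-9]+`, a made-up six-token vocabulary, and made-up scores standing in for model logits.

```python
EOS = "<eos>"
ACCEPTING = {2}  # state 2: at least one digit has been emitted


def step(state: int, ch: str):
    """One character transition of a DFA for -?[0-9]+ (None means dead)."""
    if ch in "0123456789":
        return 2
    if ch == "-" and state == 0:
        return 1
    return None


def advance(state: int, token: str):
    """Walk a multi-character token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def allowed_tokens(state: int, vocab: list) -> list:
    """Tokens that keep the output inside the pattern from this state."""
    out = []
    for tok in vocab:
        if tok == EOS:
            if state in ACCEPTING:       # stopping is legal only on a full match
                out.append(tok)
        elif advance(state, tok) is not None:
            out.append(tok)
    return out


def constrained_greedy(scores: dict, vocab: list, max_steps: int = 8) -> str:
    """Greedy decoding that picks the best-scoring token among allowed ones."""
    state, text = 0, ""
    for _ in range(max_steps):
        choices = allowed_tokens(state, vocab)
        tok = max(choices, key=lambda t: scores[t])
        if tok == EOS:
            break
        text += tok
        state = advance(state, tok)
    return text


vocab = ["-", "1", "23", "4a", "abc", EOS]
# Hypothetical fixed scores; a real model would produce new logits every step.
scores = {"abc": 5.0, "4a": 4.0, EOS: 3.0, "-": 2.0, "23": 1.0, "1": 0.5}
print(constrained_greedy(scores, vocab))
```

The unconstrained favourite here is `abc`, which the mask never permits. Production engines compile the automaton from a regex, JSON Schema, or grammar and precompute the per-state token masks rather than rescanning the vocabulary at each step.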
