Runtime

What it is

Overview

The engines that take a transformerruntimeThe neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM. Open full entry model and serve tokens efficiently on actual hardware. Runtime is the layer most invisible to end users (you don’t see it from a chat UI) but most decisive for production economics: the same model can cost ten times more on a poorly-tuned runtime than a well-tuned one.

Five things to keep in mind as you read:

Runtime decides what serving actually costs. The model doesn’t change; the engine that runs it does.
Two ecosystems split the layer. Server-class (many users per GPU, datacenter hardware) and local-first (one user, Mac or consumer GPU).
vLLMruntimeAn open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load. Open full entry is the server-class default for open serving. SGLangruntimeAn open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads. Open full entry and TGI compete; TensorRT-LLMruntimeNVIDIA's closed-source inference engine for NVIDIA GPUs, the fastest runtime on Hopper and Blackwell but tied to NVIDIA's proprietary kernel stack and CUDA. Open full entry is the closed equivalent.
llama.cppruntimeGeorgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products. Open full entry is the local-first default. OllamaruntimeA local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine. Open full entry wraps it; MLXruntimeApple's open-source ML framework designed for Apple Silicon's unified memory architecture, the local-first inference engine for Mac and increasingly iPad and iPhone. Open full entry is the Apple-native variant.
The open side leads on runtime research. PagedAttentionruntimeAn attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently. Open full entry , RadixAttentionruntimeA KV cache management scheme used by SGLang that organizes shared prompt prefixes as a radix tree, letting many requests with overlapping prefixes reuse cached attention state. Open full entry , speculative decodingruntimeAn inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss. Open full entry all emerged here first.

The rest of this page walks the two ecosystems and then arrives at the asymmetry between technical merits and enterprise adoption.

Server-class runtime

What runs a model on datacenter hardware for many concurrent users. Optimized for throughput, batching efficiency, and hardware utilization.

vLLM is the de facto open default. Originated at UC Berkeley in 2023 with the PagedAttention paper (Kwon et al., 2023), which treats the KV cache as paged memory the way operating systems treat process memory. The result is significantly higher GPU utilization on concurrent inference than the naive implementation. The vLLM project has since grown into a broader serving framework with continuous batchingruntimeA request-scheduling pattern where the inference engine adds new requests to the running batch as soon as one finishes a token, instead of waiting for the whole batch to complete. Open full entry , speculative decoding, and Multi-LoRA inferenceruntimeServing many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies. Open full entry . Most open serving today runs on vLLM under the hood (vLLM repository).

SGLang is the credible alternative. Built around RadixAttention (prefix-tree-based KV cache sharing across requests with overlapping prompts) and a structured-generation DSL that constrains output to a grammar (SGLang paper, NeurIPS 2024). SGLang wins for workloads with high prompt-prefix overlap (agent traces with a shared system prompt) and for cases where you need guaranteed JSON output without sampling-and-retry.

TGI (Text Generation Inference, HuggingFace) was the production-grade open engine before vLLM ate its share. Still used in the HuggingFace ecosystem but no longer the default open recommendation (TGI repository).

TensorRT-LLM (NVIDIA) is the closed equivalent. NVIDIA-optimized kernels, tightest integration with H100/H200/Blackwell, used by the hyperscaler-hosted endpoints and many enterprise deployments (TensorRT-LLM repository). Performance-competitive with vLLM but not portable off NVIDIA.

Local-first runtime

The other half of the layer. Optimized for one user on a Mac or consumer GPU.

llama.cpp is the foundation. C++ implementation, GGUF quantization format (4-bit and 5-bit being the popular trade-off points), runs efficiently on Apple Silicon, x86 CPUs, and consumer NVIDIA / AMD GPUs. The reason “Llama-class models on a laptop” became normal in 2024-2026 (llama.cpp repository).

Ollama wraps llama.cpp in a “docker run” style CLI: pulls a model from a registry, runs it locally, exposes a simple HTTP API. Made local model inference accessible to people who didn’t want to deal with llama.cpp’s command-line flags (Ollama project).

MLX is Apple’s native ML framework for Apple Silicon. Tighter integration with the M-series unified memory than llama.cpp’s Metal backend, and increasingly the engine of choice for serious local-AI work on Macs (MLX repository).

LM Studio is the consumer GUI layer over llama.cpp. Not a runtime per se, but it’s how non-developers run local models on Windows and Mac.

The runtime-research pattern

A consistent observation across 2023-2026: the major runtime innovations emerge in open research first and get absorbed by closed runtimes a quarter or two later, not the reverse.

PagedAttention (open, vLLM, 2023) → adopted by closed serving infrastructure within months. FlashAttention (open, Stanford, 2022; FlashAttention-2 in 2023; FlashAttention-3 in 2024) → absorbed into TensorRT-LLM and into the hyperscaler-hosted endpoints. Speculative decoding (research papers, open implementations, then commercial adoption). RadixAttention (open, SGLang, 2024) → adopted by other engines.

The pattern matters because it reverses the usual default (“closed labs do the hard work, open catches up”). At the runtime layer the open ecosystem leads. That’s partly because the work happens at universities and partly because runtime optimizations are operator-replicable in a way that billion-dollar pretraining runs are not.

What’s open and what isn’t

Most of the runtime stack is open source.

Open server-class: vLLM, SGLang, TGI, llamafile, Aphrodite Engine, MII (Microsoft, open). Cover most production open-serving deployments.
Closed server-class: TensorRT-LLM, the proprietary inference engines inside Bedrock / Vertex / Together / Anyscale / Fireworks managed offerings.
Open local-first: llama.cpp, Ollama, MLX, LM Studio (UI), Jan, GPT4All.
Closed local-first: almost nothing. The local-first category is essentially open by default because there’s no enterprise revenue to defend.

The reverse-lock-in risk is at the managed-service layer. A team that deploys to Bedrock’s Claude or Together’s hosted DeepSeek isn’t using the open runtime even when the model is open-weights; they’re using a closed managed service whose runtime details are opaque. Portability across managed services is workable (the OpenAI-compatible API has become the de facto interchange format) but full sovereignty over the runtime requires self-hosting.

The editorial tension

Two observations that pull in different directions.

On technical merits, open runtime wins. The PagedAttention → RadixAttention → speculative-decoding lineage shows the open side leading the research and the closed runtimes catching up. For a sophisticated team that wants to optimize its own inference stack, vLLM plus SGLang plus llama.cpp is the strongest combination available.

On enterprise adoption, closed runtime still dominates by revenue. Most production AI inference billable to enterprise customers routes through Bedrock, Vertex AI, OpenAI’s API, or Anthropic’s API. The technical superiority of open runtime doesn’t translate into market share because enterprise procurement values “one throat to choke” over “lowest per-token cost”, and managed services bundle the runtime with the billing, the SLA, and the legal indemnification.

The strategic question for an open-AI advocate is whether the open runtime’s technical lead eventually breaks through to enterprise procurement, or whether the closed managed-service layer remains a permanent moat regardless of what’s underneath it. As of 2026, the second outcome looks more likely than the first, which is why open runtime advocates increasingly focus on enabling sovereign-state and individual-scale deployments rather than competing for hyperscaler API revenue head-on.

Key projects

16 catalogued · ordered open first · "Details" for a project page, "Ask" for an in-context chat

vLLM Open source

Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.

Apache 2.0 · stable · GitHub · Details →
SGLang Open source

RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.

Apache 2.0 · stable · GitHub · Details →
llama.cpp Open source

Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.

MIT · stable · GitHub · Details →
Ollama Open source

Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.

MIT · stable · GitHub · Details →
Text Generation Inference (TGI) Open source

HuggingFace's production inference server; maintenance mode in 2026 as vLLM became the standard.

Apache 2.0 · maintenance · GitHub
MLC-LLM Open source

Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.

Apache 2.0 · beta · GitHub
Petals Open source

Cooperative pipeline-parallel inference across volunteer devices; the canonical decentralized-inference reference project.

MIT · research · GitHub
MLX (Apple) Open source

Apple's open ML framework for Apple Silicon; enables local-first model running and small-cluster pipeline inference.

MIT · stable · GitHub
Outlines Open source

Structured-output library using finite-state-machine guided decoding to force model output to match a regex, JSON Schema, or grammar.

Apache 2.0 · stable · GitHub · Details →
XGrammar Open source

Fast grammar-based constrained-decoding engine, used as a structured-output backend by several open serving engines.

Apache 2.0 · stable · GitHub · Details →
llguidance Open source

Low-level Rust constrained-decoding engine that enforces grammars and JSON Schema at high throughput; powers Guidance.

MIT · stable · GitHub · Details →
Guidance Open source

A programming model for constrained generation that interleaves control flow with model calls and enforces regex, grammars, and JSON Schema.

MIT · stable · GitHub · Details →
Instructor Open source

Library for getting Pydantic-typed structured output from LLMs across many providers, built on function calling and JSON mode with automatic retries.

MIT · stable · GitHub · Details →
LM Format Enforcer Open source

Token-filtering library that enforces a JSON Schema or regex during generation; integrates with Transformers, vLLM, and llama.cpp.

MIT · stable · GitHub · Details →
TensorRT-LLM Source available

NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.

Source-available · stable
Apple Silicon (M-series) Proprietary

Unified memory architecture; closed silicon, but the strongest on-device inference platform via llama.cpp and MLX.

Proprietary · stable · Details →

Reading list

12 hand-picked · primary sources first · recent within each group

Papers

Speculative Decoding: A Survey 2024

Various

Why your tokens per second keeps going up without bigger GPUs. Survey-quality overview.
Efficient Memory Management for Large Language Model Serving with PagedAttention 2023

vLLM team (Woosuk Kwon et al.)

The vLLM paper. PagedAttention is the single most important runtime idea of the past three years.
SGLang: Efficient Execution of Structured Language Model Programs 2023

LMSYS / SGLang team

RadixAttention and structured generation. Read after vLLM to see the design space.
Lost in the Middle: How Language Models Use Long Contexts 2023

Liu et al.

Why long context isn't free and retrieval still wins for production systems.
Petals: Collaborative Inference of Large Models 2022

BigScience / Petals team

Pipeline-parallel inference across consumer GPUs over the public internet. Read for the latency-vs-sovereignty trade analysis.

Posts

LLM Inference Engines (Ahmad Osman, 2026) 2026

@TheAhmadOsman on X

Decision-framework piece on inference engines. The four engine families (portable, Apple, consumer CUDA, production serving), the one-page decision guide, the bottleneck taxonomy. Anchors the self-host learn track's inference-engines module.
GPU Memory Math for LLMs (Ahmad Osman, 2026) 2026

@TheAhmadOsman on X

The VRAM ≈ parameters × (bits ÷ 8) formula with worked tables across FP16, FP8, and 4-bit. Covers the VRAM tax (KV cache, activations, framework overhead) and MoE gotchas.
vLLM V1: A Major Upgrade to vLLM Architecture 2025

vLLM team

The architecture rewrite that landed in early 2025. Reset the perf baseline; read alongside the original PagedAttention paper to see what changed.
Groq vs. NVIDIA: The Inference Speed Question 2024

Semianalysis

Inference-specialty silicon (Groq LPU, Cerebras) vs. NVIDIA's all-purpose accelerators. Where each wins.
Anyscale's Inference-Cost Notes 2024

Anyscale (Robert Nishihara et al.)

Production-honest analysis of inference economics. Filter to the runtime-cost posts.

Docss

llama.cpp: Inference of LLaMA Model in Pure C/C++ 2025

Georgi Gerganov (project README)

Read the README and CONTRIBUTING. The local-first runtime philosophy lives here.
GGUF File Format Spec 2024

ggml/llama.cpp project

The de facto local-first weights format. Read this if you want to understand how local inference distributes models.