The Open-Source AI Stack
RSS

The Stack · Core layer

Runtime

Inference engines that serve tokens from weights.

What it is

Overview

The engines that take a transformerruntimeThe neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM. Open full entry model and serve tokens efficiently on actual hardware. Runtime is the layer most invisible to end users (you don’t see it from a chat UI) but most decisive for production economics: the same model can cost ten times more on a poorly-tuned runtime than a well-tuned one.

Five things to keep in mind as you read:

  • Runtime decides what serving actually costs. The model doesn’t change; the engine that runs it does.
  • Two ecosystems split the layer. Server-class (many users per GPU, datacenter hardware) and local-first (one user, Mac or consumer GPU).
  • vLLMruntimeAn open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention to manage KV cache memory and serve tokens efficiently under load. Open full entry is the server-class default for open serving. SGLangruntimeAn open inference engine from the LMSYS team featuring RadixAttention for prefix sharing and a structured-generation frontend, particularly strong on agent and tool-calling workloads. Open full entry and TGI compete; TensorRT-LLMruntimeNVIDIA's closed-source inference engine for NVIDIA GPUs, the fastest runtime on Hopper and Blackwell but tied to NVIDIA's proprietary kernel stack and CUDA. Open full entry is the closed equivalent.
  • llama.cppruntimeGeorgi Gerganov's C++ inference engine optimized for CPUs and consumer GPUs, the on-device standard and the engine behind Ollama, LM Studio, and most local-first AI products. Open full entry is the local-first default. OllamaruntimeA local inference runtime that wraps llama.cpp with a Docker-style developer experience, the easiest path to running open-weight models on a personal machine. Open full entry wraps it; MLXruntimeApple's open-source ML framework designed for Apple Silicon's unified memory architecture, the local-first inference engine for Mac and increasingly iPad and iPhone. Open full entry is the Apple-native variant.
  • The open side leads on runtime research. PagedAttentionruntimeAn attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently. Open full entry , RadixAttentionruntimeA KV cache management scheme used by SGLang that organizes shared prompt prefixes as a radix tree, letting many requests with overlapping prefixes reuse cached attention state. Open full entry , speculative decodingruntimeAn inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss. Open full entry all emerged here first.

The rest of this page walks the two ecosystems and then arrives at the asymmetry between technical merits and enterprise adoption.

Server-class runtime

What runs a model on datacenter hardware for many concurrent users. Optimized for throughput, batching efficiency, and hardware utilization.

vLLM is the de facto open default. Originated at UC Berkeley in 2023 with the PagedAttention paper (Kwon et al., 2023), which treats the KV cache as paged memory the way operating systems treat process memory. The result is significantly higher GPU utilization on concurrent inference than the naive implementation. The vLLM project has since grown into a broader serving framework with continuous batchingruntimeA request-scheduling pattern where the inference engine adds new requests to the running batch as soon as one finishes a token, instead of waiting for the whole batch to complete. Open full entry , speculative decoding, and Multi-LoRA inferenceruntimeServing many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies. Open full entry . Most open serving today runs on vLLM under the hood (vLLM repository).

SGLang is the credible alternative. Built around RadixAttention (prefix-tree-based KV cache sharing across requests with overlapping prompts) and a structured-generation DSL that constrains output to a grammar (SGLang paper, NeurIPS 2024). SGLang wins for workloads with high prompt-prefix overlap (agent traces with a shared system prompt) and for cases where you need guaranteed JSON output without sampling-and-retry.

TGI (Text Generation Inference, HuggingFace) was the production-grade open engine before vLLM ate its share. Still used in the HuggingFace ecosystem but no longer the default open recommendation (TGI repository).

TensorRT-LLM (NVIDIA) is the closed equivalent. NVIDIA-optimized kernels, tightest integration with H100/H200/Blackwell, used by the hyperscaler-hosted endpoints and many enterprise deployments (TensorRT-LLM repository). Performance-competitive with vLLM but not portable off NVIDIA.

Local-first runtime

The other half of the layer. Optimized for one user on a Mac or consumer GPU.

llama.cpp is the foundation. C++ implementation, GGUF quantization format (4-bit and 5-bit being the popular trade-off points), runs efficiently on Apple Silicon, x86 CPUs, and consumer NVIDIA / AMD GPUs. The reason “Llama-class models on a laptop” became normal in 2024-2026 (llama.cpp repository).

Ollama wraps llama.cpp in a “docker run” style CLI: pulls a model from a registry, runs it locally, exposes a simple HTTP API. Made local model inference accessible to people who didn’t want to deal with llama.cpp’s command-line flags (Ollama project).

MLX is Apple’s native ML framework for Apple Silicon. Tighter integration with the M-series unified memory than llama.cpp’s Metal backend, and increasingly the engine of choice for serious local-AI work on Macs (MLX repository).

LM Studio is the consumer GUI layer over llama.cpp. Not a runtime per se, but it’s how non-developers run local models on Windows and Mac.

The runtime-research pattern

A consistent observation across 2023-2026: the major runtime innovations emerge in open research first and get absorbed by closed runtimes a quarter or two later, not the reverse.

PagedAttention (open, vLLM, 2023) → adopted by closed serving infrastructure within months. FlashAttention (open, Stanford, 2022; FlashAttention-2 in 2023; FlashAttention-3 in 2024) → absorbed into TensorRT-LLM and into the hyperscaler-hosted endpoints. Speculative decoding (research papers, open implementations, then commercial adoption). RadixAttention (open, SGLang, 2024) → adopted by other engines.

The pattern matters because it reverses the usual default (“closed labs do the hard work, open catches up”). At the runtime layer the open ecosystem leads. That’s partly because the work happens at universities and partly because runtime optimizations are operator-replicable in a way that billion-dollar pretraining runs are not.

What’s open and what isn’t

Most of the runtime stack is open source.

  • Open server-class: vLLM, SGLang, TGI, llamafile, Aphrodite Engine, MII (Microsoft, open). Cover most production open-serving deployments.
  • Closed server-class: TensorRT-LLM, the proprietary inference engines inside Bedrock / Vertex / Together / Anyscale / Fireworks managed offerings.
  • Open local-first: llama.cpp, Ollama, MLX, LM Studio (UI), Jan, GPT4All.
  • Closed local-first: almost nothing. The local-first category is essentially open by default because there’s no enterprise revenue to defend.

The reverse-lock-in risk is at the managed-service layer. A team that deploys to Bedrock’s Claude or Together’s hosted DeepSeek isn’t using the open runtime even when the model is open-weights; they’re using a closed managed service whose runtime details are opaque. Portability across managed services is workable (the OpenAI-compatible API has become the de facto interchange format) but full sovereignty over the runtime requires self-hosting.

The editorial tension

Two observations that pull in different directions.

On technical merits, open runtime wins. The PagedAttention → RadixAttention → speculative-decoding lineage shows the open side leading the research and the closed runtimes catching up. For a sophisticated team that wants to optimize its own inference stack, vLLM plus SGLang plus llama.cpp is the strongest combination available.

On enterprise adoption, closed runtime still dominates by revenue. Most production AI inference billable to enterprise customers routes through Bedrock, Vertex AI, OpenAI’s API, or Anthropic’s API. The technical superiority of open runtime doesn’t translate into market share because enterprise procurement values “one throat to choke” over “lowest per-token cost”, and managed services bundle the runtime with the billing, the SLA, and the legal indemnification.

The strategic question for an open-AI advocate is whether the open runtime’s technical lead eventually breaks through to enterprise procurement, or whether the closed managed-service layer remains a permanent moat regardless of what’s underneath it. As of 2026, the second outcome looks more likely than the first, which is why open runtime advocates increasingly focus on enabling sovereign-state and individual-scale deployments rather than competing for hyperscaler API revenue head-on.

Key projects

10 catalogued · ordered open first · "Details" for a project page, "Ask" for an in-context chat

  • vLLM Open source

    Dominant open production inference engine; PagedAttention and continuous batching; NVIDIA / AMD / Intel / TPU support.

    Apache 2.0 · stable · GitHub · Details →
  • SGLang Open source

    RadixAttention plus structured generation; from the LMSYS team; gains for shared-prefix and agent workloads.

    Apache 2.0 · stable · GitHub · Details →
  • llama.cpp Open source

    Georgi Gerganov's local-first inference engine; defines the GGUF format; the on-device standard.

    MIT · stable · GitHub · Details →
  • Ollama Open source

    Local model runner; Docker-style UX over llama.cpp; the easiest way to run open weights on your machine.

    MIT · stable · GitHub · Details →
  • HuggingFace's production inference server; maintenance mode in 2026 as vLLM became the standard.

    Apache 2.0 · maintenance · GitHub
  • MLC-LLM Open source

    Cross-platform compilation (TVM-based); the 'LLM in your browser' or 'on your phone' standard.

    Apache 2.0 · beta · GitHub
  • Petals Open source

    Cooperative pipeline-parallel inference across volunteer devices; the canonical decentralized-inference reference project.

    MIT · research · GitHub
  • MLX (Apple) Open source

    Apple's open ML framework for Apple Silicon; enables local-first model running and small-cluster pipeline inference.

    MIT · stable · GitHub
  • TensorRT-LLM Source available

    NVIDIA's closed-runtime counterpart; fastest on NVIDIA hardware; depends on closed CUDA kernels and the proprietary TensorRT compiler.

    Source-available · stable
  • Unified memory architecture; closed silicon, but the strongest on-device inference platform via llama.cpp and MLX.

    Proprietary · stable · Details →