04 Inference engines


The traffic cop, memory manager, scheduler, and API surface that turns hardware into served tokens.

An inference engine is the software layer between the model weights on disk and the tokens streamed back to the caller. It loads the weights in the right precision, tokenizes the prompt, runs the forward pass on whatever accelerator is present, samples the next token, maintains the KV cache across steps, and streams output back to the client. Everything that determines how fast a given model serves on given hardware (memory layout, scheduler, batching, kernel selection, attention implementation) lives in this layer.
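
The per-request loop described above can be sketched in a few lines. This is a toy illustration, not any engine's real code: `toy_forward` is a hypothetical stand-in for the model's forward pass, the "cache" is just a list of token ids rather than per-layer key/value tensors, and sampling is plain greedy argmax.

```python
from typing import Iterator, List, Tuple

VOCAB_SIZE = 8  # toy vocabulary; real models have tens of thousands of entries


def toy_forward(new_tokens: List[int], kv_cache: List[int]) -> Tuple[List[float], List[int]]:
    """Hypothetical stand-in for a transformer forward pass.

    Only the new tokens are processed; everything earlier is read from the
    cache. The fake logits depend on the cached context so the loop is
    deterministic.
    """
    kv_cache = kv_cache + new_tokens  # a real engine appends K/V tensors per layer
    target = (sum(kv_cache) + 1) % VOCAB_SIZE
    logits = [float(-abs(target - tok)) for tok in range(VOCAB_SIZE)]
    return logits, kv_cache


def greedy_sample(logits: List[float]) -> int:
    """Pick the highest-scoring token id (argmax)."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(prompt_tokens: List[int], max_new_tokens: int) -> Iterator[int]:
    """Yield tokens one at a time, the way an engine streams them."""
    # Prefill: the whole prompt goes through the model in one call.
    logits, kv_cache = toy_forward(prompt_tokens, [])
    for _ in range(max_new_tokens):
        next_token = greedy_sample(logits)
        yield next_token  # stream to the caller immediately
        # Decode: only the newly sampled token is fed back in.
        logits, kv_cache = toy_forward([next_token], kv_cache)


if __name__ == "__main__":
    print(list(generate([1, 2, 3], 3)))
```

The point of the sketch is the shape of the loop: one wide call for the prompt, then many narrow calls that each reuse the cache instead of reprocessing the prefix.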

The workload has two phases with very different characteristics. Prefill processes the entire input prompt in parallel and fills the KV cache; it is compute-intensive and saturates the matrix-multiply units. Decode generates tokens one at a time and is memory-bandwidth-bound, since each step must stream the full set of activated weights from memory once. An engine that is excellent at prefill can still be mediocre at decode, and vice versa. Time-to-first-token latency measures prefill quality; time-per-output-token latency measures decode quality. The two metrics rarely move together.
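
The bandwidth bound on decode lends itself to a back-of-the-envelope estimate. The sketch below assumes a single request, no batching, and ignores KV cache reads and kernel overhead, so it gives a ceiling rather than a prediction. The parameter count, precision, and bandwidth figure are illustrative, not measurements of any specific hardware.

```python
def decode_tokens_per_second_ceiling(
    active_params: float,
    bytes_per_param: float,
    memory_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-stream decode speed.

    Each decode step has to read every activated weight once, so the step
    time cannot be shorter than (bytes of active weights) / (bandwidth).
    """
    bytes_per_step = active_params * bytes_per_param
    return memory_bandwidth_bytes_per_s / bytes_per_step


if __name__ == "__main__":
    # Illustrative: 8B active parameters, 16-bit weights, 1 TB/s of bandwidth.
    ceiling = decode_tokens_per_second_ceiling(8e9, 2, 1e12)
    print(f"{ceiling:.1f} tokens/s")  # 62.5 tokens/s
```

Halving the bytes per parameter (for example with 8-bit weights) doubles the ceiling, which is why quantization helps decode speed as well as memory footprint.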

The field has settled into four engine families. Portable local runtimes (llama.cpp with GGUF, MLC LLM, ONNX Runtime GenAI, OpenVINO) prioritize cross-platform support and on-device deployment over peak performance. They run on CPUs, integrated GPUs, NPUs, and a wide range of discrete accelerators. Apple unified-memory runtimes (MLX, MLX-LM) target the M-series shared memory model and exploit it with custom kernels that discrete-GPU engines cannot match. Consumer CUDA quant engines (ExLlamaV2, ExLlamaV3) target single or paired RTX cards with EXL2 or EXL3 quants and tight single-stream decode. Production serving engines (vLLM, SGLang, TensorRT-LLM, TGI, LMDeploy) are designed for many concurrent requests, with PagedAttention-style KV cache management, continuous batching, and the operational features (metrics, structured output, multi-LoRA) that production deployments demand.
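
To see why continuous batching matters for the production family, a toy simulation helps. The sketch below counts decode steps only and assumes every request needs at least one token. It models no real engine's scheduler: both function names are hypothetical, and prefill cost, preemption, and memory limits are all ignored.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths: List[int], max_batch: int) -> int:
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every running request.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Finished requests leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
    return steps


if __name__ == "__main__":
    output_lengths = [2, 5, 1, 3]  # tokens each request will generate
    print(static_batching_steps(output_lengths, max_batch=2))      # 8
    print(continuous_batching_steps(output_lengths, max_batch=2))  # 6
```

The gap widens as output lengths become more uneven, which is the common case for chat traffic.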

A new orchestration tier sits above engines. NVIDIA Dynamo coordinates fleets of inference workers across multiple nodes, doing KV cache-aware request routing, disaggregated prefill and decode workers, and cross-node parallelism strategies. Dynamo does not replace engines; it orchestrates engines (typically TensorRT-LLM or vLLM) at fleet scale.
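
The idea behind KV cache-aware routing can be illustrated with a small sketch. This is not Dynamo's API or algorithm: the worker dictionary layout, `shared_prefix_len`, and `route` are all hypothetical, and the policy (longest cached prefix first, then lowest load) is the simplest one that shows the principle.

```python
from typing import Dict, List


def shared_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading token ids that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request_tokens: List[int], workers: Dict[str, dict]) -> str:
    """Pick the worker that can reuse the most cached prefix.

    Ties are broken in favor of the worker with fewer active requests.
    """
    def score(name: str):
        worker = workers[name]
        best_hit = max(
            (shared_prefix_len(request_tokens, p) for p in worker["cached_prefixes"]),
            default=0,
        )
        return (best_hit, -worker["active_requests"])

    return max(workers, key=score)


if __name__ == "__main__":
    fleet = {
        "worker-0": {"cached_prefixes": [[1, 2, 3, 4]], "active_requests": 5},
        "worker-1": {"cached_prefixes": [[1, 2, 9]], "active_requests": 0},
    }
    print(route([1, 2, 3, 7], fleet))  # worker-0: three cached tokens beat two
    print(route([5, 6], fleet))        # worker-1: no cache hit, so load decides
```

A production router would weigh cache hits against queue depth rather than ranking them strictly, since a busy worker with a warm cache can still be the slower choice.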

The one-page decision guide: laptop or general portability points to llama.cpp with GGUF; Mac points to MLX; single RTX card points to ExLlamaV2; two to four NVIDIA cards point to ExLlamaV3; general production serving points to vLLM; long-context or MoE or complex routing points to SGLang; max NVIDIA performance points to TensorRT-LLM; cluster orchestration points to Dynamo. None of these are absolute. A benchmark on the actual workload is the deciding step, not the recommendation list.
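
Since the benchmark is the deciding step, here is a minimal sketch of what to measure. It takes any iterable that yields tokens as they arrive and reports time-to-first-token and average time-per-output-token. The function name `measure_stream` and the injectable `clock` parameter are assumptions made for illustration; wiring the iterable to a particular engine's streaming client is left out.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tpot_seconds, token_count) for one request.

    Call this right before the request is sent, passing the lazy token
    stream, so that the start timestamp includes prefill time.
    """
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    # Average gap between consecutive tokens after the first one.
    tpot = (last_token_at - first_token_at) / (count - 1) if count > 1 else 0.0
    return ttft, tpot, count


if __name__ == "__main__":
    def fake_stream():
        """Stand-in for an engine's streaming response."""
        time.sleep(0.05)  # pretend prefill
        for token in ["Hello", ",", " world"]:
            time.sleep(0.01)  # pretend decode step
            yield token

    ttft, tpot, n = measure_stream(fake_stream())
    print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.0f} ms over {n} tokens")
```

Run it against each candidate engine with prompts and output lengths drawn from the real workload, at the concurrency expected in production, and compare distributions rather than single runs.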

The bottleneck taxonomy is what separates engines worth choosing from each other. Memory bandwidth dominates decode on every modern engine, so engines that minimize unnecessary reads (KV cache layout, dequantization strategy, weight-streaming order) win on latency. KV cache growth dominates memory pressure at long contexts; engines that page or share or quantize the cache (vLLM’s PagedAttention, SGLang’s prefix caching) serve longer contexts on the same hardware. Interconnect bandwidth (NVLink, NVSwitch, PCIe) dominates multi-GPU serving; tensor parallelism needs all-reduce between every layer, so weak interconnect actively hurts. Scheduler quality (how the engine picks which requests to batch, how it manages preemption and chunked prefill) determines whether the engine gracefully degrades under load or thrashes. Runtime overhead (Python hot paths, framework dispatch, CUDA Graph use) sets the floor on small-model latency.
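
KV cache growth is easy to quantify. The sketch below uses the standard per-sequence size formula (one key and one value vector per token, per layer, per KV head) and counts the fixed-size blocks a paged cache would need. The model shape is a made-up example, and the block size of 16 tokens is an illustrative assumption rather than a setting taken from any engine's configuration.

```python
import math


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_element: int = 2,
) -> int:
    """KV cache size in bytes for one sequence.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


def paged_blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Fixed-size cache blocks required to hold seq_len tokens."""
    return math.ceil(seq_len / block_size)


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
    size = kv_cache_bytes(32, 8, 128, seq_len=8192)
    print(size / 2**30, "GiB per 8192-token sequence")  # 1.0 GiB
    print(paged_blocks_needed(8192), "blocks of 16 tokens")  # 512
```

At this shape every concurrent 8192-token sequence costs another gibibyte, which is why paging, prefix sharing, and cache quantization decide how many long requests fit on one card.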

The reason engines do not converge is that these bottlenecks pull in different directions. An engine optimized for single-stream interactive use will spend its complexity budget on CUDA Graphs and aggressive kernel fusion; an engine optimized for many-concurrent serving will spend it on scheduling and KV cache management. The same team cannot do both equally well, and the projects with the most mindshare specialize: vLLM is the throughput specialist, SGLang is the flexibility specialist, TensorRT-LLM is the NVIDIA-max specialist, llama.cpp is the portability specialist. Pick by workload shape, not by Twitter sentiment.