06 Production serving


Prefill, decode, batching, scheduling, parallelism. The system around the model.

Production serving is the engineering around the model that turns a checkpoint into a service. Six concerns recur: phase separation, KV cache management, batching, parallelism, speculative decoding, and disaggregation. Each is implemented differently by each serving engine; together they determine whether a deployment hits its SLA at target concurrency, or thrashes under load.

Prefill and decode are different workloads that share a model. Prefill processes the entire input prompt in one pass, runs the matrix-multiply units at saturation, and is compute-bound on most hardware. Decode generates one token at a time, streams the activated weights through the compute units at every step, and is memory-bandwidth-bound. Time-to-first-token (TTFT) is dominated by prefill; time-per-output-token (TPOT) is dominated by decode. Serving systems that conflate the two phases miss optimizations available only when each phase is scheduled and parallelized on its own terms.
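The compute-bound versus bandwidth-bound split can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not a profiler: it assumes a dense model costing about 2 FLOPs per parameter per token, assumes the weights are read from memory once per forward pass, and ignores attention and KV-cache traffic. The function names and the hardware figures in the usage note are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, weights read once per pass.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a pass by comparing its intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"
```

With an illustrative accelerator at 1e15 FLOP/s and 3e12 bytes/s, `bound(2048, 1e15, 3e12)` classifies a 2048-token prefill as compute-bound, while `bound(1, 1e15, 3e12)` classifies a single-sequence decode step as memory-bandwidth-bound. Batching decode raises `tokens_per_pass`, which is why batch size matters so much for decode throughput.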

PagedAttention (Kwon et al., SOSP 2023) reorganized KV cache management around fixed-size blocks tracked by per-request page tables, the way an operating system manages virtual memory. The result is near-zero memory fragmentation and the ability to share identical prefix blocks across requests. Prefix caching falls out naturally: if two requests share the first 2000 tokens of prompt, the KV blocks for those tokens are computed once and reused. For agents with repeated system prompts or RAG pipelines with shared retrieved context, prefix caching can cut prefill cost by an order of magnitude or more when the shared prefix makes up most of the prompt.
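The block-table idea can be sketched in a few lines. This is an illustrative toy, not the data structure of any particular engine: `BlockPool`, `BLOCK_SIZE`, and the keying scheme (parent block id plus the block's token ids) are assumptions chosen for clarity, and the sketch tracks only block ids, not the key and value tensors themselves.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


@dataclass
class BlockPool:
    next_id: int = 0
    by_prefix: dict = field(default_factory=dict)  # (parent id, tokens) -> block id
    refcount: dict = field(default_factory=dict)   # block id -> number of users

    def _new_block(self) -> int:
        block = self.next_id
        self.next_id += 1
        self.refcount[block] = 0
        return block

    def allocate(self, token_ids: list) -> list:
        """Build a request's block table, reusing full blocks with an identical prefix."""
        table, parent = [], -1
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            # The parent id stands in for the whole preceding prefix.
            key = (parent, tuple(token_ids[start:start + BLOCK_SIZE]))
            block = self.by_prefix.get(key)
            if block is None:  # cache miss: this block would need prefill
                block = self._new_block()
                self.by_prefix[key] = block
            self.refcount[block] += 1
            table.append(block)
            parent = block
        if full < len(token_ids):  # partial tail block stays private
            block = self._new_block()
            self.refcount[block] = 1
            table.append(block)
        return table
```

Two prompts that share their first 32 tokens end up with the same first two block ids in their tables, so only the differing tail needs fresh prefill. A real engine also needs eviction and copy-on-write for shared blocks, which this sketch omits.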

Continuous batching lets the serving engine add new requests to a running batch mid-decode. The alternative (static batching, where the engine waits for a full batch before starting and finishes all requests together) wastes GPU cycles whenever requests finish at different times. Continuous batching is usually paired with chunked prefill (interleaving prefill chunks with decode steps from other requests) and KV quantization (storing the KV cache in lower precision). Together, these techniques are what take an engine from acceptable single-stream performance to acceptable high-concurrency performance.
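The difference between the two scheduling policies shows up even in a toy simulation. The sketch below counts decode steps only, under simplifying assumptions: every step costs the same regardless of batch occupancy, prefill is ignored, and all requests are queued up front. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens: list, batch_size: int) -> int:
    """Each batch runs until its longest request finishes; slots idle meanwhile."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: list, batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps
```

Running both functions on a workload with one long request and many short ones, for example `[8, 1, 1, 1, 1, 1, 1, 1]` with `batch_size=2`, shows the continuous scheduler finishing in fewer steps because short requests flow through the slot next to the long one.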

Parallelism strategies differ in what they need from the interconnect. Tensor parallelism splits each layer’s matrix multiplications across GPUs and requires all-reduces inside every transformer block (two per block in the common Megatron-style layout); it needs NVLink or NVSwitch to be efficient, and it actively underperforms on PCIe. Pipeline parallelism splits the model across stages with each GPU holding a slice of the layers; it tolerates slower interconnects but adds pipeline-bubble latency. Expert parallelism for mixture-of-experts models places different experts on different GPUs and requires all-to-all routing each step; it scales mixture-of-experts serving but demands fast interconnect. Data parallelism replicates the full model on each GPU and partitions the request stream; it works when the model fits on one GPU and the bottleneck is throughput. Context parallelism splits very long sequences across GPUs at attention time.
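To see why tensor parallelism is sensitive to link speed, it helps to estimate the all-reduce traffic of one forward step. This is a back-of-envelope sketch under stated assumptions: a ring all-reduce, where each GPU sends roughly 2(n-1)/n times the message size; two all-reduces per layer (an assumption matching the common Megatron-style layout); and a bandwidth-only cost model that ignores per-collective launch latency, which often dominates for small decode batches. The function and parameter names are hypothetical.

```python
def ring_allreduce_bytes_per_gpu(message_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce of one message."""
    if n_gpus < 2:
        return 0.0  # nothing to communicate on a single GPU
    return 2.0 * (n_gpus - 1) / n_gpus * message_bytes


def tp_comm_seconds_per_step(batch_tokens: int, hidden: int, n_layers: int,
                             n_gpus: int, link_bytes_per_s: float,
                             bytes_per_elem: int = 2,
                             allreduces_per_layer: int = 2) -> float:
    """Bandwidth-only estimate of tensor-parallel communication per forward step."""
    message = batch_tokens * hidden * bytes_per_elem  # one activation tensor
    per_layer = allreduces_per_layer * ring_allreduce_bytes_per_gpu(message, n_gpus)
    return n_layers * per_layer / link_bytes_per_s
```

Calling `tp_comm_seconds_per_step` twice with the same model shape but two different `link_bytes_per_s` values gives a quick sense of how much of a step's budget the interconnect consumes on each link type.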

Speculative decoding (Leviathan et al., 2023) attacks the decode bandwidth bound directly. A small draft model proposes several tokens cheaply; the target model verifies them in parallel, accepting the prefix that matches its own predictions. When draft and target agree often (typical for in-distribution prompts), speculative decoding yields 1.5x to 3x speedup on decode without changing the model’s output distribution. The gain depends on draft quality and acceptance rate; the cost is the additional draft-model memory and the verification step.
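The accept-the-matching-prefix step can be sketched for the greedy case. This is a simplified illustration: it assumes greedy decoding, where verification is an exact token comparison, whereas sampling-based speculative decoding uses a rejection-sampling rule to preserve the target distribution. `verify_greedy` and `expected_tokens_per_step` are hypothetical names; the second function uses the expected-length expression from Leviathan et al., which assumes each drafted token is accepted independently with probability `alpha`.

```python
def verify_greedy(draft_tokens: list, target_tokens: list) -> list:
    """Return the tokens emitted by one speculative step under greedy decoding.

    target_tokens holds the target model's greedy choice at each drafted
    position plus one extra position, so it has len(draft_tokens) + 1 entries.
    """
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            emitted.append(wanted)  # first mismatch: take the target's token and stop
            return emitted
        emitted.append(drafted)
    emitted.append(target_tokens[len(draft_tokens)])  # all accepted: bonus token
    return emitted


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens per target pass with acceptance rate alpha and gamma drafts."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

Every step emits at least one token, so a poor draft model degrades toward ordinary decoding plus the drafting overhead rather than producing wrong output.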

Disaggregated serving separates prefill workers from decode workers onto different machines or different GPUs within a machine. Prefill is compute-bound and benefits from high-FLOPS hardware; decode is memory-bandwidth-bound and benefits from high-bandwidth memory. The two phases have different optimal hardware shapes and different optimal batching strategies. SGLang and TensorRT-LLM both ship disaggregated modes. The KV cache produced by the prefill worker is transferred to the decode worker over a fast interconnect (NVLink, InfiniBand, or GPU-direct RDMA). NVIDIA Dynamo coordinates this across fleets, with KV-cache-aware request routing that places follow-up requests on workers that already hold the relevant cache blocks.
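The cost of moving the KV cache between workers can be estimated from the model shape. The sketch below assumes standard per-token storage of one key and one value vector per layer per KV head, and a bandwidth-only transfer model that ignores latency and protocol overhead. The function names and the model shape in the usage note are illustrative, not taken from any specific model.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence (factor 2 = keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def transfer_seconds(n_bytes: float, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of the prefill-to-decode handoff time."""
    return n_bytes / link_bytes_per_s
```

For a hypothetical model with 32 layers, 8 KV heads, and head dimension 128 at 16-bit precision, `kv_cache_bytes(4096, 32, 8, 128)` gives the handoff size for a 4096-token prompt, and `transfer_seconds` converts that into a delay that lands on time-to-first-token. Comparing that delay with the prefill time itself indicates whether disaggregation pays off on a given interconnect.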

The SLA metrics that matter for production: time-to-first-token (TTFT) at p50, p95, and p99 for the latency budget; time-per-output-token (TPOT) at the same percentiles for the streaming experience; end-to-end latency for non-streaming calls; throughput in tokens per second per GPU; cost per million tokens at sustained load. Averages hide the long tail; the p99 is where users notice. A serving deployment with excellent p50 numbers and bad p99 numbers will get complaints proportional to the p99 gap, not to the p50.
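How an average hides the tail can be seen by summarizing a latency sample directly. This is a minimal sketch using the nearest-rank percentile definition; production monitoring systems typically use histograms or sketches instead, and the helper names here are hypothetical.

```python
import math


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: list) -> dict:
    """Mean plus the percentiles usually tracked in a latency SLA."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }
```

Feeding `summarize` a sample of 98 requests at 100 ms and 2 requests at 2000 ms yields a mean and p50 that look healthy while the p99 reports the slow requests that users actually experience.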