Glossary

decode

The second phase of LLM inference, generating one token at a time from the KV cache. Memory-bandwidth-bound; throughput tracks memory bandwidth more than peak compute.

Runtime also: Silicon also: Infrastructure aka decode phase, autoregressive decode, token generation

The second half of LLM inference. After prefillruntimeThe first phase of LLM inference, processing the input prompt and building the initial KV cache. Compute-bound and parallel across prompt tokens. Open full entry has built the initial KV cacheruntimeThe stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix. Open full entry , the model generates one token per step. Each step reads the model weights and the entire KV cache to produce the next token, then writes that token’s key and value back into the cache and repeats. The arithmetic intensity per step is low; the work is dominated by memory reads.

That memory-bandwidth dependence is the reason decode throughput tracks memory bandwidth much more closely than it tracks peak compute (FLOPs). An H100 with 3.35 TB/s of HBM bandwidth decodes faster than a Mac Studio M3 Ultra with 819 GB/s of unified memory, even though both can fit the same model. Same model, same parameter count, same weight format; the H100 simply moves bytes faster between memory and the compute units.

The decode bottleneck shapes most production-serving optimizations. Larger batches amortize the same weight-memory reads across more generated tokens. PagedAttentionruntimeAn attention implementation that manages the KV cache in fixed-size blocks like operating-system virtual memory, eliminating fragmentation and letting many concurrent requests share GPU memory efficiently. Open full entry reduces KV-cache fragmentation so more requests fit. speculative decodingruntimeAn inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss. Open full entry drafts cheap tokens with a small model and verifies them in parallel with the big one, raising effective throughput. KV-cache quantization halves the bytes read per step at small quality cost. All of these are bandwidth-savings techniques, dressed up in different clothes.

Sources

Mentioned in

Back to glossary