06 Prefill and decode

core

Two regimes with different costs. Prefill processes the prompt (time to first token); decode generates one token at a time (streaming speed).

Adapted from Ahmad Osman, "LLMs 101: A Practical Guide (2026)".

LLM inference has two performance regimes, and they behave nothing alike. Prefill, the first phase, processes the prompt you gave the model and builds the initial KV cache. If you paste a 20,000-token document, the model must process all 20,000 tokens before it can produce the first answer token. Prefill is parallel across prompt tokens, so accelerators handle it efficiently, but it is compute-bound and can still be expensive. The wait for the first token to appear is usually prefill time.

Decode, the second phase, generates new tokens one at a time from the KV cache. Each new token depends on the sequence so far, so decode is sequential. It is memory-bandwidth-bound: throughput tracks memory bandwidth more than peak compute. This is where the streaming typing effect comes from, and it is usually the phase that decides whether a model feels fast or slow.
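The two phases can be sketched as a generation loop. This is a minimal illustration, not a real inference engine: `ToyModel`, its `forward` method, and the fake "next token" rule are all hypothetical stand-ins that count forward passes instead of doing any math.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LLM: counts passes, does no real math."""
    forward_passes: int = 0
    kv_cache: list = field(default_factory=list)  # one entry per cached token

    def forward(self, tokens):
        # One forward pass over `tokens`; every token seen is added to the cache.
        self.forward_passes += 1
        self.kv_cache.extend(tokens)
        # Fake "next token": a deterministic function of the cache length.
        return len(self.kv_cache) % 100


def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: the whole prompt goes through in a single pass.
    next_token = model.forward(prompt_tokens)
    output = [next_token]
    # Decode: one pass per additional token, each fed the previous token.
    for _ in range(max_new_tokens - 1):
        next_token = model.forward([next_token])
        output.append(next_token)
    return output


model = ToyModel()
tokens = generate(model, prompt_tokens=list(range(20)), max_new_tokens=5)
print(model.forward_passes, len(model.kv_cache), len(tokens))
```

The prompt costs one large pass no matter how long it is, while every extra output token costs one more small pass, and the cache grows by one entry each time.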

The shorthand: long prompts punish prefill, long answers punish decode, and long conversations punish both, because the KV cache grows the whole time. The KV cache stores the key and value vectors of previously processed tokens and reuses them at each generation step, so an autoregressive model does not recompute attention over the entire prefix. In a chat session, every turn adds to the cache. If you let a conversation run to 16,000 tokens, the cache for all 16,000 tokens has to stay in memory and be read for every new token generated. That is why chat interfaces that keep unbounded history eventually slow down or crash.
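To put a rough number on that growth, here is a back-of-the-envelope sketch. It assumes a standard attention layout in which each layer caches one key vector and one value vector per KV head for every token. The model shape in `SHAPE` is hypothetical and chosen only for illustration; real models differ.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 counts keys and values: each layer stores one key vector
    and one value vector per KV head for every cached token.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# Hypothetical model shape (16-bit values), for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for n_tokens in (1_000, 4_000, 16_000):
    gib = kv_cache_bytes(n_tokens, **SHAPE) / 2**30
    print(f"{n_tokens:>6} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumed dimensions each cached token costs 128 KiB, so a 16,000-token conversation holds roughly 1.95 GiB of cache for a single sequence, on top of the model weights.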

This split is worth internalizing before you care about tokens per second, because the two regimes are bound by different hardware limits. Prefill leans on compute; decode leans on memory bandwidth. The self-host track works through why two cards with the same memory capacity can have very different decode speeds.
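The compute-versus-bandwidth split can be made concrete with two rough estimates. Both are simplifying assumptions, not measurements: prefill is modeled as about 2 floating-point operations per parameter per prompt token, and decode at batch size 1 is modeled as reading all weights plus the KV cache from memory once per generated token. Real systems fall short of both bounds. The function names, the model size, and the two cards below are hypothetical.

```python
def prefill_seconds(n_params, prompt_tokens, peak_flops_per_s):
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    return 2 * n_params * prompt_tokens / peak_flops_per_s


def decode_tokens_per_s(weight_bytes, kv_bytes, mem_bandwidth_bytes_per_s):
    """Bandwidth-bound estimate: each new token reads weights and KV cache once."""
    return mem_bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


# Hypothetical 7B-parameter model stored in 16 bits (about 14 GB of weights),
# with about 2 GB of KV cache already built up.
WEIGHT_BYTES = 14e9
KV_BYTES = 2e9

# Two hypothetical cards with the same memory capacity but different bandwidth.
cards = {"card A": 400e9, "card B": 1000e9}  # bytes per second

for name, bandwidth in cards.items():
    rate = decode_tokens_per_s(WEIGHT_BYTES, KV_BYTES, bandwidth)
    print(f"{name}: at most ~{rate:.1f} tokens/s in decode")

# Prefill for a 20,000-token prompt on an assumed 100 TFLOP/s of usable compute.
print(f"prefill: at least ~{prefill_seconds(7e9, 20_000, 100e12):.1f} s")
```

In this sketch the two cards hold the same model, yet the ceiling on decode speed differs by 2.5x because only bandwidth changed. The prefill estimate does not depend on bandwidth at all, which is the sense in which the two regimes hit different hardware limits.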