The Open-Source AI Stack

Glossary

prefill

The first phase of LLM inference, processing the input prompt and building the initial KV cache. Compute-bound and parallel across prompt tokens.

Category: Runtime (also: Silicon, Infrastructure). Also known as: prefill phase, prompt processing.

The first half of LLM inference. The prompt comes in as a sequence of tokens; the model runs a forward pass over every prompt token at once, producing the initial KV cache that the decode phase will read from one token at a time. Because every prompt token is processed in parallel, prefill is dominated by matrix-matrix multiplications and saturates GPU compute, not memory bandwidth.
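The two phases can be sketched with a toy single-head attention layer. This is a minimal illustration, not how any particular engine is implemented: the function names (`attend`, `prefill`, `decode_step`) are hypothetical, and keys, values, and queries are taken to be the token embeddings themselves, where a real model would apply learned projection matrices. The loop inside `prefill` stands in for what would be one batched matrix-matrix multiply on a GPU.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def prefill(prompt_embeddings):
    """Process the whole prompt and return (kv_cache, output of last token).

    Toy assumption: identity projections, so each embedding serves as its
    own key, value, and query.
    """
    kv_cache = {"keys": [], "values": []}
    last_output = None
    for x in prompt_embeddings:
        kv_cache["keys"].append(x)
        kv_cache["values"].append(x)
        # Causal attention: each token sees itself and everything before it.
        last_output = attend(x, kv_cache["keys"], kv_cache["values"])
    return kv_cache, last_output

def decode_step(kv_cache, x):
    """Process one new token, reusing the cache instead of the full prefix."""
    kv_cache["keys"].append(x)
    kv_cache["values"].append(x)
    return attend(x, kv_cache["keys"], kv_cache["values"])

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, 0.5]

cache, _ = prefill(prompt)
decoded = decode_step(cache, new_token)

# Recomputing from scratch over prompt + new token gives the same result.
_, recomputed = prefill(prompt + [new_token])
print(len(cache["keys"]))
print(all(abs(a - b) < 1e-12 for a, b in zip(decoded, recomputed)))
```

The final comparison is the point of the cache: one decode step over the cached prefix matches a full recomputation while touching only the new token's query.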

The opposite shape, decode, generates one token per step and reads the entire KV cache plus the model weights each time. Decode is therefore memory-bandwidth-bound. This split between compute-bound prefill and bandwidth-bound decode is why a single workload can look completely different depending on whether the prompt is long and the answer short (prefill dominates) or the prompt is short and the answer long (decode dominates).
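A rough roofline-style estimate makes the contrast concrete. The sketch below is a back-of-the-envelope model under stated assumptions: it counts only weight traffic (ignoring activations and KV cache reads), assumes 2-byte weights, serves a single request, and uses hypothetical accelerator figures of 400 TFLOP/s and 2 TB/s that are illustrative rather than taken from any specific device.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one weight-matrix multiply.

    Applying a d_in x d_out matrix to N tokens costs about
    2 * N * d_in * d_out FLOPs and reads d_in * d_out * bytes_per_weight
    bytes of weights, so the matrix dimensions cancel out.
    """
    return 2 * tokens_per_pass / bytes_per_weight

def bound(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_weight=2):
    """Classify a pass by comparing its intensity to the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    if intensity >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"

# Hypothetical accelerator, for illustration only.
PEAK_FLOPS = 400e12      # 400 TFLOP/s
MEM_BANDWIDTH = 2e12     # 2 TB/s, so the ridge point is 200 FLOPs per byte

prefill_regime = bound(2048, PEAK_FLOPS, MEM_BANDWIDTH)  # 2048-token prompt
decode_regime = bound(1, PEAK_FLOPS, MEM_BANDWIDTH)      # one token per step

print(prefill_regime)  # compute-bound
print(decode_regime)   # memory-bandwidth-bound
```

Under these assumptions a 2048-token prefill performs 2048 FLOPs per byte of weights read, well above the ridge point, while a single-request decode step performs about 1. Batching many decode requests together raises that figure, which is one reason serving engines batch aggressively.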

Production serving stacks treat prefill and decode as separate operations. PagedAttention manages the KV cache in fixed-size blocks. Continuous batching reduces head-of-line blocking. Chunked prefill breaks long prompts into pieces so that one long prompt does not starve the decode steps of other requests. Disaggregated serving (SGLang, TensorRT-LLM) separates prefill workers from decode workers entirely, passing the KV cache between them so neither phase blocks the other.
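The chunked prefill idea can be sketched as a per-step token budget. This is a simplified planner, not the scheduler of any real engine: `plan_steps` and its parameters are hypothetical names, the number of decoding requests is held constant, and prompts are served first-come, first-served. Each step reserves one token for every decoding request, then spends what is left of the budget on pieces of waiting prompts.

```python
from collections import deque

def plan_steps(prompt_lengths, active_decodes, token_budget):
    """Plan engine steps that mix decode tokens with prefill chunks.

    Returns a list of steps. Each step is a dict with 'decode_tokens'
    and 'prefill_chunks', a list of (request_id, num_tokens) pairs.
    """
    if token_budget <= active_decodes:
        raise ValueError("token_budget must leave room for prefill tokens")

    # (request_id, prompt tokens still to prefill), in arrival order.
    waiting = deque(enumerate(prompt_lengths))
    steps = []
    while waiting:
        # Decode goes first: one token per request that is already decoding.
        budget = token_budget - active_decodes
        chunks = []
        while waiting and budget > 0:
            request_id, remaining = waiting.popleft()
            take = min(remaining, budget)
            chunks.append((request_id, take))
            budget -= take
            if remaining > take:
                # Prompt not finished: it resumes at the front next step.
                waiting.appendleft((request_id, remaining - take))
        steps.append({"decode_tokens": active_decodes,
                      "prefill_chunks": chunks})
    return steps

# Two waiting prompts (10 and 3 tokens), 2 requests already decoding,
# and room for 6 tokens per step.
steps = plan_steps([10, 3], active_decodes=2, token_budget=6)
for step in steps:
    print(step)
```

With these inputs the 10-token prompt is spread over three steps instead of occupying one large step, and the two decoding requests still emit a token in every step.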

