Glossary

KV cache

The stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix.

Runtime also: Silicon aka kv-cache, key-value cache

During autoregressive generation, each new token’s attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry requires the key and value projections of every prior token. Recomputing them on every step would scale quadratically with sequence length. The KV cache stores them after the first forward pass and looks them up on every subsequent step, turning each decode step into a linear operation in prior length.

The cost is memory. KV cache size scales as 2 × num_layers × num_kv_heads × head_dim × sequence_length × bytes_per_value. For LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry -3-70B at 8K context on bf16 (80 layers, 8 grouped-query KV heads, 128 head_dim, 2 bytes/value), that works out to roughly 2.6 GB per concurrent request. The KV cache is why long-context inferenceruntimeRunning a trained model to produce outputs (tokens, images, embeddings) from inputs at serving time, as distinct from the gradient updates of training. Open full entry is expensive even though the model weights themselves do not change.

Modern runtimes treat KV cache as the resource that constrains throughput more than weights do. PagedAttention manages it like virtual memory; GQA reduces it by sharing KV heads across groups; quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry -of-KV (FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry or INT8 KV) cuts it in half or quarter; offloading to host RAM extends it at the cost of latencycomputeThe time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric. Open full entry .

Sources

Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)

Mentioned in

Back to glossary