05 The KV cache

core

The model's working memory. It keeps generation usable and it is the hidden memory bill that grows with every token.

Adapted from Ahmad Osman, "LLMs 101: A Practical Guide (2026)".

The KV cacheruntimeThe stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix. Open full entry is the model’s working memory during generation. It stores key/value attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry states for previous tokens so the model does not recompute the entire history from scratch on every new token. Without it, generation would be brutally inefficient. With it, generation is usable, but the cache costs memory.

That memory is proportional to tokens times layers times kv_heads times head_dim times precision times 2, where the factor of 2 covers keys and values. A useful rule of thumb for older Llama-like 7B MHAruntimeStandard transformer attention where each layer has N independent query, key, and value heads; foundational but memory-heavy as context windows grow. Open full entry models is roughly 0.5 MiB per token in FP16siliconA 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads. Open full entry KV cache. (Check it: 2 times 32 layers times 32 KV heads times 128 head_dim times 2 bytes is about 524,288 bytes, which is 0.5 MiB.) At that rate, 4K tokens costs about 2 GiB, and 32K tokens about 16 GiB of KV cache alone.

Newer GQAruntimeAn attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention. Open full entry and MQAruntimeAn attention variant where N query heads share a single key and value head, minimizing KV cache memory at a modest quality cost compared to multi-head attention. Open full entry models cut this substantially, because they store fewer key/value heads. Some runtimes also support FP8siliconAn 8-bit floating-point format used for AI inference and increasingly for training, halving memory and bandwidth versus FP16 with minimal quality loss on most workloads. Open full entry or INT8 KV cache, which is the practical compression floor for most local users in 2026. Going below 8-bit is not a default: research systems show that 2-bit to 4-bit KV can work with careful algorithms, calibration, and custom kernels, but that is not the same as flipping a low-bit KV toggle in a desktop runtime. Below 8-bit, benchmark hard, especially for coding, tool calls, JSON, and long-context retrieval where exact earlier tokens matter.

Two distinctions prevent confusion. KV-cache quantizationweightsStoring or computing model weights in lower-precision number formats (FP8, INT8, INT4) to reduce memory and bandwidth, accepting small quality loss. Open full entry shrinks the live context memory; weight quantization shrinks the model itself; speculative decodingruntimeAn inference acceleration technique where a small fast draft model proposes several tokens at once and the target model verifies them in parallel, giving 2-3x speedup with no quality loss. Open full entry attacks decode latency by drafting future tokens and verifying them. They are different levers, and the last one does not reduce the KV-cache bill at all.

This is why a model can fit at an empty prompt and then fail when you load a long document. The weights fit. The working memory did not. The hardware side of that math (how much memory a given card has to spend on weights versus cache) is worked through in the self-host track’s GPU memory math module.