The Open-Source AI Stack

04 Attention

core

Attention decides which earlier tokens matter. The variant chosen (MHA, MQA, GQA, MLA) sets the KV-cache bill.

Adapted from Ahmad Osman, "LLMs 101: A Practical Guide (2026)".

Attention is how a token decides which earlier tokens matter for the next prediction: each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. It is also one of the reasons local inference is so memory-sensitive.
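The weighted average described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `attend` and `causal_attention` are hypothetical, the learned projections that produce query, key, and value vectors are omitted, and the vectors are passed in directly as lists of floats.

```python
import math

def attend(query, keys, values):
    """Return (weights, output) for one query over the given keys/values.

    weights = softmax(query . key / sqrt(d)); output = weighted average of values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out

def causal_attention(queries, keys, values):
    """Token t only sees positions 0..t (itself and everything earlier)."""
    outputs = []
    for t, query in enumerate(queries):
        _, out = attend(query, keys[: t + 1], values[: t + 1])
        outputs.append(out)
    return outputs
```

Note that every generation step needs the keys and values of all earlier positions, which is why runtimes keep them around rather than recomputing them.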

Classic MHA (multi-head attention) is the standard transformer design: each layer has N independent query, key, and value heads, so it stores separate key/value state for many heads. That gives the model flexibility, but it makes the KV cache large. The KV cache is the set of key and value vectors stored for previously processed tokens and reused at each generation step, so an autoregressive model does not recompute attention over the entire prefix. MHA is foundational but memory-heavy as context windows grow, so modern local models often use more efficient designs. MQA (multi-query attention) has all N query heads share a single key/value head: it minimizes KV-cache memory, at a modest quality cost. GQA (grouped-query attention) lets groups of query heads share key/value heads, shrinking the cache with little quality cost compared to full MHA; it is the common middle ground in current local models. Some recent models use MLA (multi-head latent attention), introduced in DeepSeek-V2, which compresses the key/value representation through a learned low-rank projection instead of dropping heads.
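To make the head-sharing concrete, here is a small sketch of which key/value head each query head reads from, and how many values one token adds to the cache per layer. The head counts (32 query heads, 8 KV heads for GQA) and the head dimension of 128 are illustrative assumptions, not figures from any specific model, and the helper names are hypothetical. MLA is left out because its cache holds a compressed latent vector rather than per-head keys and values.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the key/value head that a given query head shares."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads   # query heads per KV head
    return q_head // group_size

def kv_values_per_token_per_layer(n_kv_heads, head_dim):
    """One key vector plus one value vector for every KV head."""
    return 2 * n_kv_heads * head_dim

N_Q_HEADS = 32   # assumed for illustration
HEAD_DIM = 128   # assumed for illustration
VARIANTS = {"MHA": 32, "GQA": 8, "MQA": 1}   # number of KV heads per variant

for name, n_kv_heads in VARIANTS.items():
    mapping = [kv_head_for_query_head(h, N_Q_HEADS, n_kv_heads) for h in range(8)]
    size = kv_values_per_token_per_layer(n_kv_heads, HEAD_DIM)
    print(f"{name}: {size} cached values per token per layer, "
          f"first 8 query heads -> KV heads {mapping}")
```

Under these assumptions the per-token cache shrinks in direct proportion to the KV-head count: GQA with 8 KV heads stores a quarter of what MHA stores, and MQA stores 1/32.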

Kernels matter as much as the variant. FlashAttention and SDPA-style implementations reduce attention memory traffic and keep the accelerator busier. FlashAttention is an exact attention algorithm: it reorders the computation to avoid materializing the full attention matrix in GPU HBM, giving a 2 to 4 times speedup with no quality loss. A runtime with good attention kernels can be much faster than one without, on the same model and the same hardware.
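The reordering idea can be illustrated on the CPU. The sketch below is not the FlashAttention kernel itself; it only shows the running-softmax trick that makes block-by-block processing exact, for a single query, in plain Python. The function names and the `block_size` parameter are hypothetical. The naive version builds the full row of scores first, while the blockwise version keeps only a running maximum, a running sum, and an accumulator.

```python
import math

def naive_attention(query, keys, values):
    """Materializes every score before normalizing."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def blockwise_attention(query, keys, values, block_size=2):
    """Processes keys/values one block at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for score, value in zip(scores, block_values):
            weight = math.exp(score - new_max)
            running_sum += weight
            acc = [a + weight * x for a, x in zip(acc, value)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding, which is the sense in which the blockwise approach is exact rather than approximate.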

This is why two 7B models can behave very differently at long context. Parameter count is not the whole story. A 7B MHA model at 128K context can exhaust a 24 GB GPU, while a 7B GQA model with the same advertised context may fit with room to spare. When comparing models for long-context work, look at attention type, KV-head count, context length, and runtime support, not just the parameter count. The memory those choices drive is the KV cache, which the next module makes concrete.
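A rough back-of-the-envelope calculation shows the gap. The shapes here are assumptions chosen to resemble a typical 7B model (32 layers, 32 query heads, head dimension 128, 16-bit cache values, 8 KV heads for the GQA case, and 128K taken as 131,072 tokens); real models differ, so check the actual config before relying on the numbers. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Keys plus values, for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

GIB = 1024 ** 3
CONTEXT = 128 * 1024   # assumed meaning of "128K"

# Assumed 7B-like shapes; only the KV-head count differs between the two.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=CONTEXT)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=CONTEXT)

print(f"MHA KV cache at full context: {mha_bytes / GIB:.1f} GiB")  # 64.0 GiB
print(f"GQA KV cache at full context: {gqa_bytes / GIB:.1f} GiB")  # 16.0 GiB
```

Under these assumptions the MHA cache alone is far larger than a 24 GB card, while the GQA cache is a quarter of that. Model weights and activations come on top of the cache, so whether a given setup actually fits also depends on weight precision and the runtime.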