Glossary
MHA
Standard transformer attention where each layer has N independent query, key, and value heads; foundational but memory-heavy as context windows grow.
The attention mechanism introduced in the original 2017 transformerruntimeThe neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM. Open full entry paper. Each layer has N independent attention heads, and each head has its own query, key, and value projection. The N heads run in parallel, their outputs concatenate, and a final projection mixes them back to the model’s hidden dimension. The “multi-head” part lets different heads specialize on different relational patterns (positional, syntactic, semantic).
The cost dominant in deployment is the KV cacheruntimeThe stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix.
Open full entry : at decode time the
model has to remember every prior token’s key and value across every
head, which is N_heads x context_length x hidden_dim floats per
layer. For long contexts this dwarfs the model parameters themselves.
A Llama 2 13B at 8K context (40 attention heads, all-MHA) carries
several GB of KV cache per concurrent request before any quantization.
That memory bottleneck is why almost no current production model uses plain MHA. MQAruntimeAn attention variant where N query heads share a single key and value head, minimizing KV cache memory at a modest quality cost compared to multi-head attention. Open full entry (one shared KV head) and GQAruntimeAn attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention. Open full entry (N query heads share K KV heads) both shrink the cache by 4x to 8x at small quality cost, and MLAruntimeAn attention variant introduced in DeepSeek-V2 that compresses keys and values through a learned low-rank projection, dramatically shrinking the KV cache. Open full entry shrinks it further via a learned low-rank projection. Llama 2 7B and 13B were the last widely-used releases to use plain MHA; the Llama 2 70B already adopted GQA, and from LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry 3 onward every model in this catalog uses GQA, MLA, or a hybrid.