Glossary

attention

The transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors.

Runtime (also: Training, Weights)

The central operation in a transformer. Each token produces three projections (query, key, value) from its hidden state. To compute the output for token i in a causal (decoder-only) model, take the dot product of its query against the key at every position up to and including i, scale the scores by 1/sqrt(d_k) (where d_k is the key dimension), softmax them, then take the weighted sum of the corresponding values. That gives the attention output for that position.
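The per-token computation described above can be sketched in plain Python. This is a minimal single-head illustration, not a production implementation: the function names `softmax` and `causal_attention` are hypothetical, the inputs are assumed to be already-projected query, key, and value vectors, and the 1/sqrt(d_k) scaling follows the scaled dot-product form in Vaswani et al. (2017).

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head causal attention over lists of vectors (illustrative only).

    queries, keys: lists of length-d_k vectors; values: list of length-d_v vectors.
    Token i attends to positions 0..i inclusive.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot product of query i against every key up to position i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the corresponding value vectors.
        out = [
            sum(weights[j] * values[j][c] for j in range(i + 1))
            for c in range(d_v)
        ]
        outputs.append(out)
    return outputs
```

With all-zero queries every score is equal, so each output is simply the running mean of the values seen so far, which makes a handy sanity check.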

Computationally, attention is the part of a transformer that scales quadratically with sequence length, which is why long-context inference is hard. Most modern runtime optimizations target attention specifically: FlashAttention reorders the memory access pattern to avoid materializing the full attention matrix, PagedAttention manages the KV cache like virtual memory, and GQA reduces the number of distinct key/value heads to shrink the cache.
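To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch. The helper names and the 2-byte score size (as for fp16) are assumptions for illustration; real kernels such as FlashAttention avoid storing this matrix at all.

```python
def causal_score_count(seq_len):
    # Number of query-key scores per head under a causal mask:
    # token i scores against positions 0..i, so 1 + 2 + ... + seq_len.
    return seq_len * (seq_len + 1) // 2

def full_matrix_bytes(seq_len, bytes_per_score=2):
    # Memory needed to materialize one full seq_len x seq_len score
    # matrix for a single head (assumes 2-byte scores by default).
    return seq_len * seq_len * bytes_per_score

# Doubling the sequence length quadruples the matrix size.
ratio = full_matrix_bytes(8192) / full_matrix_bytes(4096)
```

The quadratic term is per head and per layer, so the naive cost multiplies again by both of those counts.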

Variants worth knowing: multi-head attention (the original), multi-query attention (MQA, a single KV head shared by all query heads), grouped-query attention (GQA, shared KV across head groups), multi-head latent attention (MLA, introduced in DeepSeek-V2 and used by DeepSeek-V3, which compresses keys and values through a learned low-rank projection), and sliding-window attention (capped lookback for cheaper long-context).
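The effect of these variants on the KV cache can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) is an assumed example rather than any specific model. MLA is left out because its cache size depends on the latent dimension rather than on a head count.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 covers keys and values; one vector of head_dim
    # per KV head, per layer, per cached token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def sliding_window_positions(i, window):
    # Positions token i may attend to when lookback is capped at
    # `window` tokens (including itself).
    return list(range(max(0, i - window + 1), i + 1))

# Assumed example configuration, not taken from a real model card.
layers, head_dim, seq_len = 32, 128, 4096
mha = kv_cache_bytes(layers, 32, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)   # 4 query heads share each KV head
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)   # all query heads share one KV head
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, while sliding-window attention instead bounds how many cached positions each token reads.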

Sources

Attention Is All You Need (Vaswani et al., 2017)
