Glossary

GQA

An attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention.

Category: Runtime. Also: Training, Weights. Also known as: grouped query attention, grouped-query attention

A middle ground between multi-head attention (one set of keys and values per query head, expensive KV cache) and multi-query attention (one set of keys and values shared across all query heads, smallest cache but quality drops). GQA groups query heads and shares KV pairs within each group. Typical configurations: 8 query heads sharing 2 KV heads (4-way grouping) or 32 query heads sharing 8 KV heads (also 4-way grouping).
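The grouping can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `kv_head_for` and `gqa_decode_step` are hypothetical, and the assumption that consecutive query heads map to the same KV head is one common convention rather than something this entry specifies. Setting the number of KV heads equal to the number of query heads recovers multi-head attention, and setting it to 1 recovers multi-query attention.

```python
import math


def kv_head_for(q_idx, n_q_heads, n_kv_heads):
    """Index of the KV head used by query head q_idx (assumed: consecutive grouping)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("number of query heads must be a multiple of KV heads")
    group_size = n_q_heads // n_kv_heads
    return q_idx // group_size


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_decode_step(queries, keys, values):
    """Grouped-query attention for one new token.

    queries: one query vector per query head.
    keys:    keys[h][t] is the cached key of KV head h at position t.
    values:  values[h][t] is the cached value of KV head h at position t.
    Returns one output vector per query head.
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    outputs = []
    for q_idx, q in enumerate(queries):
        h = kv_head_for(q_idx, n_q_heads, n_kv_heads)  # shared within the group
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[h]]
        weights = softmax(scores)
        head_dim = len(values[h][0])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values[h])) for i in range(head_dim)]
        )
    return outputs
```

Only the `keys` and `values` lists need to be cached between steps, which is why their first dimension (the number of KV heads) is what determines cache size.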

Quality loss is small in practice. KV cache memory drops in proportion to the grouping factor, which directly enables larger batches or longer contexts on the same hardware.
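To make the proportionality concrete, here is a rough back-of-the-envelope calculation. The model shape (32 layers, 32 query heads, head dimension 128, 8192-token context, 16-bit cache entries) is a hypothetical configuration chosen for round numbers, not a description of any particular model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector per layer, KV head, and token."""
    keys_and_values = 2
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, assumed for illustration only.
shape = dict(n_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 32 query heads sharing 8 KV heads
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

Under these assumptions the cache per sequence is 4 GiB for multi-head attention and 1 GiB for the 4-way grouped variant, so the same memory budget holds four times as many sequences or a context four times as long.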

GQA is the default in most modern open-weight families: Llama 3, Llama 4, Qwen, Mistral, Gemma, and DeepSeek base variants all use it. The related multi-head latent attention (MLA) in DeepSeek-V3 takes the idea further with a learned low-rank projection of the KV state.
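The low-rank idea can be illustrated with a toy example. This is a deliberately simplified sketch, not the DeepSeek implementation: it uses random untrained matrices, tiny made-up dimensions, and omits the per-head structure and positional-encoding handling of the real design. All names (`w_down`, `w_up_k`, `w_up_v`, `latent_dim`) are hypothetical.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights; real projections are trained, not random."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, latent_dim, kv_dim = 16, 4, 16  # toy sizes, assumed for illustration

w_down = random_matrix(latent_dim, d_model, rng)  # compress hidden state
w_up_k = random_matrix(kv_dim, latent_dim, rng)   # expand latent to keys
w_up_v = random_matrix(kv_dim, latent_dim, rng)   # expand latent to values

hidden = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

latent = matvec(w_down, hidden)  # only this small vector is cached per token
key = matvec(w_up_k, latent)     # rebuilt from the latent when attention runs
value = matvec(w_up_v, latent)

floats_cached_latent = len(latent)
floats_cached_full = len(key) + len(value)
print(floats_cached_latent, "cached floats instead of", floats_cached_full)
```

Where GQA shrinks the cache by storing fewer KV heads, this approach stores a compressed vector per token and reconstructs keys and values from it.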

Sources

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)
