The Open-Source AI Stack
RSS

Glossary

MQA

An attention variant where N query heads share a single key and value head, minimizing KV cache memory at a modest quality cost compared to multi-head attention.

Runtime also: Weights aka multi-query attention, multi-query-attention, multiquery attention

A simplification of MHAruntimeStandard transformer attention where each layer has N independent query, key, and value heads; foundational but memory-heavy as context windows grow. Open full entry where every query head shares a single key and value head. The cost saving is dramatic: the KV cacheruntimeThe stored key and value vectors from previously processed tokens, reused at each generation step so an autoregressive model does not recompute attention over the entire prefix. Open full entry shrinks by a factor of N_heads (typically 32 to 128 times) compared to MHA, which directly translates to higher batch sizes or longer contexts at the same memory budget. The quality cost is small but real, and the technique was less popular until GQAruntimeAn attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention. Open full entry arrived as a middle ground.

Noam Shazeer (a Google researcher, later at Character AI) proposed MQA in 2019 with the title “One Write-Head is All You Need.” Falcon 180B (Sep 2023, TII) was the most prominent early open weightsweightsA model release that publishes the trained parameters under some downloadable license, distinct from "open source" which (per OSAID) also requires data and training-code openness. Open full entry adoption. PaLM also used MQA. Most current frontier models use GQA or MLA instead, both of which preserve more quality while keeping most of the memory win.

The architecture-variant taxonomy on this catalog distinguishes MQA, GQA, MLA, and full MHA so readers can see at a glance which family of trade-offs each checkpoint made. Pure MQA is rare in 2026 models because GQA is almost always a better Pareto point on quality vs memory.

Sources

Mentioned in

Back to glossary