The Open-Source AI Stack

MLA

An attention variant introduced in DeepSeek-V2 that compresses keys and values through a learned low-rank projection, dramatically shrinking the KV cache.

Category: Runtime (also: Weights). Also known as multi-head latent attention.

The attention variant DeepSeek introduced in V2 and refined in V3. Instead of storing per-head key and value vectors directly, MLA learns a low-rank latent projection of the KV state and stores the smaller latent, decompressing on the fly during attention. The cache per token drops to roughly 6 percent of an equivalent multi-head attention setup.
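The compress-then-decompress idea can be sketched in a few lines of plain Python. This is a toy illustration, not DeepSeek's implementation: the sizes (`D_MODEL`, `N_HEADS`, `HEAD_DIM`, `D_LATENT`), the random weights, and the helper names are all hypothetical, and the sketch leaves out positional encoding and the attention softmax itself. The sizes are picked so that the latent is 6.25 percent of the per-token multi-head cache, close to the figure quoted above.

```python
import random

random.seed(0)

# Hypothetical toy sizes, chosen only for illustration.
D_MODEL, N_HEADS, HEAD_DIM, D_LATENT = 64, 4, 16, 8


def rand_matrix(rows, cols):
    """Random stand-in for a learned weight matrix (rows x cols)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def split_heads(flat):
    """Cut a flat vector into N_HEADS chunks of HEAD_DIM values."""
    return [flat[i * HEAD_DIM:(i + 1) * HEAD_DIM] for i in range(N_HEADS)]


W_down = rand_matrix(D_LATENT, D_MODEL)             # hidden state -> latent
W_up_k = rand_matrix(N_HEADS * HEAD_DIM, D_LATENT)  # latent -> keys, all heads
W_up_v = rand_matrix(N_HEADS * HEAD_DIM, D_LATENT)  # latent -> values, all heads


def compress(hidden):
    """Down-project one token's hidden state; only this result is cached."""
    return matvec(W_down, hidden)


def decompress(latent):
    """Rebuild per-head keys and values from a cached latent."""
    return split_heads(matvec(W_up_k, latent)), split_heads(matvec(W_up_v, latent))


# Cache one token, then rebuild its keys and values at attention time.
hidden = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
latent_cache = [compress(hidden)]
keys, values = decompress(latent_cache[0])

mha_values_per_token = 2 * N_HEADS * HEAD_DIM  # keys + values for every head
mla_values_per_token = D_LATENT                # one shared latent
ratio = mla_values_per_token / mha_values_per_token
print(f"cached values per token: MHA={mha_values_per_token}, "
      f"latent={mla_values_per_token}, ratio={ratio:.2%}")
```

The cache only ever holds `latent_cache`. The per-head `keys` and `values` exist transiently, which is where the extra compute comes from.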

The smaller cache lets DeepSeek run with much longer contexts and higher batch sizes on the same hardware than GQA-style alternatives. The trade-off is computational: the decompression adds FLOPs, and the custom kernels needed to make it fast are a maintenance burden compared to standard GQA.
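A back-of-the-envelope calculation makes both sides of that trade-off concrete. The configuration below is an assumption made up for this example (32 layers, 32 query heads, 8 shared KV heads for GQA, a 512-wide latent, 2-byte values, a 16 GiB cache budget), not the published dimensions of any DeepSeek model. The FLOP count is a naive one that rebuilds full keys and values for each cached token, and optimized kernels may restructure that work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical model shape, for illustration only."""
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8       # shared KV heads under GQA
    head_dim: int = 128
    d_latent: int = 512       # width of the cached latent
    bytes_per_value: int = 2  # e.g. a 16-bit float


def cache_values_per_token_per_layer(cfg, scheme):
    """Number of values cached for one token in one layer."""
    if scheme == "mha":
        return 2 * cfg.n_heads * cfg.head_dim
    if scheme == "gqa":
        return 2 * cfg.n_kv_heads * cfg.head_dim
    if scheme == "latent":
        return cfg.d_latent
    raise ValueError(f"unknown scheme: {scheme}")


def cache_bytes_per_token(cfg, scheme):
    """Cache footprint of one token across all layers."""
    per_layer = cache_values_per_token_per_layer(cfg, scheme)
    return per_layer * cfg.n_layers * cfg.bytes_per_value


def tokens_that_fit(cfg, scheme, budget_bytes):
    """How many cached tokens (context length x batch) fit in the budget."""
    return budget_bytes // cache_bytes_per_token(cfg, scheme)


def naive_decompress_flops_per_token_per_layer(cfg):
    """One multiply and one add per weight in the key and value up-projections."""
    return 2 * cfg.d_latent * (2 * cfg.n_heads * cfg.head_dim)


cfg = Config()
budget = 16 * 2**30  # 16 GiB set aside for the KV cache (assumed)

for scheme in ("mha", "gqa", "latent"):
    print(f"{scheme:>6}: {cache_bytes_per_token(cfg, scheme):>7} bytes/token, "
          f"{tokens_that_fit(cfg, scheme, budget):>7} tokens fit")

print("naive decompression FLOPs per cached token per layer:",
      naive_decompress_flops_per_token_per_layer(cfg))
```

Under these assumed numbers the latent cache holds 16 times as many tokens as full multi-head attention and 4 times as many as the GQA setting, at the price of the up-projection work that standard GQA never has to do.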

Beyond DeepSeek, the technique has been slower to spread; most other open families stick with GQA. Whether MLA becomes general or stays a DeepSeek signature is an open question through 2026.
