Glossary

hybrid attention

An attention design that interleaves different mechanisms across layers, typically global plus sliding-window, to combine quality with long-context efficiency.

Runtime also: Weights aka hybrid-attention, interleaved attention, gqa-sliding

A transformerruntimeThe neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM. Open full entry attention pattern where different layers use different attention mechanisms, interleaved across the depth of the network. The most common design pairs full global attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry with sliding window attentionruntimeAn attention pattern where each token attends only to a fixed window of recent tokens, trading global lookup for linear-cost inference at long sequence lengths. Open full entry , giving roughly half the cost of all-global attention while keeping the model’s ability to attend to distant tokens through the global layers.

Mistral 7B (September 2023) was the first widely-used open weightsweightsA model release that publishes the trained parameters under some downloadable license, distinct from "open source" which (per OSAID) also requires data and training-code openness. Open full entry release to ship a hybrid pattern, combining GQAruntimeAn attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention. Open full entry with a 4096-token sliding window. Gemma 2 (June 2024) extended the idea, alternating sliding-window and full-attention layers in a 1:1 ratio, with the window set to 4096 tokens against a 8192-token context. The Gemma 2 technical report frames this as an explicit cost/quality tradeoff: the sliding layers cost less per token, and the periodic full-attention layers carry the long-range routing capacity.

Hybrid designs occupy the middle of a spectrum that runs from pure sliding-window (cheap but no long-range lookup) to all-global (the original transformer cost). The exact ratio and window size are tuned per-model, and the attention-variant column on this catalog records them as hybrid-gqa-sliding to distinguish from plain GQAruntimeAn attention variant where multiple query heads share the same key and value heads, reducing KV cache size with little quality cost compared to full multi-head attention. Open full entry .

Sources

Mentioned in

sliding window attention

Back to glossary