Glossary

sliding window attention

An attention pattern where each token attends only to a fixed window of recent tokens, trading global lookup for linear-cost inference at long sequence lengths.

Runtime also: Weights aka sliding-window-attention, swa, sliding window, sliding-window

A local attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry pattern where each token attends only to a fixed-size window of preceding tokens (typically 256 to 4096) rather than the entire context. Cost scales linearly with sequence length instead of quadratically, which is the headline benefit. The trade-off is that distant tokens are no longer directly reachable in a single attention step; information has to propagate through the layers.

The technique was popularized for long documents by the Longformer paper in 2020. Mistral 7B in September 2023 brought sliding-window attention to the open-weights mainstream, using a 4096-token window inside an 8192-token context. The Mistral 7B paper reports the resulting memory and throughput wins that made the model competitive with the much larger Llama 2 13B at the time.

Sliding-window rarely appears in pure form in current frontier releases. The dominant pattern is hybrid attentionruntimeAn attention design that interleaves different mechanisms across layers, typically global plus sliding-window, to combine quality with long-context efficiency. Open full entry : alternate sliding layers with periodic full-attention layers so the model can still attend globally when needed. GemmaweightsGoogle's open-weight model family derived from Gemini research, with source-available licensing that includes an acceptable-use clause and license-revocation hook. Open full entry 2 is the cleanest example of this design. LlamaweightsMeta's open-weight model family, the most widely deployed open release through 2024 to 2026, released under the source-available Community License with an MAU cap and acceptable-use clause. Open full entry 4 Scout’s 10M-token context combines sliding-window techniques with other extensions, though Meta has not fully disclosed the exact attention pattern in the model card.

Sources

Mentioned in

hybrid attention

Back to glossary