Glossary
state space model
An alternative to attention that processes sequences via a learned linear recurrence; scales linearly with sequence length where attention scales quadratically.
A neural architecture that replaces or supplements the attention mechanism in a transformer with a learned linear recurrence over the sequence. Where attention’s compute and memory scale quadratically with context length, a state space model (SSM) scales linearly, which is the headline selling point for very long contexts.
The recurrence has the shape h_t = A h_(t-1) + B x_t, with A and B
learned. Earlier SSM variants (S4 in 2021, then H3) struggled to match
attention quality on language tasks. Mamba, released in December 2023,
made A and B input-dependent (the “selective” mechanism). That change
closed most of the gap: Mamba-3B matched transformers of comparable
size on standard benchmarks, at much lower inference cost on long
sequences.
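A minimal sketch of the recurrence above in plain Python, assuming a diagonal A (so each state entry updates independently) and one scalar input per step. The readout vector c and the name ssm_scan are illustrative additions, not part of the definition above, and A and B are fixed here rather than input-dependent.

```python
def ssm_scan(a, b, c, xs):
    """Run h_t = A h_(t-1) + B x_t over a sequence, with diagonal A.

    a, b, c: lists of length N (the state size). a holds the diagonal of A,
             b is B for a scalar input, c is an assumed readout y_t = c . h_t.
    xs:      list of scalar inputs, one per time step.
    """
    h = [0.0] * len(a)          # fixed-size state, independent of len(xs)
    ys = []
    for x in xs:                # one pass over the sequence: linear in length
        h = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, h)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# An impulse decays geometrically through a one-dimensional state.
print(ssm_scan(a=[0.5], b=[1.0], c=[1.0], xs=[1.0, 0.0, 0.0]))
# -> [1.0, 0.5, 0.25]
```

Each step touches only the current state, so the work per token stays constant however long the sequence grows. A selective variant would compute a and b from x inside the loop rather than passing them in as constants.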
In open-weights AI, SSMs remain a research line rather than the default. The trade-off is that the recurrence has no explicit key-value cache to look back through, so retrieval-heavy benchmarks (long-context needle-in-a-haystack tasks) historically favored attention. Hybrid designs that interleave SSM layers with attention layers (Jamba from AI21, Zamba from Zyphra) have been the main path to picking up the linear-scaling benefit without losing the lookup capability.
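The memory side of this trade-off can be sketched with simple counting. The functions below are a rough illustration, not a measurement: every dimension is a hypothetical value chosen for the example, and the count ignores details such as convolution buffers in real SSM layers.

```python
def kv_cache_elements(n_attn_layers, n_kv_heads, head_dim, seq_len):
    """Elements cached by attention: one key and one value vector per token."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len


def ssm_state_elements(n_ssm_layers, d_inner, d_state):
    """Elements held by the recurrence: a fixed-size state, no seq_len term."""
    return n_ssm_layers * d_inner * d_state


def hybrid_elements(n_layers, attn_every, n_kv_heads, head_dim,
                    d_inner, d_state, seq_len):
    """Hybrid stack with one attention layer per `attn_every` layers (assumed)."""
    n_attn = n_layers // attn_every
    n_ssm = n_layers - n_attn
    return (kv_cache_elements(n_attn, n_kv_heads, head_dim, seq_len)
            + ssm_state_elements(n_ssm, d_inner, d_state))


# Hypothetical 32-layer model at a 1,000-token context.
print(kv_cache_elements(32, 8, 128, 1000))               # 65536000
print(ssm_state_elements(32, 4096, 16))                  # 2097152
print(hybrid_elements(32, 8, 8, 128, 4096, 16, 1000))    # 10027008
```

Only the attention term grows with seq_len. A hybrid keeps a few attention layers for lookup and pays for a cache on those layers alone.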
None of the open-weights checkpoints in this catalog are pure SSM as of 2026; the dominant architecture remains attention-based with GQA or MLA for memory savings.
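As a rough illustration of the memory savings mentioned here, the sketch below counts cache bytes per token for full multi-head attention, GQA, and an MLA-style compressed cache. All layer counts, head counts, and the latent width are hypothetical, and the MLA line is a simplification that counts only one compressed vector per layer.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Cache bytes per token when every KV head stores a key and a value."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def latent_bytes_per_token(n_layers, latent_dim, bytes_per_elem=2):
    """Cache bytes per token when each layer stores one compressed latent."""
    return n_layers * latent_dim * bytes_per_elem


# Hypothetical 32-layer model, 32 query heads of width 128, 16-bit cache.
mha = kv_bytes_per_token(32, n_kv_heads=32, head_dim=128)  # every head has its own K/V
gqa = kv_bytes_per_token(32, n_kv_heads=8, head_dim=128)   # 4 query heads share each K/V
mla = latent_bytes_per_token(32, latent_dim=512)           # assumed latent width

print(mha, gqa, mla)  # 524288 131072 32768
```

In this example GQA cuts the cache by the ratio of query heads to KV heads (4x), and the compressed latent cuts it further, but both still grow with every token.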