The Open-Source AI Stack


state space model

An alternative to attention that processes sequences via a learned linear recurrence; scales linearly with sequence length where attention scales quadratically.

Category: Runtime. Also known as: ssm, mamba, state-space-model

A neural architecture that replaces or supplements the attention mechanism in a transformer with a learned linear recurrence over the sequence. Where attention’s compute and memory scale quadratically with context length, a state space model (SSM) scales linearly, which is the headline selling point for very long contexts.
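The quadratic-versus-linear contrast can be made concrete with a rough count of units of work. The sketch below is back-of-the-envelope arithmetic, not a benchmark: it counts (query, key) pairs for causal attention and state updates for a recurrence, and drops all constant factors such as head count and hidden size. The function names are illustrative.

```python
def attention_pairs(seq_len: int) -> int:
    """(query, key) pairs under causal attention: each token attends to
    itself and every earlier token, giving T * (T + 1) / 2 pairs."""
    return seq_len * (seq_len + 1) // 2


def recurrence_steps(seq_len: int) -> int:
    """State updates for a linear recurrence: one per token."""
    return seq_len


for t in (1_000, 2_000, 4_000):
    print(t, attention_pairs(t), recurrence_steps(t))
```

Doubling the sequence length roughly quadruples the pair count while only doubling the number of recurrence steps.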

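The recurrence has the shape h_t = A h_(t-1) + B x_t, with A and B learned. Earlier SSM variants (S4 in 2021, then H3) struggled to match attention quality on language tasks. Mamba, released in December 2023, made the recurrence parameters depend on the input (the “selective” mechanism): the step size, the input projection B, and an output projection C are computed from each token, so the effective (discretized) A and B vary along the sequence. That closed most of the gap. The paper reported Mamba-3B outperforming transformers of the same size and matching transformers twice its size on standard benchmarks, with much lower inference cost on long sequences.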
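A minimal sketch of the recurrence above, in plain Python. Two things here are assumptions for illustration and not part of the entry: A is taken to be diagonal, so the update is elementwise, and a readout y_t = sum(c * h_t) is added so the scan produces an output. The parameters are fixed rather than input-dependent, so this corresponds to the pre-Mamba, non-selective form.

```python
from typing import List, Sequence


def ssm_scan(
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    xs: Sequence[float],
) -> List[float]:
    """Run h_t = a * h_(t-1) + b * x_t elementwise and emit y_t = sum(c * h_t).

    a, b, c are length-N vectors (diagonal A is an assumption of this sketch);
    xs is a scalar input sequence of length T.
    """
    n = len(a)
    h = [0.0] * n  # fixed-size state: its size does not depend on len(xs)
    ys: List[float] = []
    for x in xs:  # a single pass over the sequence, O(T * N) work
        h = [a[i] * h[i] + b[i] * x for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys


# One-dimensional state with decay 0.5, fed a constant input of 1.0.
print(ssm_scan([0.5], [1.0], [1.0], [1.0, 1.0, 1.0]))  # [1.0, 1.5, 1.75]
```

The state `h` is the only thing carried from one token to the next, which is why cost per token stays constant however long the sequence gets.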

In open-weights AI, SSMs remain a research line rather than the default. The trade-off is that the recurrence has no explicit key-value cache to look back through, so retrieval-heavy benchmarks (long-context needle-in-a-haystack tasks) historically favored attention. Hybrid designs that interleave SSM layers with attention layers (Jamba from AI21, Zamba from Zyphra) have been the main path to picking up the linear-scaling benefit without losing the lookup capability.
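The memory side of this trade-off can be sketched with simple arithmetic: an attention layer caches keys and values for every token seen so far, while an SSM layer carries a fixed-size state. All names and configuration numbers below are hypothetical and are not taken from Jamba, Zamba, or any other checkpoint; the count also ignores smaller per-layer buffers that real implementations keep.

```python
def kv_cache_elems(seq_len: int, kv_heads: int, head_dim: int) -> int:
    """Per-layer KV cache entries: one key and one value vector per cached token."""
    return 2 * seq_len * kv_heads * head_dim


def ssm_state_elems(d_inner: int, d_state: int) -> int:
    """Per-layer recurrent state entries: constant in sequence length."""
    return d_inner * d_state


def hybrid_cache_elems(
    seq_len: int,
    n_layers: int,
    attn_every: int,  # hypothetical schedule: one attention layer per attn_every layers
    kv_heads: int,
    head_dim: int,
    d_inner: int,
    d_state: int,
) -> int:
    """Total cached entries for a stack that interleaves SSM and attention layers."""
    n_attn = n_layers // attn_every
    n_ssm = n_layers - n_attn
    return (
        n_attn * kv_cache_elems(seq_len, kv_heads, head_dim)
        + n_ssm * ssm_state_elems(d_inner, d_state)
    )


# Hypothetical 32-layer model; compare all-attention with a 1-in-8 hybrid.
cfg = dict(kv_heads=8, head_dim=128, d_inner=4096, d_state=16)
for t in (1_000, 100_000):
    pure = hybrid_cache_elems(t, n_layers=32, attn_every=1, **cfg)
    hybrid = hybrid_cache_elems(t, n_layers=32, attn_every=8, **cfg)
    print(t, pure, hybrid)
```

Under these assumed numbers the hybrid keeps a few attention layers for lookup, and only those layers contribute a cache term that grows with context.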

None of the open-weights checkpoints in this catalog are pure SSM as of 2026; the dominant architecture remains attention-based, with GQA (grouped-query attention) or MLA (multi-head latent attention) for memory savings.

