Glossary

transformer

The neural network architecture that combines self-attention with feed-forward layers, dominant for language modeling since 2017 and the substrate for nearly every modern LLM.

Runtime also: Training also: Weights

A stacked sequence of identical blocks. Each block has two sublayers: a self-attention mechanism that lets every token see every prior token, and a position-wise feed-forward network (FFN) that processes each position independently. Residual connections and layer norms surround both sublayers; positional encodings inject sequence order into the otherwise position-blind attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry .

The original 2017 paper introduced the architecture for machine translation. Decoder-only transformers became dominant for language modeling by 2020 with GPT-3 and remain the standard shape: hundreds of billions of parameters across dozens or hundreds of stacked blocks, trained autoregressively on web-scale text.

Variants reshape one component without changing the bones. MoE models replace the FFN with a sparse mixture of expertsweightsA model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute. Open full entry . Mamba and similar state-space architectures swap attentionruntimeThe transformer operation where each token computes a weighted average over all earlier tokens, with weights derived from learned similarity between query and key vectors. Open full entry for a recurrent operator with linear-time complexity. Diffusion language models (Gemini Diffusion, LLaDA-style) replace next-token prediction with iterative denoising. None has displaced the transformer at the frontierweightsThe current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold. Open full entry as of 2026.

Sources

Attention Is All You Need (Vaswani et al., 2017)

Mentioned in

Back to glossary