03 The Transformer


Most chat models are decoder-only Transformers: embeddings, attention, feed-forward blocks, stacked and projected to logits.

Adapted from Ahmad Osman, "LLMs 101: A Practical Guide (2026)".

Most modern LLMs are based on the Transformer architecture, and most local chat models are decoder-only Transformers: they predict the next token while looking back only at previous tokens. Everything before this point (tokens, weights, config, chat templates) is setup. The Transformer is the engine underneath.

A simplified decoder-only Transformer has a few parts. Token embeddings turn token IDs into vectors. Positional information gives the model token order; many modern LLMs use RoPE (rotary position embeddings), which encodes position by rotating query and key vectors. Self-attention lets each token look back at earlier tokens and decide what matters. A feed-forward block (the MLP) expands and compresses each representation through a dense nonlinearity, and a large fraction of a model’s parameters live here. Layer normalization and residual connections stabilize deep stacks and help information flow through many layers.
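To make these parts concrete, here is a minimal NumPy sketch of one decoder layer. It is an illustration under stated assumptions, not how any particular model is implemented: it uses a single attention head, a plain ReLU MLP (real models often use GELU or gated variants), an RMS-style normalization with no learned scale, and it leaves out positional encoding entirely. All function and parameter names (`decoder_layer`, `init_layer`, `Wq`, `W_up`, and so on) are made up for this example.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each token vector to roughly unit root-mean-square.
    # Assumption: no learned gain, to keep the sketch short.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def causal_self_attention(x, Wq, Wk, Wv, Wo):
    # x has shape (seq_len, d_model). Single head for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: token i may attend only to positions <= i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax turns scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ Wo

def mlp(x, W_up, W_down):
    # Expand to the hidden width, apply a nonlinearity, compress back.
    return np.maximum(x @ W_up, 0.0) @ W_down

def decoder_layer(x, p):
    # Residual connections: each sub-block adds its output to its input.
    x = x + causal_self_attention(rms_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + mlp(rms_norm(x), p["W_up"], p["W_down"])
    return x

def init_layer(d_model, d_ff, rng):
    # Hypothetical helper: small random weights for illustration only.
    shapes = {
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
        "W_up": (d_model, d_ff), "W_down": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.02, size=s) for name, s in shapes.items()}

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5      # toy sizes, with d_ff = 4 * d_model
params = init_layer(d_model, d_ff, rng)
x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded tokens
y = decoder_layer(x, params)             # same shape as x

# Share of this layer's weights that sit in the MLP.
attn_params = sum(params[n].size for n in ("Wq", "Wk", "Wv", "Wo"))
mlp_params = params["W_up"].size + params["W_down"].size
mlp_fraction = mlp_params / (attn_params + mlp_params)
```

With the assumed `d_ff = 4 * d_model`, the MLP holds two thirds of the layer's weights in this sketch, which is one way to see why so many parameters live in the feed-forward blocks. The exact share in a real model depends on its hidden width, gating, and attention layout.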

The attention and feed-forward layer is stacked dozens of times, sometimes more than a hundred. A single output projection then turns the final hidden state into logits over the vocabulary, and you have a language model.
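The output projection itself is a single matrix multiply. The sketch below assumes toy sizes and random values in place of a real model's final hidden state and weights; the names `logits_from_hidden`, `W_out`, and `h_final` are hypothetical. Turning logits into probabilities with a softmax and picking the highest-scoring token is shown as one simple way to read the result, not as the only decoding strategy.

```python
import numpy as np

def logits_from_hidden(h_final, W_out):
    # h_final: (d_model,) final hidden state for the last position.
    # W_out:   (d_model, vocab_size) output projection.
    return h_final @ W_out

def softmax(z):
    # Subtract the max first for numerical stability.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20              # toy sizes, chosen for illustration
h_final = rng.normal(size=d_model)       # stand-in for a real hidden state
W_out = rng.normal(size=(d_model, vocab_size))

logits = logits_from_hidden(h_final, W_out)  # one score per vocabulary entry
probs = softmax(logits)                      # scores as a distribution
next_token_id = int(np.argmax(logits))       # greedy choice of the next token
```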

Two points are worth holding onto. First, attention is order-agnostic on its own, so the positional signal (RoPE and its relatives) is what keeps word order straight. Second, because so many parameters sit in the feed-forward blocks, parameter count alone tells you little about a model’s attention cost at long context. That cost is the subject of the next two modules.
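The first point can be checked numerically. The sketch below is a simplified, pure-Python take on the rotary idea, assuming the commonly used frequency schedule `base ** (-i / d)` with `base = 10000`; the function name `rope_rotate` and the example vectors are invented for illustration. Each consecutive pair of coordinates is treated as a 2-D point and rotated by an angle proportional to the token's position. A plain dot product between a query and a key ignores where the tokens sit, while the rotated dot product depends on how far apart they are.

```python
import cmath

def rope_rotate(vec, position, base=10000.0):
    # vec: list of floats with even length; each consecutive pair
    # is treated as one 2-D subspace and rotated independently.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # assumed frequency schedule
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * position * theta)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # example query vector
k = [1.0, 0.4, -0.7, 0.9]   # example key vector

# Without rotation the score carries no position information at all.
plain_score = dot(q, k)

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# Different offset (1 position apart).
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))
```

Here `score_near` and `score_far` agree up to floating-point error because both pairs are three positions apart, while `score_other` differs because the offset changed. That relative-position dependence is what the rotation adds to otherwise order-agnostic attention scores.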