Glossary

context window

The maximum number of tokens a model can attend to in a single forward pass, set during pretraining and sometimes extended afterward via fine-tuning or training-free extrapolation tricks.

Runtime; also: Training; also: Weights; aka context length

The hard upper bound on sequence length the model can consume or produce in one pass. Pretraining defines a context length (8K, 32K, 128K, 1M tokens); the positional encoding scheme (RoPE, ALiBi, NoPE) and the attention pattern (full, sliding window, attention sink) determine how gracefully the model handles or extends beyond it.

Quality tends to degrade as the context lengthens. The “lost in the middle” finding (Liu et al., 2023) showed that accuracy drops sharply on information placed in the middle of a long context window, even when the model nominally supports the length. Long-context benchmarks like RULER and Needle-in-a-Haystack measure this directly.
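A needle-in-a-haystack style probe can be sketched as follows: a known fact (the "needle") is placed at a chosen relative depth inside filler text, and the model is then asked to retrieve it. This is a minimal illustration of the idea, not the code of any particular benchmark; the function name `build_haystack_prompt`, the filler sentences, and the needle text are all hypothetical.

```python
def build_haystack_prompt(filler_sentences, needle, depth):
    """Insert `needle` at relative position `depth` in the filler text.

    depth = 0.0 places the needle at the start, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    # Convert the relative depth into a sentence index.
    idx = round(depth * len(filler_sentences))
    sentences = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(sentences)


# Hypothetical filler and needle, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."

# One prompt per depth; a real evaluation would also sweep context length.
prompts = {d: build_haystack_prompt(filler, needle, d) for d in (0.0, 0.5, 1.0)}
```

Sweeping `depth` across many context lengths and scoring whether the model recovers the needle gives the kind of position-versus-accuracy picture in which a "lost in the middle" dip would show up.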

Practical implications for serving: the KV cache grows linearly with context length, so a 1M-token request needs roughly 125 times as much KV-cache memory as an 8K-token request (1,000,000 / 8,000 = 125). Most production deployments cap usable context far below the model’s nominal limit for cost reasons.
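The linear scaling can be checked with a back-of-the-envelope estimate. The sketch below assumes a standard cache that stores one key and one value vector per token, per layer, per KV head; the model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example, not a description of any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

short = kv_cache_bytes(8_000, **cfg)      # 8K-token request
long = kv_cache_bytes(1_000_000, **cfg)   # 1M-token request
ratio = long / short                      # depends only on the length ratio

print(f"8K tokens: {short / 2**30:.2f} GiB")
print(f"1M tokens: {long / 2**30:.2f} GiB")
print(f"ratio:     {ratio:.0f}x")
```

Because every other factor cancels, the ratio is just the ratio of sequence lengths, whatever the model size. The absolute figures do depend on the configuration, which is why techniques that shrink the number of KV heads or the bytes per value reduce the cost per token but not the linear growth.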

Sources

RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
