Glossary

ALiBi

A positional encoding that adds a linear bias to attention scores based on the distance between tokens, with no learned position parameters and natural length extrapolation.

Category: Runtime (also Training, Weights). Also known as: attention with linear biases.

An alternative to RoPE for injecting position into attention. ALiBi adds a static bias to each attention score equal to the negative distance between tokens, scaled by a per-head slope. Closer tokens get near-zero bias; distant tokens get progressively more negative scores, effectively a soft recency bias.
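The bias described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the slope schedule is the geometric sequence from the ALiBi paper for head counts that are powers of two, and folding the causal mask into the bias as negative infinity is an assumption made for brevity.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two; for 8 heads this gives
    1/2, 1/4, ..., 1/256.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], the value added to the attention score
    of query position i attending to key position j in head h.

    For j <= i the bias is -slope_h * (i - j), so it is zero on the
    diagonal and grows more negative with distance. Future positions
    (j > i) are set to -inf here to stand in for the causal mask.
    """
    slopes = alibi_slopes(num_heads)
    return [
        [
            [-m * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in slopes
    ]


if __name__ == "__main__":
    bias = alibi_bias(num_heads=8, seq_len=4)
    # Last query row of the first head: distances 3, 2, 1, 0 scaled by 1/2.
    print(bias[0][3])
```

Nothing in the sketch depends on a maximum trained length: the same function yields a bias for any `seq_len`, which is the property that lets the bias be applied to sequences longer than those seen in training.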

The headline property at release was train-short-test-long: a model trained on 1024-token inputs matched the perplexity of a sinusoidal-position baseline trained on 2048-token inputs when evaluated at that longer length, without fine-tuning, which RoPE without scaling could not. In practice the field converged on RoPE with frequency scaling for long-context work, and ALiBi sits in fewer production models than it once did.

It still appears in some open-weights families (early MPT, BLOOM, and several research-grade releases). Worth understanding as the contrast that motivated the RoPE-scaling tricks now standard in long-context fine-tuning.

Sources

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022)
