The Open-Source AI Stack

YaRN

A position-encoding extension technique that lets a RoPE-pretrained model handle context windows longer than its training length without quality collapse.

Category: Weights (also: Training, Runtime). Aka: yarn, yarn-context-extension, yet-another-rope-extension

A technique for extending the context window of a transformer that was pretrained with RoPE, without retraining from scratch and without the quality cliff that naive extrapolation triggers. RoPE is a positional encoding that rotates query and key vectors in two-dimensional subspaces by an angle proportional to their position, so that attention scores depend on relative rather than absolute position. The name expands to “Yet another RoPE extensioN”, a self-deprecating nod to how many context-extension methods for RoPE models (Position Interpolation, NTK-aware scaling, Dynamic NTK scaling) had been proposed in the year before.

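The mechanism rescales the RoPE rotation frequencies dimension by dimension. High-frequency dimensions, which complete many rotations within the training window, are left unchanged; low-frequency dimensions are interpolated by the extension factor; a linear ramp blends the two regimes in between. YaRN also applies a small temperature adjustment to the attention logits that grows with the extension factor. Together these preserve the model’s training-time behavior at short contexts while generalizing smoothly to longer ones. The paper reports that YaRN can extend the context window using roughly 10x fewer tokens and 2.5x fewer training steps than prior context-extension methods, while keeping perplexity stable across the new range. The results are demonstrated on Llama 2 models extended well beyond their original 4K training length.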
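The frequency adjustment can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function names are hypothetical, `alpha=1` and `beta=32` are the ramp bounds the paper suggests for LLaMA-family models, and `base=10000` is the usual RoPE base.

```python
import math

def yarn_inv_freq(head_dim, scale, orig_ctx, base=10000.0, alpha=1.0, beta=32.0):
    """Return one rotation frequency per 2-D subspace after YaRN-style blending.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)       # original RoPE frequency
        wavelength = 2 * math.pi / theta      # positions per full rotation
        rotations = orig_ctx / wavelength     # full turns inside the training window
        # Ramp: 0 means fully interpolate (divide by scale), 1 means leave untouched.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs

def yarn_attention_factor(scale):
    """Factor applied to query and key embeddings (the paper's 0.1 * ln(s) + 1)."""
    return 0.1 * math.log(scale) + 1.0 if scale > 1 else 1.0

# Example: a 128-dim head, 4K training window, 16x extension.
freqs = yarn_inv_freq(head_dim=128, scale=16.0, orig_ctx=4096)
factor = yarn_attention_factor(16.0)
```

In this example the fastest-rotating pair keeps its original frequency and the slowest pair is divided by the full extension factor, which is the behavior the ramp is meant to produce.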

In open-weights AI, YaRN became one of the standard tools for context-window extension. DeepSeek V2, V2.5, V3, and R1 all use YaRN to reach 128K context from a shorter training length; the attention-variant column in this catalog tags those models with position_encoding: rope-yarn to distinguish them from plain RoPE. Qwen 2 and 2.5 also use YaRN for their long-context variants.
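As a rough illustration of how a long-context variant is switched on in practice, Qwen2.5 model cards describe a `rope_scaling` entry with a `yarn` type, a scaling factor, and the original context length. The sketch below mirrors that shape as a Python dict. The `extended_context` helper is hypothetical, and the exact values vary by model, so treat the numbers as an example rather than a specification.

```python
# Shape mirrors the rope_scaling entry documented in Qwen2.5 model cards;
# the values are illustrative and differ between models.
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

def extended_context(cfg):
    """Hypothetical helper: nominal context length after YaRN scaling."""
    return int(cfg["original_max_position_embeddings"] * cfg["factor"])

target_ctx = extended_context(rope_scaling)
```

The catalog's own position_encoding: rope-yarn tag is a separate label and is not part of any model's configuration file.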

The technique is a good example of how a single targeted intervention can defer or eliminate the need for new pretraining when the goal is just to extend an existing capability.
