Glossary
perplexity
A measure of how well a language model predicts a text, equal to the exponential of the per-token cross-entropy loss; lower is better, often used for training diagnostics.
The exponential of the average negative log-likelihood per token. A perplexity of N is interpretable as “the model is on average as uncertain as if it were choosing uniformly among N options at each step.” Lower means the model assigned higher probability to the actual text, which usually correlates with better downstream behavior.
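As a minimal sketch of the definition above, the snippet below computes perplexity from the natural-log probabilities a model assigned to each actual token. The function name `perplexity` and the sample probabilities are illustrative assumptions, not taken from any particular library.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned
    to each token that actually occurred in the text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model choosing uniformly among 8 options at every step.
uniform = [math.log(1 / 8)] * 5

# A more confident model (made-up probabilities for illustration).
confident = [math.log(p) for p in (0.9, 0.8, 0.95, 0.7, 0.85)]

ppl_uniform = perplexity(uniform)      # equals 8 up to floating-point error
ppl_confident = perplexity(confident)  # lower, since the actual tokens got higher probability
```

The uniform case recovers the "choosing among N options" reading directly: with probability 1/8 on every token, the perplexity comes out as 8.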
Perplexity is the diagnostic of choice during pretraining and during quantization or fine-tuning experiments. A drop in perplexity on a held-out set tells you the model improved at the next-token-prediction task; a spike during quantization tells you the precision loss is biting.
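One way to make such a comparison concrete is to compute perplexity over the same held-out set for a baseline and a modified model, then look at the relative change. The sketch below assumes per-token log-probabilities have already been collected for each document; the helper names and the numbers are hypothetical, and how large a change counts as a spike is left to the reader.

```python
import math

def heldout_perplexity(log_probs_per_doc):
    """Perplexity over a held-out set, averaging over all tokens.

    log_probs_per_doc: list of documents, each a list of natural-log
    probabilities assigned to the actual tokens.
    """
    total_nll = -sum(sum(doc) for doc in log_probs_per_doc)
    total_tokens = sum(len(doc) for doc in log_probs_per_doc)
    if total_tokens == 0:
        raise ValueError("held-out set has no tokens")
    return math.exp(total_nll / total_tokens)

def relative_increase(baseline_ppl, candidate_ppl):
    """Fractional change in perplexity relative to the baseline."""
    return (candidate_ppl - baseline_ppl) / baseline_ppl

# Hypothetical log-probs for the same two held-out documents.
baseline = [[math.log(0.5)] * 4, [math.log(0.25)] * 2]
quantized = [[math.log(0.4)] * 4, [math.log(0.2)] * 2]

ppl_base = heldout_perplexity(baseline)
ppl_quant = heldout_perplexity(quantized)
change = relative_increase(ppl_base, ppl_quant)  # positive means the candidate got worse
```

Averaging over tokens rather than over per-document perplexities keeps long and short documents weighted by their length, which matches the per-token definition.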
For end-task quality, perplexity is unreliable: a model can have lower perplexity yet do worse at instruction following, refusal, or reasoning. For training diagnostics it remains the default because it is cheap to compute and stable across runs in a way that aggregate benchmarks are not.