Glossary
benchmark
A standardized dataset and scoring rubric used to compare model capability on a defined task, the unit of model evaluation since GLUE made the format the default.
A fixed dataset paired with a scoring rubric, run across many models to produce comparable numbers. Benchmarks dominate model evaluation because they are reproducible and comparable across time and across labs, even if their construct validity is often debated.
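As a minimal sketch of the "fixed dataset plus scoring rubric" idea, the Python below runs the same items through several models and reports one accuracy per model. Everything in it is hypothetical and for illustration only: the `Item` type, the `exact_match` rubric, the `run_benchmark` helper, and the two toy "models" are assumptions, not the API of any real evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Item:
    """One benchmark question with its reference answer (hypothetical type)."""
    prompt: str
    answer: str


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring rubric: case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def run_benchmark(
    items: List[Item],
    models: Dict[str, Callable[[str], str]],
    score: Callable[[str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Run every model on the same fixed items and return accuracy per model."""
    results = {}
    for name, model in models.items():
        correct = sum(score(model(item.prompt), item.answer) for item in items)
        results[name] = correct / len(items)
    return results


# A tiny fixed dataset; real benchmarks have hundreds to thousands of items.
ITEMS = [
    Item("2 + 2 = ?", "4"),
    Item("Capital of France?", "Paris"),
    Item("Opposite of hot?", "cold"),
]


def always_four(prompt: str) -> str:
    """Toy model that answers "4" to everything."""
    return "4"


def lookup_model(prompt: str) -> str:
    """Toy model that knows two of the three answers."""
    table = {"2 + 2 = ?": "4", "Capital of France?": "paris"}
    return table.get(prompt, "unknown")


scores = run_benchmark(ITEMS, {"always_four": always_four, "lookup": lookup_model})
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Because the items and the rubric are held fixed, the two accuracies are directly comparable, which is the property the definition above relies on.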
Two persistent problems. Saturation: as models improve, top benchmarks hit ceiling scores within a year or two of release (HellaSwag, GLUE, and MMLU all saturated and gave way to harder successors). Contamination: benchmark questions leak into pretraining corpora, inflating scores without real capability gain.
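One common way to screen for contamination is to check how many of a benchmark question's word n-grams also appear verbatim in training text. The sketch below illustrates that idea under simplifying assumptions: whitespace tokenization, a default of 5-grams, and an in-memory list of documents. The function names `ngrams` and `overlap_fraction` are hypothetical, and the example strings are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the question's n-grams that occur verbatim in the corpus."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question is shorter than n tokens; nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)


# Made-up example data: one document quotes the question, one does not.
question = "what is the boiling point of water at sea level"
leaked_doc = "forum post: what is the boiling point of water at sea level answer 100 c"
clean_doc = "water boils when heated enough"

leaked_score = overlap_fraction(question, [leaked_doc])
clean_score = overlap_fraction(question, [clean_doc])
print(leaked_score, clean_score)
```

A high overlap only flags a question for review. Paraphrased or translated leaks would pass this check, so it is a lower bound on contamination rather than a proof of cleanliness.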
Hard-but-not-yet-saturated benchmarks in 2026 cover reasoning (ARC-AGI, FrontierMath, HLE), agentic behavior (SWE-Bench Verified, τ-bench, GAIA), and long-context tasks (RULER, NIAH). The “leaderboard” view of model progress is downstream of which benchmarks are still meaningful at any given time.