Glossary

benchmark

A standardized dataset and scoring rubric used to compare model capability on a defined task. It has been the basic unit of model evaluation since GLUE made the format the default.

Category: Evaluation (also: Training, Weights). Also known as: evaluation benchmark.

A fixed dataset paired with a scoring rubric, run across many models to produce comparable numbers. Benchmarks dominate model evaluation because they are reproducible and comparable across time and across labs, even if their construct validity is often debated.
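As a rough illustration of "a fixed dataset paired with a scoring rubric, run across many models", here is a minimal sketch. Everything in it is hypothetical: the three-item dataset, the exact-match rubric, and the two stand-in models (plain functions from prompt to answer) are invented for illustration and do not correspond to any real benchmark or harness.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy dataset: (question, reference answer) pairs.
# A real benchmark fixes thousands of items; three are enough to show the shape.
DATASET: List[Tuple[str, str]] = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring rubric (assumed): case-insensitive exact match after trimming."""
    return prediction.strip().lower() == reference.strip().lower()


def run_benchmark(model: Callable[[str], str]) -> float:
    """Return one model's accuracy over the fixed dataset."""
    correct = sum(exact_match(model(question), ref) for question, ref in DATASET)
    return correct / len(DATASET)


# Stand-in "models": ordinary functions, not real model APIs.
def model_a(prompt: str) -> str:
    answers = {
        "2 + 2 = ?": "4",
        "Capital of France?": "paris",
        "Opposite of hot?": "warm",
    }
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    return "4"  # answers "4" to everything


# Same data and same rubric for every model, so the numbers are comparable.
scores: Dict[str, float] = {
    name: run_benchmark(fn) for name, fn in [("model_a", model_a), ("model_b", model_b)]
}
```

Because the dataset and rubric never change between runs, a later model can be scored against the same items and compared directly, which is the reproducibility property described above.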

Two problems persist. Saturation: as models improve, top benchmarks hit ceiling scores within a year or two of release (HellaSwag, GLUE, and MMLU all saturated and gave way to harder successors). Contamination: benchmark questions leak into pretraining corpora, inflating scores without real capability gain.
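One common way to look for contamination is to check whether word n-grams from a benchmark question also appear in training text. The sketch below assumes whitespace tokenization and an 8-gram window. The function names, the window size, and the example strings are illustrative assumptions, not a standard tool or an established threshold.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a string (empty set if too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_docs: List[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in any corpus document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)


# Hypothetical example strings, invented for illustration.
question = "what is the boiling point of water at sea level in celsius"
leaked_doc = "quiz dump what is the boiling point of water at sea level in celsius answer 100"
clean_doc = "water boils at lower temperatures at high altitude"

leaked_score = overlap_fraction(question, [leaked_doc])
clean_score = overlap_fraction(question, [clean_doc])
```

A high overlap fraction suggests the question text may have been seen during training. A low one does not rule contamination out, since paraphrased or translated copies would slip past an exact n-gram match.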

Hard-but-not-yet-saturated benchmarks in 2026 cover reasoning (ARC-AGI, FrontierMath, HLE), agentic behavior (SWE-Bench Verified, tau-bench, GAIA), tool calling (Berkeley Function Calling Leaderboard), and long-context (RULER, NIAH). The “leaderboard” view of model progress is downstream of which benchmarks are still meaningful at any given time.

Sources

BIG-bench: Imitation Game (Srivastava et al., 2022)
