Evaluation

What it is

Overview

How we measure progress. The benchmarks, harnesses, and leaderboards that decide which models are “better” and which agent products work. This is a meta-layer because it observes the rest of the stack rather than sitting between two production-pipeline layers. Every claim of “better” anywhere else on the stack depends on what evaluation accepts as proof.

Five things to keep in mind as you read:

Evaluation is the layer that grades the rest. Without it, “open is competitive” and “open is behind” are both unfalsifiable.
Open coverage is good. lm-evaluation-harness, HELM, AISI Inspect, LMArena. The infrastructure is mostly open.
The unsolved problems aren’t openness. They’re contamination (eval data leaked into training data) and saturation (benchmarks topped out).
A small set of benchmarks still discriminates. HLE, GPQA Diamond, FrontierMath, SWE-bench Verified.
Closed labs run private evals. AISI, METR, and Apollo Research are the credible semi-public counterweights.

The rest of this page walks the open harnesses, the contamination-and-saturation problems, the frontier-resistant benchmarks, and the private-eval counterweight question.

The open harnesses

What the open ecosystem uses to score models.

lm-evaluation-harness (EleutherAI) is the de facto reference. Hundreds of academic benchmarks behind a single CLI. Cited in nearly every open-weights release paper from 2022 onward. The “if you don’t know which harness to use, use this” answer (lm-eval repository).

HELM (Stanford CRFM) is the principled large-scale alternative. Defines a fixed taxonomy of scenarios and metrics (accuracy, robustness, fairness, calibration, etc.) and runs every model on the same fixed grid for apples-to-apples comparison (HELM project).

AISI Inspect (UK AI Security Institute, formerly the AI Safety Institute) is the safety-evaluation-focused framework. Designed for capability evaluations (can the model do dangerous thing X?) rather than benchmark-style accuracy tests (AISI Inspect repository).

LMArena (Chatbot Arena, originally LMSYS) is the human-preference evaluation. Anonymous head-to-head comparisons between models from real users, Elo-ranked. The most-cited “real users prefer X over Y” data source in the field through 2025 (LMArena leaderboard).
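To make "Elo-ranked" concrete, here is a minimal sketch of how pairwise votes can be turned into ratings with the classic online Elo update. The model names, the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, and the statistical method behind the live leaderboard may differ from this simple version.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_votes(votes, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Replay pairwise votes in order and return one rating per model.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # The winner gains what the loser gives up, so total rating is conserved.
        ratings[model_a] = ra + k * (outcome - ea)
        ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return dict(ratings)


# Hypothetical votes; the model names are placeholders, not real systems.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-y", "model-x", 0.0),
]
ratings = elo_from_votes(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because each update depends on the current ratings, the result of this online version depends on vote order. That is one reason a production leaderboard might prefer a batch fit over all votes.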

The closed equivalents are mostly internal-only. Anthropic, OpenAI, and Google all run substantial internal evaluation suites that don’t get published; what they release publicly is typically selected results from public benchmarks.

Contamination and saturation

The two structural problems that aren’t about openness.

Contamination occurs when a model’s training data includes the evaluation set. A model trained on the internet may have seen MMLU questions verbatim; if so, its score is artificially inflated. Detection methods exist (perplexity on the test set, membership inference attacks), but they are imperfect and the labs aren’t incentivized to self-report. The most-trusted defense is to evaluate on benchmarks released after the model’s training cutoff.
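As a rough illustration of what a contamination check looks like, the sketch below uses verbatim n-gram overlap between a test item and a training corpus. This is a cruder technique than the perplexity and membership-inference methods named above, and it only works if you can inspect the training data. The 8-token window, the whitespace tokenizer, and the example strings are all assumptions chosen for brevity.

```python
def ngrams(text: str, n: int) -> set:
    """All n-token windows of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs, n: int = 8) -> float:
    """Fraction of the test item's n-grams found verbatim in any corpus doc."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical benchmark items and training documents.
leaked_question = (
    "which organelle is primarily responsible for producing atp "
    "in a eukaryotic cell"
)
clean_question = (
    "a train leaves the station at noon traveling east at sixty miles per hour"
)
corpus = [
    f"quiz dump from a forum thread: {leaked_question} the answer is mitochondria",
    "an unrelated page about regional train timetables and ticket prices",
]

leaked_score = overlap_fraction(leaked_question, corpus)
clean_score = overlap_fraction(clean_question, corpus)
```

A high score flags an item for review rather than proving contamination, and a low score proves little: paraphrased or translated leaks pass straight through a verbatim check.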

Saturation is when a benchmark is so close to its ceiling that it no longer separates models. MMLU saturated for frontier models around 2023-2024; GSM8K saturated through 2024. The benchmarks that made open-vs-closed comparisons meaningful five years ago no longer do, because the leading models cluster within the benchmark’s noise floor.
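The "noise floor" can be estimated with a short calculation. The sketch below treats each benchmark question as an independent pass/fail trial (a simplifying assumption) and asks whether the gap between two accuracy scores exceeds roughly two standard errors of the difference. The accuracies and question counts are made up for illustration.

```python
import math


def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def is_distinguishable(acc_a: float, acc_b: float, n_questions: int,
                       z: float = 1.96) -> bool:
    """True if the accuracy gap exceeds z standard errors of the difference.

    Treats the two scores as independent samples; a paired test on
    per-question results would be tighter when those are available.
    """
    se_diff = math.sqrt(
        accuracy_std_error(acc_a, n_questions) ** 2
        + accuracy_std_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


# Hypothetical scores: two models near the ceiling of a 1000-question benchmark,
# and two models with more headroom on a benchmark of the same size.
close_pair = is_distinguishable(0.88, 0.90, 1000)
wide_pair = is_distinguishable(0.60, 0.70, 1000)

# A small test set widens the noise floor: a 5-point gap on 198 questions.
small_set_pair = is_distinguishable(0.70, 0.75, 198)
```

Under these assumptions a two-point gap near the top of a 1000-question benchmark is not separable from sampling noise. That is the practical meaning of a saturated benchmark.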

Both problems push the eval community toward newer, harder, private-test-set benchmarks faster than the academic publication cycle can handle. The frontier-resistant set turns over roughly annually.

The frontier-resistant benchmarks (2026)

The current set that still discriminates between top models.

HLE (Humanity’s Last Exam, CAIS + Scale AI, 2025) — 3000 questions across math, science, humanities, designed explicitly to be at the upper bound of human expert capability. Frontier models score 20-40% as of mid-2026 (HLE paper).
GPQA Diamond (Rein et al., 2023) — 198 graduate-level science questions where experts in adjacent fields still score below 50%. Frontier models cracked 70% in late 2025 (GPQA paper).
FrontierMath (Epoch AI, 2024) — research-grade mathematics, designed to be hard for both humans and models. Frontier models score below 10% as of 2026 (FrontierMath benchmark).
SWE-bench Verified (Princeton + OpenAI, 2024) — real GitHub issues that require multi-file edits to resolve, with a hand-verified subset. Tests agent capability, not just language modeling (SWE-bench Verified release).
AIDER LLM benchmark — code-editing benchmark from the Aider project; less academic but more representative of real coding-agent work (Aider leaderboard).
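For the agentic benchmarks in this list, "score" means something different from multiple-choice accuracy. A minimal sketch of SWE-bench-style scoring, assuming the two test groups the SWE-bench dataset labels FAIL_TO_PASS and PASS_TO_PASS: an issue counts as resolved only if the tests that failed before the patch now pass and the tests that already passed still do. The `InstanceResult` class and the instance IDs are hypothetical. A real harness runs the repository's tests in an isolated environment instead of receiving results as dictionaries.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one issue after applying the model's patch."""
    instance_id: str
    fail_to_pass: dict  # test name -> passed after patch? (tests the fix must repair)
    pass_to_pass: dict  # test name -> passed after patch? (tests the fix must not break)


def is_resolved(result: InstanceResult) -> bool:
    """Resolved means every target test now passes and nothing regressed."""
    fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    unbroken = all(result.pass_to_pass.values())
    return fixed and unbroken


def resolve_rate(results) -> float:
    """Share of instances resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Hypothetical run over three issues.
results = [
    InstanceResult("repo-a-101", {"test_bug": True}, {"test_old": True}),
    InstanceResult("repo-b-202", {"test_bug": True}, {"test_old": False}),  # regression
    InstanceResult("repo-c-303", {"test_bug": False}, {"test_old": True}),  # not fixed
]
rate = resolve_rate(results)
```

The all-or-nothing rule per instance is what makes this a test of agent capability: a patch that fixes the bug but breaks an existing test scores the same as no patch at all.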

The pattern is that benchmarks resistant to saturation share two properties: they require expertise that’s hard to scrape from the internet, and they have private test sets that the labs can’t pretrain on. When either property fails, the benchmark saturates quickly.

The capability-evaluation counterweight

Beyond model-quality benchmarks, a separate evaluation practice has grown around dangerous-capability assessment.

AISI (UK AI Security Institute, formerly the AI Safety Institute): runs pre-release evaluations on frontier models from Anthropic, OpenAI, and Google DeepMind under voluntary agreements (AISI blog).
METR (Model Evaluation and Threat Research): focused on autonomy and agentic-capability evaluations; runs the largest external evaluations of frontier models for catastrophic-risk indicators (METR research).
Apollo Research: focused on deception, scheming, and in-context reasoning failures (Apollo research).

These three together are the closest the field has to independent verification of what frontier models can do. Their methods are mostly public; specific evaluation results from specific model-lab pairs are typically under non-disclosure. The arrangement is uncomfortable (the evaluators depend on voluntary lab cooperation) but is currently the only path to any external scrutiny at all.

What’s open and what isn’t

Open harnesses: lm-evaluation-harness, HELM, AISI Inspect. Cover most of the evaluation surface.
Open benchmark datasets: MMLU, GSM8K, GPQA, ARC, HellaSwag, plus hundreds of smaller-scale academic benchmarks. Mostly under permissive academic licenses.
Private-test-set benchmarks: HLE, FrontierMath, SWE-bench Verified. The harness is open, the test set is held back to prevent contamination. Standard practice for saturation-resistant benchmarks.
Closed labs’ internal evals: the most extensive evaluation work in the field, and almost entirely unpublished.

The reverse-lock-in risk at this layer is that the closed-lab internal evals shape the actual training trajectories of the closed models, while the public benchmarks shape what we know about them. The gap between “what closed labs actually know about their models” and “what the public can verify about them” is the largest information asymmetry on the stack.

The editorial tension

The accountable-AI argument for evaluation is that without external verification of what models can do, claims about safety, capability, and alignment are unfalsifiable. The AISI / METR / Apollo work is the strongest version of that argument actually shipping.

The pragmatic counterargument is that the public benchmark canon is saturated for the questions that matter most (frontier capability, dangerous behavior), and that the private-test-set benchmarks are run by a small number of organizations the field is increasingly dependent on. If HLE’s questions leak, or FrontierMath’s curators get bought, the field loses one of its few remaining measurement instruments at frontier scale.

The strategic question is whether the public-interest evaluation infrastructure (AISI, METR, Apollo, plus the academic benchmark community) scales with the labs it evaluates. If yes, the field has a real counterweight. If no, the labs increasingly grade themselves, and “open weights beat closed weights” becomes a claim no external party can verify.

Key projects

10 catalogued · ordered open first

lm-evaluation-harness · Open source
EleutherAI's standard harness; reference implementation for most public evals; the open-ecosystem default.
MIT · stable

LMArena (Chatbot Arena) · Open source
Pairwise human voting converted to Elo; gold standard for 'how do humans prefer this model'; hard to game.
Apache 2.0 · stable

MMLU / MMLU-Pro · Open source
Canonical multitask benchmark (57 subjects); v1 mostly saturated; MMLU-Pro is the harder successor.
MIT · maintenance

HumanEval / HumanEval+ · Open source
Foundational code-completion benchmark; HumanEval+ extends test cases; superseded by LiveCodeBench for contamination resistance.
MIT · maintenance

SWE-bench Verified · Open source
Real GitHub issues from popular Python projects; the de facto standard for coding-agent capability.
MIT · stable

LiveBench / LiveCodeBench · Open source
Contamination-aware benchmark with monthly question refresh; resists training-set leakage.
MIT · stable

GPQA Diamond · Open source
Graduate-level physics, chemistry, biology questions; hard, slow to saturate.
CC-BY-4.0 · stable

Humanity's Last Exam (HLE) · Open source
Center for AI Safety and Scale AI joint; designed as the 'final boss' benchmark.
MIT · new

AISI Inspect · Open source
UK AI Security Institute's evaluation framework for AI safety; open-sourced; widely adopted in safety research.
MIT · stable

FrontierMath · Source available
Currently the hardest math benchmark; frontier labs negotiated partial access deals.
Source-available · stable
