Glossary

leaderboard

A ranked listing of models scored on one benchmark or aggregate, with LMArena and SWE-Bench Verified as the main 2026 reference points and the Open LLM Leaderboard now archived.

Evaluation

The frontend of the benchmark ecosystem. A leaderboard ranks models on a fixed evaluation pipeline; sometimes it is a single benchmark, often it is a weighted aggregate of several. The visibility leaderboards provide is the practical mechanism by which a new open-weights model gets adopted.
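As a rough illustration of the weighted-aggregate case, the sketch below combines per-benchmark scores into one number and sorts models by it. The model names, benchmark names, scores, and weights are all hypothetical, and real leaderboards may normalize scores differently before averaging.

```python
from typing import Dict, List, Tuple


def aggregate_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of one model's per-benchmark scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank_models(
    results: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (model, aggregate) pairs sorted from best to worst."""
    aggregated = {model: aggregate_score(s, weights) for model, s in results.items()}
    return sorted(aggregated.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores on two made-up benchmarks (0-100 scale assumed).
results = {
    "model-x": {"bench_a": 80.0, "bench_b": 60.0},
    "model-y": {"bench_a": 70.0, "bench_b": 90.0},
}

equal = rank_models(results, {"bench_a": 1.0, "bench_b": 1.0})
skewed = rank_models(results, {"bench_a": 4.0, "bench_b": 1.0})
print(equal)
print(skewed)
```

With equal weights model-y leads; weighting bench_a four times as heavily puts model-x on top. The choice of weights is part of what a leaderboard measures, not a neutral detail.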

Three patterns of leaderboard exist. Static-benchmark leaderboards (MTEB, GPQA, the archived Open LLM Leaderboard) run a fixed evaluation suite. Arena-style leaderboards (LMArena, formerly Chatbot Arena) use pairwise human ranking across blind matchups. Task-specific leaderboards (SWE-Bench, GAIA, ARC-AGI) measure performance on one well-defined task.
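To make the arena-style pattern concrete, here is a minimal Elo-style sketch that turns a list of pairwise votes into a ranking. It is an illustration only, not LMArena's actual methodology; the starting rating of 1000, the K-factor of 32, and the model names are assumptions.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0
) -> None:
    """Shift rating points from loser to winner in proportion to the surprise."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_from_votes(
    votes: List[Tuple[str, str]], initial: float = 1000.0, k: float = 32.0
) -> List[Tuple[str, float]]:
    """Process (winner, loser) votes in order and return models sorted by rating."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_ratings(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical blind-matchup outcomes: each tuple is (preferred model, other model).
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
ranking = rank_from_votes(votes)
print(ranking)
```

Sequential Elo updates depend on the order in which votes arrive, which is one reason a production arena might prefer fitting all votes jointly.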

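Each format has failure modes. Static benchmarks become contaminated as test items leak into training data, and they saturate once top models cluster near the ceiling. Arena rankings can reward style and tone over correctness. Task-specific leaderboards say little about performance outside their one task. The healthy view is to read several rather than trust any one.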
