Glossary

MMLU

A multiple-choice benchmark covering 57 academic and professional subjects, once the default capability score, now largely saturated by frontier models above 88% accuracy.

Category: Evaluation. Also known as: Massive Multitask Language Understanding.

A multiple-choice exam with 14,000+ questions across 57 subjects, ranging from elementary math to professional law. The benchmark was designed as a capability proxy: a single number per model that summarized breadth of knowledge. From 2020 to 2023 MMLU was the dominant headline score in model release announcements.

By 2024 leading models exceeded 88% on MMLU and the benchmark stopped discriminating well at the top. The follow-up MMLU-Pro (Wang et al., 2024) increased question difficulty and the option count from 4 to 10, reintroducing discrimination at the cost of comparability with the older score.
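One reason the two scores are not directly comparable is that the random-guess floor moves with the option count. The sketch below illustrates this with a chance-adjusted accuracy. It assumes every item has exactly the stated number of options, and the function names and the raw accuracies are hypothetical, chosen only for illustration. This rescaling is not part of the official MMLU or MMLU-Pro scoring.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    if num_options < 2:
        raise ValueError("need at least two options")
    return 1.0 / num_options


def chance_adjusted(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    floor = chance_accuracy(num_options)
    return (accuracy - floor) / (1.0 - floor)


# Hypothetical raw accuracies, not reported results for any model.
four_option = chance_adjusted(0.88, num_options=4)
ten_option = chance_adjusted(0.70, num_options=10)

print(f"4 options:  chance={chance_accuracy(4):.2f}, adjusted={four_option:.3f}")
print(f"10 options: chance={chance_accuracy(10):.2f}, adjusted={ten_option:.3f}")
```

Adjusting for chance removes only the guessing floor. Because MMLU-Pro also changed the questions themselves, even adjusted scores from the two benchmarks should not be read on a common scale.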

The takeaway: a model “scoring high on MMLU” in 2026 has met a necessary bar, not a sufficient one. Practical model comparison has shifted to multi-benchmark suites (Open LLM Leaderboard, HELM, Chatbot Arena) and to task-specific evaluation (coding, reasoning, agent behavior).

