Glossary
lm-eval-harness
EleutherAI's open-source evaluation framework that runs hundreds of standardized benchmarks against any Hugging Face or OpenAI-compatible model; it is the de facto reference harness behind the Open LLM Leaderboard.
EleutherAI’s evaluation harness. It implements a couple hundred standardized benchmark tasks (MMLU, ARC, HellaSwag, GSM8K, IFEval, GPQA, MUSR, BBH, and many more) with a unified runner that supports Hugging Face models, OpenAI-compatible APIs, vLLM-served models, and several other backends.
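To make the "unified runner" concrete, here is a minimal sketch that assembles, but does not execute, a command-line call for the Hugging Face backend. The helper name `build_lm_eval_command` is hypothetical, and the model and task names are only illustrative. The flags shown (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, `--output_path`) are assumed to match the 0.4.x `lm_eval` CLI, so check them against the version you have installed.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, num_fewshot=None,
                          batch_size=8, output_path="results"):
    """Assemble (but do not run) an lm_eval CLI call for the hf backend.

    Hypothetical helper for illustration; flag names are assumed to
    match the 0.4.x command-line interface.
    """
    cmd = [
        "lm_eval",
        "--model", "hf",                              # Hugging Face backend
        "--model_args", f"pretrained={pretrained}",   # which checkpoint to load
        "--tasks", ",".join(tasks),                   # comma-separated task names
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    # Only pass a few-shot count when the caller sets one explicitly,
    # so each task otherwise keeps its own default.
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


if __name__ == "__main__":
    # Illustrative model and tasks; substitute your own.
    cmd = build_lm_eval_command(
        "EleutherAI/pythia-160m", ["hellaswag", "arc_easy"], num_fewshot=0
    )
    print(shlex.join(cmd))
```

Under the same assumptions, pointing the runner at another backend would mostly mean changing the `--model` and `--model_args` values while leaving the task list alone.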
The harness is the reference implementation behind the Open LLM Leaderboard and behind most “we ran MMLU on our model and got 78.4” claims in the open-weights community. Reporting standardized harness results is the table-stakes evaluation step for any model release.
Two things to know. Version pinning matters: harness updates have changed scoring conventions in ways that shift reported numbers by several points. And the harness measures what it measures: an aggregate of academic benchmarks. Agent and real-task evaluation needs separate tooling (SWE-Bench harness, τ-bench harness, etc.).
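One lightweight way to act on the version-pinning point is to store the harness version next to every score and refuse to compare numbers produced under different versions. The sketch below assumes the installed distribution is named `lm_eval`. The helpers `harness_version`, `tag_result`, and `comparable` are hypothetical, and the version strings and scores in the usage lines are placeholders, not measured results.

```python
from importlib.metadata import PackageNotFoundError, version


def harness_version(dist_name="lm_eval"):
    """Return the installed harness version, or None if it is not installed.

    The distribution name is an assumption; adjust it if your install differs.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def tag_result(task, score, harness_ver):
    """Bundle a score with the harness version that produced it."""
    return {"task": task, "score": score, "harness_version": harness_ver}


def comparable(a, b):
    """Treat two results as comparable only if they share a task and a
    known, identical harness version."""
    return (
        a["task"] == b["task"]
        and a["harness_version"] is not None
        and a["harness_version"] == b["harness_version"]
    )


if __name__ == "__main__":
    # Placeholder versions and scores, for illustration only.
    ours = tag_result("mmlu", 78.4, "0.4.2")
    theirs = tag_result("mmlu", 76.9, "0.4.3")
    print("installed harness:", harness_version())
    print("safe to compare:", comparable(ours, theirs))
```

A stricter setup would pin the exact version in a requirements file as well; the record-and-check approach here only catches a mismatch after the fact.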