Evaluation Results

This feature is a work in progress.

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark’s leaderboard.

Benchmark Datasets

Dataset repos can be defined as Benchmarks (e.g., MMLU-Pro, HLE, GPQA). These repos display a “Benchmark” tag, automatically aggregate evaluation results from model repos across the Hub, and show a leaderboard of top models.

[Screenshot: a benchmark dataset page with its leaderboard]

Model Evaluation Results

Evaluation scores are stored in model repos as YAML files in the .eval_results/ folder. These results appear on the model page and are aggregated into the corresponding benchmark’s leaderboard.

[Screenshot: evaluation results displayed on a model page]

Adding Evaluation Results

To add evaluation results to a model, you can submit a PR to the model repo with a YAML file in the .eval_results/ folder.

Create a YAML file in .eval_results/*.yaml in your model repo:
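A hedged illustration of what such a file might look like (the field names here are assumptions for illustration, not the authoritative schema; see the Eval Results specifications):

```yaml
# .eval_results/mmlu-pro.yaml — illustrative only; field names are assumptions
benchmark: TIGER-Lab/MMLU-Pro     # benchmark dataset the score belongs to
metrics:
  - name: accuracy
    value: 0.71
source: https://example.com/eval-logs   # optional: link to evaluation logs
verifyToken: "..."   # optional: present when the evaluation ran in HF Jobs with inspect-ai
```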

Or, with only the required attributes:
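For instance, a sketch keeping only what might be the required attributes (again, the field names are illustrative assumptions):

```yaml
# Illustrative minimal file; field names are assumptions
benchmark: TIGER-Lab/MMLU-Pro
metrics:
  - name: accuracy
    value: 0.71
```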

Results display badges based on their metadata in the YAML file:

Badge conditions:

verified: has a valid verifyToken (the evaluation ran in HF Jobs with inspect-ai)
community: the result was submitted via an open PR (not merged to main)
leaderboard: links to the benchmark dataset
source: links to evaluation logs or an external source

For more details on how to format this data, check out the Eval Results specifications.

Community Contributions

Anyone can submit evaluation results to any model via Pull Request:

  1. Go to the model page, click the “Community” tab, and open a Pull Request.
  2. Add a .eval_results/*.yaml file with your results.
  3. The PR will show as “community-provided” on the model page while open.
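The steps above can be staged programmatically before opening the PR. A minimal sketch, assuming a local clone of the model repo and illustrative YAML field names (not the official schema or tooling):

```python
# Sketch: stage an eval-results file inside a local clone of a model repo,
# ready to commit and submit as a Pull Request from the "Community" tab.
# The YAML field names are assumptions; check the Eval Results specifications.
import tempfile
from pathlib import Path


def write_eval_result(repo_dir: str, name: str, yaml_text: str) -> Path:
    """Create .eval_results/<name>.yaml inside the repo working tree."""
    out = Path(repo_dir) / ".eval_results" / f"{name}.yaml"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(yaml_text)
    return out


# Demo against a throwaway directory standing in for a model-repo clone.
repo = tempfile.mkdtemp()
path = write_eval_result(
    repo,
    "mmlu-pro",
    "benchmark: TIGER-Lab/MMLU-Pro\n"   # hypothetical field
    "metrics:\n"
    "  - name: accuracy\n"
    "    value: 0.71\n",
)
print(path.name)  # mmlu-pro.yaml
```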

For help evaluating a model, see the Evaluating models with Inspect guide.

Community scores are visible while the PR is open. If a score is disputed, the model author can close the PR to remove it. The goal is to surface existing evaluation data transparently while building toward a fully reproducible standard via verified scores.

Registering a Benchmark

To register your dataset as a benchmark:

  1. Create a dataset repo containing your evaluation data.
  2. Add an eval.yaml file to the repo root with your benchmark configuration, conforming to the specification defined below.
  3. The file is validated at push time.
  4. (Beta) Get in touch so we can add it to the allow-list.

Examples can be found in these benchmarks: GPQA, MMLU-Pro, HLE, GSM8K.

Eval.yaml specification

The eval.yaml file should contain the following top-level fields: name, description, evaluation_framework, and tasks (see the examples below).

Required fields in each tasks[] item:

Optional fields in each tasks[] item:

When setting evaluation_framework: inspect-ai, the following additional fields must also be set:

Minimal example (required fields only):

name: MathArena AIME 2026
description: The American Invitational Mathematics Exam (AIME).
evaluation_framework: math-arena

tasks:

Extended example:

name: MathArena AIME 2026
description: The American Invitational Mathematics Exam (AIME).
evaluation_framework: "math-arena"

tasks:

Extended example ("inspect-ai"-specific):

name: Humanity's Last Exam
description: >
  Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of
  human knowledge, designed to be the final closed-ended academic benchmark
  of its kind with broad subject coverage. Humanity's Last Exam consists of
  2,500 questions across dozens of subjects, including mathematics,
  humanities, and the natural sciences. HLE is developed globally by
  subject-matter experts and consists of multiple-choice and short-answer
  questions suitable for automated grading.
evaluation_framework: "inspect-ai"

tasks:
