# LLM Evaluation Benchmark Harnesses and Task Registries

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

LLM benchmark results are comparable only when the harness, task registry, prompt format, adapter, split, and aggregation rule are recorded alongside the score.

## Core Explanation

Evaluation harnesses turn model outputs into metrics, but the harness configuration is part of the result. A leaderboard number without task version, prompt template, adapter, and aggregation policy is hard to reproduce and easy to misread.

Agents should capture the exact task identifier, harness version or commit, dataset split, metric module, macro versus micro aggregation, model adapter, decoding parameters, and any filtering pipeline applied before scoring.
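One way to keep these fields together is a small record stored next to each score. The sketch below is illustrative only: `EvalRunRecord`, its field names, and the example values are hypothetical and do not come from any of the harnesses discussed here. It assumes that a short hash over the sorted fields is enough to tell whether two reported numbers came from the same configuration.

```python
from dataclasses import dataclass, asdict, field
import hashlib
import json


@dataclass(frozen=True)
class EvalRunRecord:
    """Hypothetical record of the context needed to interpret one score."""
    task_id: str            # exact task identifier in the registry
    harness: str            # which harness produced the score
    harness_version: str    # release tag or commit hash
    dataset_split: str      # e.g. "test" or "validation"
    metric_module: str      # name of the metric implementation
    aggregation: str        # "macro" or "micro"
    model_adapter: str      # how prompts were built and sent to the model
    decoding: dict = field(default_factory=dict)  # temperature, max tokens, ...
    filters: tuple = ()     # filtering steps applied before scoring

    def fingerprint(self) -> str:
        """Short, order-independent hash of every field in the record."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values, not taken from a real run.
run_a = EvalRunRecord(
    task_id="example_task_v1",
    harness="example-harness",
    harness_version="abc1234",
    dataset_split="test",
    metric_module="accuracy",
    aggregation="macro",
    model_adapter="generation",
    decoding={"temperature": 0.0, "max_tokens": 256},
    filters=("strip_whitespace",),
)

# Same run, but scored with micro-averaging instead of macro-averaging.
run_b = EvalRunRecord(**{**asdict(run_a), "aggregation": "micro"})

same_config = run_a.fingerprint() == run_b.fingerprint()
print(same_config)
```

Two scores whose fingerprints differ were produced under different settings and should not be placed side by side without a note explaining the difference.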

## Source-Mapped Facts

- Hugging Face Evaluate documentation says evaluation modules are split into metrics, comparisons, and measurements. ([source](https://huggingface.co/docs/evaluate/en/a_quick_tour))
- LM Evaluation Harness documentation says aggregate metric configuration can set `weight_by_size` to true for micro-averaging or false for macro-averaging. ([source](https://lm-evaluation-harness.readthedocs.io/writing_tasks/groups_and_benchmarks/))
- HELM documentation says a run specification can include a scenario specification, adapter specification, metric specifications, and groups. ([source](https://crfm-helm.readthedocs.io/en/latest/code/))
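
The macro versus micro distinction in the LM Evaluation Harness fact above can change a group score noticeably when subtasks differ in size. The following is a minimal arithmetic sketch in plain Python, not the harness's own code: the `aggregate` helper and the subtask numbers are hypothetical, and only the meaning of the `weight_by_size` flag is taken from the cited documentation.

```python
def aggregate(subtasks, weight_by_size):
    """Combine per-subtask scores into one group score.

    subtasks: list of (score, n_examples) pairs.
    weight_by_size=True  -> micro-average (each example counts equally).
    weight_by_size=False -> macro-average (each subtask counts equally).
    """
    if weight_by_size:
        total_examples = sum(n for _, n in subtasks)
        return sum(score * n for score, n in subtasks) / total_examples
    return sum(score for score, _ in subtasks) / len(subtasks)


# Hypothetical group: a small easy subtask and a larger hard one.
subtasks = [(0.75, 100), (0.25, 300)]

macro = aggregate(subtasks, weight_by_size=False)  # (0.75 + 0.25) / 2 = 0.5
micro = aggregate(subtasks, weight_by_size=True)   # (75 + 75) / 400 = 0.375

print(f"macro={macro}, micro={micro}")
```

The same per-subtask results yield 0.5 under macro-averaging and 0.375 under micro-averaging, which is why the aggregation rule belongs in any reported result.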

## Further Reading

- [Hugging Face Evaluate Quick Tour](https://huggingface.co/docs/evaluate/en/a_quick_tour)
- [LM Evaluation Harness Groups and Benchmarks](https://lm-evaluation-harness.readthedocs.io/writing_tasks/groups_and_benchmarks/)
- [CRFM HELM Code Structure](https://crfm-helm.readthedocs.io/en/latest/code/)