# LLM Evaluation: lm-eval Harness Task YAML

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

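In lm-evaluation-harness (lm-eval), a task's YAML configuration together with the harness commit hash is meant to pin down the evaluation setup, and the `--check_integrity` flag runs the task test suite before evaluation. Recording both makes LLM benchmark runs easier for agents to reproduce and audit.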

## Core Explanation

LLM benchmark results are only comparable when the task definition, prompt, model adapter, dataset, and code version are explicit. A harness task YAML can package much of that setup, while integrity checks reduce the risk of silent task breakage.
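One way to make the task definition and code version explicit is to fingerprint the task YAML and record the commit of the harness checkout. The sketch below is illustrative only: `yaml_fingerprint` and `harness_commit` are hypothetical helper names, not part of lm-evaluation-harness, and the second helper assumes `git` is installed and that the harness was installed from a local clone.

```python
import hashlib
import subprocess
from pathlib import Path


def yaml_fingerprint(yaml_path: str) -> str:
    """Return the SHA-256 hex digest of the task YAML's raw bytes.

    Hypothetical helper: hashes only the file given, so any file that
    the YAML pulls in from elsewhere is not covered by this digest.
    """
    return hashlib.sha256(Path(yaml_path).read_bytes()).hexdigest()


def harness_commit(repo_dir: str) -> str:
    """Return the checked-out commit hash of a local harness clone.

    Assumes `git` is on PATH and `repo_dir` is inside a git work tree;
    raises subprocess.CalledProcessError otherwise.
    """
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Hashing the raw bytes is a deliberate simplification: two YAML files that differ only in whitespace get different digests, which is usually acceptable when the goal is to detect any change at all.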

Agents should capture the task YAML path, harness commit, model adapter, few-shot setting, prompt text, dataset revision, metric, cache settings, and integrity-check status before citing a harness score.
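The checklist above could be held in a small provenance record that reports which items are still missing. This is a minimal sketch under stated assumptions: the class name, field names, and example path are hypothetical and do not correspond to any lm-evaluation-harness API or output format.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HarnessRunRecord:
    """Hypothetical provenance record; every field name is illustrative."""

    task_yaml_path: Optional[str] = None
    harness_commit: Optional[str] = None
    model_adapter: Optional[str] = None
    num_fewshot: Optional[int] = None  # 0 is a valid, recorded value
    prompt_text: Optional[str] = None
    dataset_revision: Optional[str] = None
    metric: Optional[str] = None
    cache_settings: Optional[str] = None
    integrity_checked: Optional[bool] = None  # False still counts as recorded


def missing_fields(record: HarnessRunRecord) -> List[str]:
    """List fields that were never filled in (still None), in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


def is_citable(record: HarnessRunRecord) -> bool:
    """True only when every checklist item has been recorded."""
    return not missing_fields(record)


# Example with a made-up path: a partially filled record is not yet citable.
record = HarnessRunRecord(
    task_yaml_path="tasks/demo/demo.yaml",
    num_fewshot=0,
    metric="acc",
)
print(missing_fields(record))
print(is_citable(record))
```

Comparing against `None` rather than truthiness matters here, because a zero-shot setting (`num_fewshot=0`) and a skipped integrity check (`integrity_checked=False`) are both legitimate recorded values.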

## Source-Mapped Facts

- EleutherAI lm-evaluation-harness documentation describes the project as a framework that supports a wide range of zero-shot and few-shot evaluation tasks on autoregressive language models. ([source](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md))
- EleutherAI lm-evaluation-harness task documentation says YAML configuration files and the current codebase commit hash are intended to be shareable for replicating evaluation setup. ([source](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md))
- EleutherAI lm-evaluation-harness interface documentation lists `--check_integrity` as a flag that runs task test suite validation before evaluation. ([source](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md))
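
To keep the integrity check from being forgotten, an agent might assemble the command line programmatically and log it alongside the score. The sketch below only builds the argument list and does not run the harness. `build_lm_eval_argv` is a hypothetical helper; the flag names follow the interface documentation linked above and should be confirmed against the installed version (for example with `lm_eval --help`), and the model and task values are illustrative.

```python
import shlex
from typing import List, Optional


def build_lm_eval_argv(
    model: str,
    model_args: str,
    tasks: List[str],
    num_fewshot: Optional[int] = None,
    check_integrity: bool = True,
    output_path: Optional[str] = None,
) -> List[str]:
    """Build an lm_eval argument list (hypothetical helper, does not execute it)."""
    argv = [
        "lm_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
    ]
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    if check_integrity:
        # Ask the harness to run the selected tasks' tests before evaluating.
        argv.append("--check_integrity")
    if output_path is not None:
        argv += ["--output_path", output_path]
    return argv


# Illustrative values; substitute the model adapter and task actually used.
argv = build_lm_eval_argv(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    output_path="results/",
)
print(shlex.join(argv))  # a loggable, copy-pasteable command string
```

Returning a list instead of a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues, while `shlex.join` gives a readable form for the audit log.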

## Further Reading

- [lm-evaluation-harness New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md)
- [lm-evaluation-harness Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md)
- [lm-evaluation-harness Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md)