Evaluation Data Contamination

Status: public · Confidence: medium (0.62) · Basis: verified_sources

## TL;DR

Evaluation data contamination occurs when benchmark questions, answers, or close variants appear in training or tuning data, so evaluation scores overstate how well the model actually generalizes.

## Core Explanation

LLM benchmarks are often public, copied, discussed, and reused. If test items leak into pretraining, fine-tuning, or prompt-selection loops, a model may reproduce memorized answers instead of solving the task, and its score no longer measures the capability the benchmark was built to test.

Mitigations include private test sets, time-split benchmarks, live or frequently refreshed questions, contamination scans, exact and near-duplicate filters, and reporting known exposure risks. Evaluation reports should separate clean benchmark evidence from metrics that may be contaminated.
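As a rough illustration of a contamination scan with near-duplicate filtering, the sketch below flags benchmark items whose word n-grams largely overlap with a training corpus, then splits the items into clean and flagged groups so they can be reported separately. The function names, the 8-gram window, and the 0.5 threshold are assumptions chosen for the example, not values taken from MMLU-CF, LiveBench, or CLEAN-EVAL.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams; empty if the text has fewer than n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_ngrams: set[tuple[str, ...]], n: int) -> float:
    """Fraction of the item's n-grams that also occur in the training corpus."""
    item_ngrams = ngrams(normalize(item), n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


def split_by_contamination(items, training_docs, n=8, threshold=0.5):
    """Split benchmark items into (clean, flagged) lists.

    n and threshold are illustrative defaults (assumptions), not standard values.
    """
    # Index every n-gram that appears anywhere in the training documents.
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(normalize(doc), n)

    clean, flagged = [], []
    for item in items:
        if overlap_ratio(item, corpus_ngrams, n) >= threshold:
            flagged.append(item)
        else:
            clean.append(item)
    return clean, flagged


# Hypothetical toy data; a short n is used because the texts are short.
training_docs = ["Forum post: What is the capital of France? The answer is Paris, obviously."]
benchmark_items = [
    "what is the CAPITAL of France??",
    "Which gas do plants absorb during photosynthesis?",
]
clean, flagged = split_by_contamination(benchmark_items, training_docs, n=3)
```

Because the comparison runs on normalized tokens, differences in case and punctuation do not hide a match. Surface overlap of this kind will still miss paraphrased or translated leaks, so a clean result here is weak evidence rather than proof of no exposure.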

## Source-Mapped Facts

- The Microsoft MMLU-CF repository describes MMLU-CF as a contamination-free multi-task language understanding benchmark. ([source](https://github.com/microsoft/MMLU-CF))
- The LiveBench repository describes LiveBench as a challenging, contamination-free LLM benchmark. ([source](https://github.com/livebench/livebench))
- The CLEAN-EVAL paper page describes a method for clean evaluation on contaminated large language models. ([source](https://aclanthology.org/2024.findings-naacl.53/))

## Further Reading

- [MMLU-CF](https://github.com/microsoft/MMLU-CF)
- [LiveBench](https://github.com/livebench/livebench)
- [CLEAN-EVAL](https://aclanthology.org/2024.findings-naacl.53/)