# Evaluation Data Contamination
Status: public · Confidence: medium (0.62) · Basis: verified_sources

## TL;DR

Evaluation data contamination occurs when benchmark questions, answers, or close variants appear in training or tuning data, so that evaluation scores overstate the model's real generalization.

## Core Explanation

LLM benchmarks are often public, copied, discussed, and reused. If test items leak into pretraining, fine-tuning, or prompt-selection loops, a model may memorize the benchmark instead of solving the task.

Mitigations include private test sets, time-split benchmarks, live or frequently refreshed questions, contamination scans, exact and near-duplicate filters, and reporting known exposure risks. Evaluation reports should separate clean benchmark evidence from metrics that may be contaminated.

## Source-Mapped Facts

- The Microsoft MMLU-CF repository describes MMLU-CF as a contamination-free multi-task language understanding benchmark. ([source](https://github.com/microsoft/MMLU-CF))
- The LiveBench repository describes LiveBench as a challenging, contamination-free LLM benchmark. ([source](https://github.com/livebench/livebench))
- The CLEAN-EVAL paper page describes a method for clean evaluation on contaminated large language models. ([source](https://aclanthology.org/2024.findings-naacl.53/))

## Further Reading

- [MMLU-CF](https://github.com/microsoft/MMLU-CF)
- [LiveBench](https://github.com/livebench/livebench)
- [CLEAN-EVAL](https://aclanthology.org/2024.findings-naacl.53/)
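
## Illustrative Sketch: Exact and Near-Duplicate Scan

The contamination scans and exact and near-duplicate filters mentioned under Core Explanation can be sketched roughly as below. This is a minimal illustration, not the method used by any of the cited projects. The function names (`normalize`, `word_ngrams`, `scan_for_contamination`), the trigram size, and the 0.5 threshold are all assumptions chosen for the example. It flags a benchmark item as an exact hit when its normalized text appears verbatim in a training document, and as a near hit when a large enough share of its word n-grams appears there.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        # Short texts fall back to a single shorter gram (or nothing).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def scan_for_contamination(benchmark_items, training_docs, n=3, threshold=0.5):
    """Return (item_index, doc_index, kind, score) for each suspected leak.

    kind is "exact" when the normalized item occurs verbatim in the document,
    and "near" when the share of the item's n-grams found in the document
    (a containment score) reaches the hypothetical threshold.
    """
    docs_norm = [normalize(doc) for doc in training_docs]
    docs_grams = [word_ngrams(doc, n) for doc in training_docs]
    findings = []
    for i, item in enumerate(benchmark_items):
        item_norm = normalize(item)
        item_grams = word_ngrams(item, n)
        if not item_grams:
            continue  # nothing to compare for an empty item
        for j, doc_norm in enumerate(docs_norm):
            # Pad with spaces so matches respect word boundaries.
            if f" {item_norm} " in f" {doc_norm} ":
                findings.append((i, j, "exact", 1.0))
                continue
            score = len(item_grams & docs_grams[j]) / len(item_grams)
            if score >= threshold:
                findings.append((i, j, "near", score))
    return findings


if __name__ == "__main__":
    # Toy data invented for this example.
    benchmark = [
        "What is the capital of France?",
        "Which planet is known as the Red Planet?",
    ]
    docs = [
        "Trivia dump: what is the capital of France? Paris.",
        "Quiz notes: which planet is widely known as the red planet in astronomy",
        "Unrelated text about cooking pasta at home.",
    ]
    for finding in scan_for_contamination(benchmark, docs):
        print(finding)
```

On the toy data, the first item is reported as an exact hit against the first document and the second as a near hit against the second document. The pairwise loop is quadratic and only suits small samples. A scan over a real pretraining corpus would likely need hashing or an index, and paraphrased leaks can still slip past n-gram matching.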