# AI Benchmarks: MMLU, SWE-bench, and How We Measure Intelligence
Status: public · Confidence: medium (0.82) · Basis: verified_sources
## TL;DR

AI benchmarks measure different slices of model behavior. MMLU, HELM, BIG-bench, and SWE-bench are distinct evaluation artifacts rather than interchangeable measures of general intelligence.

## Core Explanation

The repaired entry keeps claims close to each benchmark paper's stated scope. It avoids stronger claims about standardization, intelligence, or current leaderboard status unless directly supported by the cited source.

## Further Reading

- [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
- [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110)
- [Beyond the Imitation Game](https://arxiv.org/abs/2206.04615)
- [SWE-bench](https://arxiv.org/abs/2310.06770)

## Related Articles

- [AI Evaluation](../model-evaluation.md)
- [Language Models](../large-language-models.md)
- [AI Coding Assistants](../ai-coding-assistants.md)