# AI Benchmarks: MMLU, SWE-bench, and How We Measure Intelligence

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR

AI benchmarks measure different slices of model behavior. MMLU, HELM, BIG-bench, and SWE-bench are distinct evaluation artifacts rather than interchangeable measures of general intelligence.

## Core Explanation

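Each benchmark is best described by the scope its own paper states. MMLU is a multiple-choice test spanning 57 subjects, scored by accuracy. HELM is an evaluation framework that reports several metrics (such as accuracy, calibration, robustness, fairness, and efficiency) across many scenarios. BIG-bench is a community-contributed collection of more than 200 tasks meant to probe capabilities that earlier benchmarks did not cover. SWE-bench asks a model to resolve real GitHub issues from Python repositories and checks the resulting patch against the repository's tests. This entry avoids stronger claims about standardization, intelligence, or current leaderboard status unless the cited source directly supports them.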
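One way to see why these benchmarks are not interchangeable is to compare how an MMLU-style score and a SWE-bench-style score are computed. The sketch below is a minimal illustration under stated assumptions, not either benchmark's official harness: the class and function names are hypothetical, the inputs are made-up toy records, and the "resolved" rule is a simplified reading of the fail-to-pass and pass-to-pass test criteria described in the SWE-bench paper.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """One MMLU-style item (hypothetical record): the model picks a lettered option."""
    predicted: str  # e.g. "B"
    correct: str    # e.g. "C"


@dataclass
class PatchResult:
    """One SWE-bench-style task (hypothetical record): test outcomes after applying a model's patch."""
    fail_to_pass: list[bool]  # tests that failed before the fix; True means they pass now
    pass_to_pass: list[bool]  # tests that passed before the fix; True means they still pass


def multiple_choice_accuracy(items: list[MultipleChoiceItem]) -> float:
    """Fraction of items where the chosen option matches the answer key."""
    if not items:
        raise ValueError("no items to score")
    return sum(item.predicted == item.correct for item in items) / len(items)


def is_resolved(result: PatchResult) -> bool:
    """A task counts as resolved only if every tracked test passes.

    Assumption of this sketch: a task with no fail-to-pass tests is
    treated as unresolved, since there is nothing to demonstrate the fix.
    """
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass)
        and all(result.pass_to_pass)
    )


def resolved_rate(results: list[PatchResult]) -> float:
    """Fraction of tasks whose patch is counted as resolving the issue."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)
```

Both functions return a number between 0 and 1, but the numbers answer different questions: the first counts matching answer letters, the second counts patches that pass a repository's tests. A model's score on one says little by itself about its score on the other.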

## Further Reading

- [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)
- [Holistic Evaluation of Language Models](https://arxiv.org/abs/2211.09110)
- [Beyond the Imitation Game](https://arxiv.org/abs/2206.04615)
- [SWE-bench](https://arxiv.org/abs/2310.06770)

## Related Articles

- [AI Evaluation](../model-evaluation.md)
- [Language Models](../large-language-models.md)
- [AI Coding Assistants](../ai-coding-assistants.md)