# AI Benchmarks: MMLU, SWE-bench, and How We Measure Intelligence

## TL;DR
AI benchmarks measure progress, but they are targets as much as tests: once a score becomes a goal, models get tuned toward it. MMLU evaluates broad knowledge; HumanEval tests function-level coding; SWE-bench measures whether a model can resolve real GitHub issues; ARC-AGI probes abstraction. As models saturate existing benchmarks, harder ones emerge (GPQA, Humanity's Last Exam).

## Core Explanation
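Benchmarks fall into four broad families:

- Knowledge: MMLU (57 subjects, four-choice multiple choice), MMLU-Pro (harder, ten answer choices), GPQA Diamond (graduate-level science questions on which skilled non-experts score below 40% even with web access, while domain experts score substantially higher).
- Coding: HumanEval (function completion checked by unit tests), MBPP (short Python programming tasks), LiveCodeBench (competitive-programming problems collected continuously).
- Reasoning: MATH, AIME (competition math), ARC-AGI (abstraction puzzles).
- Agentic: SWE-bench (resolving real GitHub issues), Terminal-Bench (tasks in a command-line environment), WebArena (tasks in web-browser environments).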
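
Coding benchmarks such as HumanEval are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below uses the unbiased estimator introduced with HumanEval, 1 - C(n-c, k) / C(n, k), where n completions were sampled per problem and c of them passed. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are illustrative assumptions, not part of any benchmark's official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: budget of attempts being scored (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    # 1 minus the chance that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run: three problems, 10 samples each.
    hypothetical = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(hypothetical, k=1))
    print(mean_pass_at_k(hypothetical, k=5))
```

Sampling more completions than k (n > k) and using the estimator gives a lower-variance score than drawing exactly k samples and checking whether any passed.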

## Detailed Analysis
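Benchmarks have three recurring limitations: training data contamination (test items leak into training corpora), prompt sensitivity (scores can shift by 5-10 percentage points when only the prompt changes), and saturation (top models score above 95%, leaving little room to separate them). Two responses to these problems are dynamic benchmarks such as LiveBench, which refreshes its questions regularly, and Chatbot Arena, which ranks models with Elo-style ratings derived from crowdsourced pairwise human votes rather than a fixed question set. Results also depend on the evaluation setting (zero-shot, few-shot, chain-of-thought, or tool-augmented), so scores are comparable only when the setting matches.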
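
Arena-style leaderboards turn pairwise votes into ratings. As a rough illustration, here is the classic online Elo update. The K-factor of 32 and the 400-point scale are conventional chess defaults assumed for the example, the function names are hypothetical, and the rating computation a production leaderboard actually uses may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    k is an assumed K-factor that controls how fast ratings move.
    """
    if not 0.0 <= score_a <= 1.0:
        raise ValueError("score_a must be between 0 and 1")
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical votes between two models that start level.
    a, b = 1000.0, 1000.0
    for outcome in (1.0, 1.0, 0.5, 0.0):
        a, b = update_elo(a, b, outcome)
    print(round(a, 1), round(b, 1))
```

Because each update depends on the current ratings, an upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does.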

## Further Reading
- Epoch AI: Benchmarks Dashboard
- Stanford HELM: Holistic Evaluation
- LMSYS Chatbot Arena