# Agent Benchmarks
Status: public · Confidence: medium (0.82) · Basis: verified_sources
## TL;DR

Agent benchmarks evaluate model systems that act over time: they inspect environments, call tools, modify state, and complete tasks under measurable success criteria.

## Core Explanation

Agent evaluation is different from static question answering because the model must make sequential decisions. Benchmarks such as AgentBench, WebArena, and SWE-bench test interaction with environments, web workflows, codebases, and execution feedback.

The unit under test is often the whole scaffold: base model, prompts, tool definitions, retrieval, execution environment, validators, retries, and state management. Reported results should therefore specify the scaffold and constraints, not only the model name.

## Source-Mapped Facts

- AgentBench presents a multi-dimensional benchmark with eight distinct environments for evaluating LLM agents in multi-turn, open-ended settings. ([source](https://arxiv.org/abs/2308.03688))
- WebArena builds a reproducible web-agent environment with functional websites for e-commerce, social forum discussion, collaborative software development, and content management tasks. ([source](https://arxiv.org/abs/2307.13854))
- SWE-bench evaluates language models by asking them to edit codebases to resolve real GitHub issues from open-source Python repositories. ([source](https://arxiv.org/abs/2310.06770))
- SWE-bench problems require models to understand and coordinate changes across functions, classes, and files rather than only generate short code snippets. ([source](https://arxiv.org/abs/2310.06770))

## Further Reading

- [AgentBench](https://arxiv.org/abs/2308.03688)
- [WebArena](https://arxiv.org/abs/2307.13854)
- [SWE-bench](https://arxiv.org/abs/2310.06770)
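
## Example: Reporting the Scaffold

The Core Explanation argues that a reported result should name the scaffold and its constraints, not only the model. One way to make that concrete is to store the scaffold description next to the score. The sketch below is illustrative only: `ScaffoldSpec`, `BenchmarkReport`, their fields, and the sample values are hypothetical names chosen for this example and are not part of AgentBench, WebArena, or SWE-bench.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ScaffoldSpec:
    """Hypothetical description of everything wrapped around the base model."""
    base_model: str
    prompt_version: str
    tools: tuple          # names of the tools exposed to the agent
    max_steps: int        # per-task step budget (a constraint worth reporting)
    max_retries: int      # how many times a failed task may be re-attempted


@dataclass
class BenchmarkReport:
    """Pairs per-task outcomes with the scaffold that produced them."""
    benchmark: str
    scaffold: ScaffoldSpec
    outcomes: dict = field(default_factory=dict)  # task_id -> bool

    def record(self, task_id: str, success: bool) -> None:
        self.outcomes[task_id] = bool(success)

    def success_rate(self) -> float:
        # Fraction of recorded tasks that met the success criterion.
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes.values()) / len(self.outcomes)

    def to_json(self) -> str:
        # The score is always serialized together with the scaffold.
        return json.dumps(
            {
                "benchmark": self.benchmark,
                "scaffold": asdict(self.scaffold),
                "num_tasks": len(self.outcomes),
                "success_rate": self.success_rate(),
            },
            indent=2,
            sort_keys=True,
        )


# Usage with made-up values (not real benchmark results).
spec = ScaffoldSpec(
    base_model="example-model",
    prompt_version="v1",
    tools=("shell", "editor"),
    max_steps=30,
    max_retries=1,
)
report = BenchmarkReport(benchmark="SWE-bench", scaffold=spec)
report.record("task-1", True)
report.record("task-2", False)
print(report.to_json())
```

Keeping the scaffold inside the same record means two runs with the same base model but different tools, budgets, or retry policies cannot be mistaken for the same configuration when their scores are compared.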