# Agent Benchmarks

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR

Agent benchmarks evaluate model-driven systems that act over time: the system inspects an environment, calls tools, modifies state, and is scored on whether it completes tasks under measurable success criteria.

## Core Explanation

Agent evaluation differs from static question answering because the model must make sequential decisions, with the outcome of each action feeding into the next. AgentBench tests multi-turn interaction across several environments, WebArena tests web workflows on functional websites, and SWE-bench tests edits to real codebases.
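To make the sequential-decision point concrete, here is a minimal sketch of an observe-act loop with a success check at the end. Everything in it is hypothetical: `CounterEnv`, `simple_policy`, and `run_episode` are illustrative names, and the policy is a hard-coded stand-in for a model call, not any benchmark's actual interface.

```python
from dataclasses import dataclass


@dataclass
class CounterEnv:
    """Toy environment (hypothetical): the task is to bring a counter to `target`."""
    target: int = 3
    value: int = 0

    def observe(self) -> int:
        return self.value

    def step(self, action: str) -> None:
        # Actions modify environment state; unknown actions are ignored.
        if action == "increment":
            self.value += 1
        elif action == "reset":
            self.value = 0

    def success(self) -> bool:
        # Measurable success criterion, checked against the final state.
        return self.value == self.target


def simple_policy(observation: int, target: int) -> str:
    """Stand-in for a model call: choose the next action from the observation."""
    return "increment" if observation < target else "stop"


def run_episode(env: CounterEnv, max_steps: int = 10) -> dict:
    """Run one episode: observe, act, repeat until the policy stops or the budget runs out."""
    trajectory = []
    for _ in range(max_steps):
        obs = env.observe()
        action = simple_policy(obs, env.target)
        if action == "stop":
            break
        env.step(action)
        trajectory.append((obs, action))
    return {"success": env.success(), "steps": len(trajectory), "trajectory": trajectory}


result = run_episode(CounterEnv(target=3))
print(result["success"], result["steps"])  # True 3
```

Note that the step budget is part of the task definition here: the same policy fails a task with `target=5` if `max_steps=2`, which is one reason constraints belong in any reported result.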

The unit under test is often the whole scaffold: base model, prompts, tool definitions, retrieval, execution environment, validators, retries, and state management. Reported results should therefore specify the scaffold and constraints, not only the model name.
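One way to follow this advice is to attach a scaffold description to every reported score. The sketch below assumes a simple record format. The field names (`ScaffoldSpec`, `RunReport`, `prompt_version`, and so on) are hypothetical rather than a standard schema, and the numbers are placeholders, not real results.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScaffoldSpec:
    """Hypothetical description of everything around the base model."""
    base_model: str
    prompt_version: str
    tools: tuple
    retrieval: bool
    max_retries: int
    max_steps: int


@dataclass(frozen=True)
class RunReport:
    """A benchmark result that always carries its scaffold and constraints."""
    benchmark: str
    scaffold: ScaffoldSpec
    tasks_attempted: int
    tasks_resolved: int

    @property
    def success_rate(self) -> float:
        # Guard against division by zero for empty runs.
        return self.tasks_resolved / self.tasks_attempted if self.tasks_attempted else 0.0

    def to_json(self) -> str:
        payload = asdict(self)  # nested dataclasses become nested dicts
        payload["success_rate"] = round(self.success_rate, 4)
        return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values for illustration only.
report = RunReport(
    benchmark="example-benchmark",
    scaffold=ScaffoldSpec(
        base_model="example-model",
        prompt_version="v1",
        tools=("shell", "file_editor"),
        retrieval=False,
        max_retries=1,
        max_steps=20,
    ),
    tasks_attempted=10,
    tasks_resolved=4,
)
print(report.to_json())
```

Two reports with the same `base_model` but different `tools` or `max_steps` are then visibly different systems, which is the distinction a model-name-only leaderboard entry hides.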

## Source-Mapped Facts

- AgentBench presents a multi-dimensional benchmark with eight distinct environments for evaluating LLM agents in multi-turn, open-ended settings. ([source](https://arxiv.org/abs/2308.03688))
- WebArena builds a reproducible web-agent environment with functional websites for e-commerce, social forum discussion, collaborative software development, and content management tasks. ([source](https://arxiv.org/abs/2307.13854))
- SWE-bench evaluates language models by asking them to edit codebases to resolve real GitHub issues from open-source Python repositories. ([source](https://arxiv.org/abs/2310.06770))
- SWE-bench problems require models to understand and coordinate changes across functions, classes, and files rather than only generate short code snippets. ([source](https://arxiv.org/abs/2310.06770))
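The facts above describe the task but not how an edit is judged. One common way to score code-editing tasks is to run the repository's tests before and after the model's change. The sketch below assumes that approach in simplified form: `is_resolved` and the test names are hypothetical, and this is not the SWE-bench harness or its API.

```python
def is_resolved(before: dict, after: dict, target_tests: set) -> bool:
    """Judge an edit from test outcomes (test name -> passed?).

    An edit counts as resolving the issue only if every target test passes
    afterwards and no test that passed before the edit now fails or is missing.
    """
    targets_pass = all(after.get(name, False) for name in target_tests)
    previously_passing = {name for name, ok in before.items() if ok}
    no_regressions = all(after.get(name, False) for name in previously_passing)
    return targets_pass and no_regressions


# Hypothetical outcomes: one existing test, one test that reproduces the issue.
before = {"test_parse": True, "test_issue_repro": False}
fixes_issue = {"test_parse": True, "test_issue_repro": True}
breaks_other = {"test_parse": False, "test_issue_repro": True}

print(is_resolved(before, fixes_issue, {"test_issue_repro"}))   # True
print(is_resolved(before, breaks_other, {"test_issue_repro"}))  # False
```

The second case shows why multi-file coordination matters: a change that fixes the reported behavior while breaking an unrelated function does not count as success under this kind of check.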

## Further Reading

- [AgentBench](https://arxiv.org/abs/2308.03688)
- [WebArena](https://arxiv.org/abs/2307.13854)
- [SWE-bench](https://arxiv.org/abs/2310.06770)