# LLM Evaluation: SWE-bench Verified Code-Agent Benchmarks

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

SWE-bench Verified gives agents a concrete basis for reading coding-agent benchmark claims: patches for real GitHub issues, evaluated in Docker, on a human-validated subset of 500 tasks.

## Core Explanation

Code-agent evaluation is not just code generation. The system must understand an issue, inspect a repository, edit files, produce a patch, and pass tests in a controlled environment. SWE-bench-style results therefore mix model ability, scaffold design, repository tools, retry policy, and benchmark hygiene.

Agents citing a SWE-bench result should name the split, harness, scaffold, number of attempts, model, Docker setup, patch format, and resolved-instance metric. Without those details, a score alone is weak evidence of how a coding agent will perform in production.
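
The reporting checklist above can be made mechanical. The following is a minimal sketch, not part of any SWE-bench tooling: `BenchmarkClaim` and its field names are hypothetical, chosen only to mirror the items listed in the paragraph. It flags which details a cited result leaves unstated.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    """Hypothetical record of the context behind one reported score."""

    model: Optional[str] = None
    split: Optional[str] = None
    harness: Optional[str] = None
    scaffold: Optional[str] = None
    attempts: Optional[int] = None
    docker_setup: Optional[str] = None
    patch_format: Optional[str] = None
    resolved: Optional[int] = None  # number of instances counted as resolved
    total: Optional[int] = None     # number of instances evaluated

    def missing(self) -> List[str]:
        """Return the names of fields the claim does not specify."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def resolved_rate(self) -> Optional[float]:
        """Return resolved / total, or None if either count is unavailable."""
        if self.resolved is None or not self.total:
            return None
        return self.resolved / self.total


if __name__ == "__main__":
    # Placeholder values for illustration only, not a real reported result.
    claim = BenchmarkClaim(model="example-model", split="SWE-bench Verified",
                           resolved=250, total=500)
    print("unspecified:", claim.missing())
    print("resolved rate:", claim.resolved_rate())
```

A claim with a non-empty `missing()` list is one to treat with caution: the score may be accurate, but it cannot be compared with other results until the gaps are filled.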

## Source-Mapped Facts

- SWE-bench documentation describes a benchmark for evaluating whether language models can resolve real-world GitHub issues. ([source](https://www.swebench.com/SWE-bench/))
- The SWE-bench Verified page describes it as a human-validated subset of 500 SWE-bench instances for evaluating coding agents and language models. ([source](https://www.swebench.com/verified.html))
- The SWE-bench FAQ says the evaluation process sets up a Docker environment, applies a generated patch, runs the repository test suite, and determines whether the patch resolves the issue. ([source](https://www.swebench.com/SWE-bench/faq/))
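
The evaluation flow in the FAQ bullet above (apply the patch, run tests, decide whether the issue is resolved) can be sketched as plain bookkeeping. This is an illustration only: it runs no Docker containers and does not call the SWE-bench harness. `InstanceRun`, `is_resolved`, and `resolved_rate` are hypothetical names, and the sketch assumes that "resolved" means the patch applied cleanly and every test designated for that instance passed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceRun:
    """Hypothetical outcome of evaluating one generated patch."""

    instance_id: str
    patch_applied: bool             # did the patch apply cleanly?
    test_results: Dict[str, bool]   # test name -> passed


def is_resolved(run: InstanceRun, required_tests: List[str]) -> bool:
    """Assumed criterion: patch applied and all required tests passed."""
    if not run.patch_applied or not required_tests:
        return False
    # A required test that never ran counts as a failure.
    return all(run.test_results.get(name, False) for name in required_tests)


def resolved_rate(runs: List[InstanceRun],
                  required: Dict[str, List[str]]) -> float:
    """Fraction of evaluated instances that count as resolved."""
    if not runs:
        raise ValueError("no runs to score")
    resolved = sum(
        is_resolved(run, required.get(run.instance_id, [])) for run in runs
    )
    return resolved / len(runs)
```

Note that the denominator here is every evaluated instance, including those where the patch failed to apply. A score computed only over instances that produced a valid patch would look higher, which is one reason the harness and metric need to be named alongside the number.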

## Further Reading

- [SWE-bench Overview](https://www.swebench.com/SWE-bench/)
- [SWE-bench Verified](https://www.swebench.com/verified.html)
- [SWE-bench FAQ](https://www.swebench.com/SWE-bench/faq/)