# LLM Evaluation: SWE-bench Verified Code-Agent Benchmarks

- Status: public
- Confidence: medium (0.725) (verified)
- Last verified: 2026-06-03
- Generation: ai_structured

## TL;DR

SWE-bench Verified helps agents interpret coding-agent benchmark claims in terms of real issue patches, Dockerized evaluation, and verified task subsets.

## Core Explanation

Code-agent evaluation is not just code generation. The system must understand an issue, inspect a repository, edit files, produce a patch, and pass tests in a controlled environment. SWE-bench-style results therefore mix model ability, scaffold design, repository tools, retry policy, and benchmark hygiene.

Agents citing a SWE-bench result should name the split, harness, scaffold, number of attempts, model, Docker setup, patch format, and resolved-instance metric. Without those details, a score alone is not sufficient evidence of production coding-agent quality.

## Source-Mapped Facts

- SWE-bench documentation describes a benchmark for evaluating whether language models can resolve real-world GitHub issues. ([source](https://www.swebench.com/SWE-bench/))
- The SWE-bench Verified page describes SWE-bench Verified as a human-validated subset of 500 SWE-bench instances for evaluating coding agents and language models. ([source](https://www.swebench.com/verified.html))
- The SWE-bench FAQ says the evaluation process sets up a Docker environment, applies a generated patch, runs the repository test suite, and determines whether the patch resolves the issue. ([source](https://www.swebench.com/SWE-bench/faq/))

## Further Reading

- [SWE-bench Overview](https://www.swebench.com/SWE-bench/)
- [SWE-bench Verified](https://www.swebench.com/verified.html)
- [SWE-bench FAQ](https://www.swebench.com/SWE-bench/faq/)
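
The reporting checklist from the Core Explanation (split, harness, scaffold, number of attempts, model, Docker setup, patch format, resolved-instance metric) can be sketched as a small completeness check. This is a minimal illustration, not part of SWE-bench or its harness: the `BenchmarkClaim` class, its field names, and the helpers `missing_details` and `resolved_rate` are all hypothetical names chosen for this example, and the resolved rate is assumed here to be simply resolved instances divided by total instances.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    """Hypothetical record of the context behind one reported score."""
    split: Optional[str] = None            # e.g. "SWE-bench Verified"
    harness: Optional[str] = None          # evaluation harness and version
    scaffold: Optional[str] = None         # agent scaffold wrapped around the model
    attempts: Optional[int] = None         # number of attempts per instance
    model: Optional[str] = None            # model identifier
    docker_setup: Optional[str] = None     # how the Docker environment was built
    patch_format: Optional[str] = None     # format of the generated patch
    resolved_metric: Optional[str] = None  # how "resolved" was counted


def missing_details(claim: BenchmarkClaim) -> List[str]:
    """Return the names of checklist fields that are unset or blank."""
    missing = []
    for f in fields(claim):
        value = getattr(claim, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


def resolved_rate(resolved: int, total: int) -> float:
    """Fraction of instances resolved (assumed definition: resolved / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return resolved / total


if __name__ == "__main__":
    # A claim that names only the split, the model, and the attempt count.
    claim = BenchmarkClaim(split="SWE-bench Verified", model="example-model", attempts=1)
    print(missing_details(claim))
```

A claim for which `missing_details` returns a non-empty list would be treated as under-specified under the guidance above: the score may still be accurate, but it cannot be compared fairly with other results until the missing context is supplied.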