# LLM Evaluation WebArena Web-Agent Benchmarks

- Status: public
- Confidence: medium (0.79) (verified)
- Last verified: 2026-06-03
- Generation: ai_structured

## TL;DR

WebArena-style evaluation tests whether an agent can use websites over time, not merely answer static web questions.

## Core Explanation

Web-agent benchmarks exercise observation, planning, UI action, state tracking, and recovery. A result is only interpretable when the environment version, task configuration, observation channel, action set, and evaluator are explicit.

Agents that cite WebArena results as evidence should preserve the task ID, website domain, reset state, observation mode (accessibility tree or DOM), action trace, terminal and browser errors, and evaluator output. Without these, a pass rate cannot show whether a failure came from perception, planning, tool execution, or environment setup.

## Source-Mapped Facts

- The WebArena paper introduces a web environment for autonomous agents based on functional websites from several real-world domains. ([source](https://webarena.dev/static/paper.pdf))
- The WebArena repository describes WebArena as a standalone, self-hostable web environment for building autonomous agents. ([source](https://github.com/web-arena-x/webarena))
- The WebArena repository includes environment, agent, configuration, and evaluation-harness code directories. ([source](https://github.com/web-arena-x/webarena))

## Further Reading

- [WebArena Paper PDF](https://webarena.dev/static/paper.pdf)
- [WebArena Repository](https://github.com/web-arena-x/webarena)
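
## Example: Evidence Record Sketch

The evidence fields listed under Core Explanation can be sketched as a small per-episode record. This is a minimal illustration, not WebArena's own schema: the class `EpisodeRecord`, its field names, and the helpers `missing_evidence` and `to_json_line` are all hypothetical and chosen only to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EpisodeRecord:
    """Hypothetical per-task evidence record; not part of the WebArena codebase."""

    task_id: str
    website_domain: str
    environment_version: str
    reset_state: str            # e.g. a label for the snapshot the sites were reset to
    observation_mode: str       # e.g. "accessibility_tree" or "dom"
    action_trace: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)  # terminal/browser errors, may be empty
    evaluator_output: Optional[Dict[str, Any]] = None


# Fields that must be non-empty before a result is treated as interpretable.
# "errors" is deliberately excluded: an empty error list is a valid outcome.
REQUIRED_FIELDS = (
    "task_id",
    "website_domain",
    "environment_version",
    "reset_state",
    "observation_mode",
    "action_trace",
    "evaluator_output",
)


def missing_evidence(record: EpisodeRecord) -> List[str]:
    """Return the names of required fields that are None or empty."""
    data = asdict(record)
    return [name for name in REQUIRED_FIELDS if data[name] in (None, "", [], {})]


def to_json_line(record: EpisodeRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

A caller might build one `EpisodeRecord` per task attempt, call `missing_evidence` before counting the attempt toward a pass rate, and append `to_json_line(record)` to a log file. A record with gaps then surfaces as an evidence problem and is not mistaken for an agent failure.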