Agent Evaluation Harnesses and Test Runs
Status: public · Confidence: medium (0.725) · Basis: verified_sources
## TL;DR

Agent evaluation harnesses turn example tasks, traces, graders, and thresholds into repeatable test runs for agent systems.

## Core Explanation

Agent behavior changes when tools, prompts, models, memory, or policies change. A harness gives teams a repeatable way to run known cases, collect outputs, score them, and compare experiments.

Agents should treat harness output as decision support. A failing eval can block risky changes, but humans still need to inspect whether the failure is a grader problem, a changed requirement, or a real regression.

## Source-Mapped Facts

- LangSmith evaluation documentation says each experiment captures outputs, evaluator scores, and execution traces for every dataset example. ([source](https://docs.langchain.com/langsmith/evaluation-concepts))
- Promptfoo command-line documentation says the eval command can return a nonzero exit code when a test case fails or pass rate is below a configured threshold. ([source](https://www.promptfoo.dev/docs/usage/command-line/))
- OpenAI Evals repository describes Evals as a framework for evaluating LLMs and LLM systems. ([source](https://github.com/openai/evals))

## Further Reading

- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [Promptfoo Command Line](https://www.promptfoo.dev/docs/usage/command-line/)
- [OpenAI Evals](https://github.com/openai/evals)
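
## Illustrative Sketch

The loop described in the Core Explanation (run known cases, collect outputs, score them, compare experiments, gate on a threshold) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of LangSmith, Promptfoo, or OpenAI Evals: the names `Case`, `Result`, `exact_match`, `run_eval`, `regressions`, and `gate` are hypothetical, the agents are stub functions, and the exact-match grader stands in for whatever grader a real harness would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One known task: a prompt plus the expected answer (hypothetical schema)."""
    case_id: str
    prompt: str
    expected: str


@dataclass
class Result:
    """The collected output and grader verdict for one case."""
    case_id: str
    output: str
    passed: bool


def exact_match(output: str, expected: str) -> bool:
    """Deliberately simple grader: case-insensitive, whitespace-trimmed equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    agent: Callable[[str], str],
    cases: List[Case],
    grader: Callable[[str, str], bool] = exact_match,
) -> List[Result]:
    """Run every case through the agent, collect the output, and score it."""
    results = []
    for case in cases:
        output = agent(case.prompt)
        results.append(Result(case.case_id, output, grader(output, case.expected)))
    return results


def pass_rate(results: List[Result]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def regressions(baseline: List[Result], candidate: List[Result]) -> List[str]:
    """Case ids that passed in the baseline run but fail in the candidate run."""
    base_passed = {r.case_id for r in baseline if r.passed}
    return [r.case_id for r in candidate if not r.passed and r.case_id in base_passed]


def gate(results: List[Result], threshold: float) -> int:
    """Process-style exit code: 0 if the pass rate meets the threshold, else 1."""
    return 0 if pass_rate(results) >= threshold else 1


# Stub agents standing in for two versions of a real agent system (assumption).
CASES = [
    Case("capital", "Capital of France?", "Paris"),
    Case("sum", "2 + 2?", "4"),
]


def baseline_agent(prompt: str) -> str:
    return {"Capital of France?": "Paris", "2 + 2?": "4"}.get(prompt, "")


def candidate_agent(prompt: str) -> str:
    return {"Capital of France?": "paris", "2 + 2?": "5"}.get(prompt, "")


if __name__ == "__main__":
    baseline = run_eval(baseline_agent, CASES)
    candidate = run_eval(candidate_agent, CASES)
    print("candidate pass rate:", pass_rate(candidate))
    print("regressed cases:", regressions(baseline, candidate))
    print("exit code at 0.8 threshold:", gate(candidate, 0.8))
```

The `regressions` list is the part a human would inspect: a case that flips from pass to fail may reflect a real regression, an overly strict grader, or a changed requirement, which is why the exit code is treated here as decision support and not as a verdict.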