# Agent Tool Use Evaluation

Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured

## TL;DR

Agent tool-use evaluation checks not only the final answer, but whether the agent chose the right tools, passed valid arguments, handled errors, and avoided unsafe side effects.

## Core Explanation

A tool-using agent can fail even when its final text sounds plausible. It might skip retrieval, call the wrong API, over-broaden permissions, retry a non-idempotent operation, or ignore a tool error. Evaluation should inspect traces, tool arguments, returned artifacts, and final outputs together. Trace-level checks are especially useful for catching silent workflow regressions before production.

## Source-Mapped Facts

- OpenAI agent evals documentation says agent evals can grade final outputs and traces. ([source](https://platform.openai.com/docs/guides/agent-evals))
- OpenAI trace grading documentation describes grading traces to evaluate intermediate steps in an agent run. ([source](https://platform.openai.com/docs/guides/trace-grading))
- LangSmith evaluation documentation describes multiple evaluation approaches, including reference-free, reference-based, and pairwise evaluation. ([source](https://docs.langchain.com/langsmith/evaluation-approaches))

## Further Reading

- [OpenAI Agent Evals](https://platform.openai.com/docs/guides/agent-evals)
- [OpenAI Trace Grading](https://platform.openai.com/docs/guides/trace-grading)
- [LangSmith Evaluation Approaches](https://docs.langchain.com/langsmith/evaluation-approaches)
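
## Illustrative Sketch

The failure modes listed under Core Explanation (a skipped tool, invalid arguments, a retried non-idempotent operation, an ignored tool error) can be checked mechanically once a run is recorded as a list of tool calls. The following is a minimal sketch under stated assumptions, not the API of any evaluation product: the `ToolCall` record, the `TOOL_SPECS` registry, the tool names `search_docs` and `send_email`, and the `grade_trace` function are all hypothetical, and the "ignored error" rule is a simple heuristic (a trace that ends on a tool error is treated as unhandled).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    """One recorded step of an agent run (hypothetical trace format)."""
    name: str
    args: dict[str, Any]
    error: Optional[str] = None  # error text returned by the tool, if any


# Hypothetical tool registry: required argument names and idempotency.
TOOL_SPECS = {
    "search_docs": {"required": {"query"}, "idempotent": True},
    "send_email": {"required": {"to", "body"}, "idempotent": False},
}


def grade_trace(trace: list[ToolCall], required_tools: set[str]) -> list[str]:
    """Return human-readable failures; an empty list means the trace passed."""
    failures: list[str] = []

    # 1. Did the agent call every tool the task requires (e.g. retrieval)?
    called = {call.name for call in trace}
    for missing in sorted(required_tools - called):
        failures.append(f"missing required tool call: {missing}")

    seen_side_effects: set[tuple[str, str]] = set()
    for i, call in enumerate(trace):
        spec = TOOL_SPECS.get(call.name)
        if spec is None:
            failures.append(f"step {i}: unknown tool {call.name}")
            continue

        # 2. Were all required arguments supplied?
        absent = spec["required"] - set(call.args)
        if absent:
            failures.append(f"step {i}: {call.name} missing args {sorted(absent)}")

        # 3. Was a non-idempotent call repeated with identical arguments?
        if not spec["idempotent"]:
            key = (call.name, repr(sorted(call.args.items())))
            if key in seen_side_effects:
                failures.append(f"step {i}: repeated non-idempotent call to {call.name}")
            seen_side_effects.add(key)

        # 4. Heuristic: an error on the final step was never followed up.
        if call.error is not None and i == len(trace) - 1:
            failures.append(f"step {i}: trace ends on unhandled error from {call.name}")

    return failures


if __name__ == "__main__":
    email = {"to": "a@example.com", "body": "hi"}
    risky = [ToolCall("send_email", email, error="timeout"), ToolCall("send_email", email)]
    for failure in grade_trace(risky, required_tools={"search_docs"}):
        print(failure)
```

Returning a list of failures rather than a single pass/fail flag keeps each regression attributable to a specific step. In practice, checks like these would sit alongside a grader for the final output instead of replacing it.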