# Agent Tool Use Evaluation

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Agent tool-use evaluation checks not only the final answer but also whether the agent chose the right tools, passed valid arguments, handled errors, and avoided unsafe side effects.

## Core Explanation

A tool-using agent can fail even when its final text sounds plausible. It might skip retrieval, call the wrong API, request broader permissions than the task needs, retry a non-idempotent operation, or ignore a tool error.

Evaluation should inspect traces, tool arguments, returned artifacts, and final outputs together. Trace-level checks are especially useful for catching silent workflow regressions, where the final answer still looks fine but the steps behind it have changed, before they reach production.
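As a rough illustration of inspecting a trace alongside the final output, the sketch below runs a few deterministic checks over a recorded agent run. Everything in it is an assumption made for the example: the `ToolCall` and `Trace` shapes, the tool names `search_docs` and `send_refund`, and the `REQUIRED_ARGS` and `NON_IDEMPOTENT` policies are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    """One step in a hypothetical agent trace."""
    name: str
    args: dict[str, Any]
    error: Optional[str] = None  # set when the tool returned an error


@dataclass
class Trace:
    """A recorded run: ordered tool calls plus the final text."""
    calls: list[ToolCall]
    final_output: str


# Hypothetical policy tables, invented for this example.
REQUIRED_ARGS = {"search_docs": {"query"}, "send_refund": {"order_id", "amount"}}
NON_IDEMPOTENT = {"send_refund"}


def check_trace(trace: Trace, expected_tools: list[str]) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures: list[str] = []
    called = [c.name for c in trace.calls]

    # Tool selection: every expected tool must appear at least once.
    for tool in expected_tools:
        if tool not in called:
            failures.append(f"missing expected tool call: {tool}")

    seen: set[tuple[str, str]] = set()
    last = len(trace.calls) - 1
    for i, call in enumerate(trace.calls):
        # Argument validity: all required keys must be present.
        missing = REQUIRED_ARGS.get(call.name, set()) - set(call.args)
        if missing:
            failures.append(f"step {i}: {call.name} missing args {sorted(missing)}")

        # Unsafe retry: the same non-idempotent call repeated with the same args.
        if call.name in NON_IDEMPOTENT:
            key = (call.name, repr(sorted(call.args.items())))
            if key in seen:
                failures.append(f"step {i}: repeated non-idempotent call {call.name}")
            seen.add(key)

        # Error handling: an error on the last step was never followed up.
        if call.error and i == last:
            failures.append(f"step {i}: tool error ignored before final output")

    return failures


# A run whose final text sounds plausible but whose trace has three problems.
trace = Trace(
    calls=[
        ToolCall("send_refund", {"order_id": "A1", "amount": 20}),
        ToolCall("send_refund", {"order_id": "A1", "amount": 20}, error="timeout"),
    ],
    final_output="Your refund has been issued.",
)
failures = check_trace(trace, expected_tools=["search_docs", "send_refund"])
for f in failures:
    print(f)
```

The checks here are deliberately simple and rule-based. In practice, they would likely sit next to graders that score the final output, since neither view alone covers every failure mode listed above.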

## Source-Mapped Facts

- OpenAI agent evals documentation says agent evals can grade final outputs and traces. ([source](https://platform.openai.com/docs/guides/agent-evals))
- OpenAI trace grading documentation describes grading traces to evaluate intermediate steps in an agent run. ([source](https://platform.openai.com/docs/guides/trace-grading))
- LangSmith evaluation documentation describes multiple evaluation approaches, including reference-free, reference-based, and pairwise evaluation. ([source](https://docs.langchain.com/langsmith/evaluation-approaches))
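To make the three approaches named in the last item concrete for tool use, here is a minimal sketch in plain Python. It does not use the LangSmith API. The function names, the tool names, and the "fewer calls wins" preference rule are all assumptions chosen for illustration; a real pairwise comparison would more likely rely on a human or model judge.

```python
def reference_free(tools_used: list[str], allowed: set[str]) -> bool:
    """Reference-free: no labeled answer needed, only a rule the run must satisfy."""
    return all(tool in allowed for tool in tools_used)


def reference_based(tools_used: list[str], expected: list[str]) -> bool:
    """Reference-based: compare the run against a labeled expected tool sequence."""
    return tools_used == expected


def pairwise(run_a: list[str], run_b: list[str]) -> str:
    """Pairwise: compare two runs directly (toy rule: fewer tool calls wins)."""
    if len(run_a) == len(run_b):
        return "tie"
    return "a" if len(run_a) < len(run_b) else "b"


# Hypothetical tool sequences from two versions of the same agent.
old_run = ["search_docs", "search_docs", "send_refund"]
new_run = ["search_docs", "send_refund"]

print(reference_free(new_run, allowed={"search_docs", "send_refund"}))
print(reference_based(new_run, expected=["search_docs", "send_refund"]))
print(pairwise(old_run, new_run))
```

The reference-based check needs a labeled dataset, while the other two do not, which is one reason they are often easier to apply to live traffic.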

## Further Reading

- [OpenAI Agent Evals](https://platform.openai.com/docs/guides/agent-evals)
- [OpenAI Trace Grading](https://platform.openai.com/docs/guides/trace-grading)
- [LangSmith Evaluation Approaches](https://docs.langchain.com/langsmith/evaluation-approaches)