LLM Evaluation Error Taxonomy and Failure Labels

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

LLM evaluation failure labels turn raw examples into debuggable patterns such as retrieval miss, unsupported claim, unsafe answer, format error, or tool misuse.

## Core Explanation

A pass rate alone shows how often an LLM system fails, not why. Teams need an error taxonomy that separates failure types, affected slices, severity, and likely ownership. With those dimensions in place, agents can route fixes toward retrieval, prompting, tool schemas, model choice, safety policy, or product requirements.
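
As a minimal sketch of such a taxonomy, the Python below models a failure record with a label, severity, and slice, then counts failures per owning area. The label names come from the TL;DR. The `OWNER_BY_LABEL` mapping, the `Severity` levels, the slice names, and the sample records are illustrative assumptions, not part of any cited tool.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureLabel(Enum):
    RETRIEVAL_MISS = "retrieval_miss"
    UNSUPPORTED_CLAIM = "unsupported_claim"
    UNSAFE_ANSWER = "unsafe_answer"
    FORMAT_ERROR = "format_error"
    TOOL_MISUSE = "tool_misuse"


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Assumption: one plausible label-to-owner mapping; real teams will differ.
OWNER_BY_LABEL = {
    FailureLabel.RETRIEVAL_MISS: "retrieval",
    FailureLabel.UNSUPPORTED_CLAIM: "prompting",
    FailureLabel.UNSAFE_ANSWER: "safety_policy",
    FailureLabel.FORMAT_ERROR: "prompting",
    FailureLabel.TOOL_MISUSE: "tool_schemas",
}


@dataclass(frozen=True)
class LabeledFailure:
    example_id: str
    label: FailureLabel
    severity: Severity
    slice_name: str  # hypothetical slice, e.g. a product area or query type


def route_failures(failures):
    """Count failures per (owner, label) pair so each owner gets its own queue."""
    counts = Counter()
    for failure in failures:
        counts[(OWNER_BY_LABEL[failure.label], failure.label.value)] += 1
    return counts


# Made-up sample records for illustration only.
failures = [
    LabeledFailure("ex-001", FailureLabel.RETRIEVAL_MISS, Severity.MEDIUM, "billing"),
    LabeledFailure("ex-002", FailureLabel.RETRIEVAL_MISS, Severity.LOW, "billing"),
    LabeledFailure("ex-003", FailureLabel.UNSAFE_ANSWER, Severity.HIGH, "medical"),
    LabeledFailure("ex-004", FailureLabel.FORMAT_ERROR, Severity.LOW, "export"),
]

routed = route_failures(failures)
high_severity_ids = [f.example_id for f in failures if f.severity is Severity.HIGH]
```

Keeping the label set in an enum means a typo in a label fails loudly instead of silently creating a new category, which is one way to keep labels stable across runs.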

Agents should preserve example IDs, labels, reviewer notes, evaluator versions, and severity mappings for every evaluation run. Without stable labels, a regression dashboard can show that quality dropped but cannot explain which failure type grew or which examples changed.
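
The sketch below shows one way preserved records could explain a regression: it compares two runs by label count and lists the examples whose label changed. The `EvalRecord` fields mirror the items named above (severity is omitted for brevity). The function names, the `"pass"` label, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    example_id: str
    label: str              # "pass" or a failure label (assumed convention)
    reviewer_note: str
    evaluator_version: str


def same_evaluator(baseline, candidate):
    """True when both runs were scored by the same single evaluator version."""
    versions = {r.evaluator_version for r in baseline + candidate}
    return len(versions) == 1


def label_deltas(baseline, candidate):
    """Per-label count change (candidate minus baseline), nonzero entries only."""
    before = Counter(r.label for r in baseline)
    after = Counter(r.label for r in candidate)
    return {
        label: after[label] - before[label]
        for label in sorted(set(before) | set(after))
        if after[label] != before[label]
    }


def changed_examples(baseline, candidate):
    """IDs present in both runs whose label differs between them."""
    before = {r.example_id: r.label for r in baseline}
    return sorted(
        r.example_id
        for r in candidate
        if r.example_id in before and before[r.example_id] != r.label
    )


# Made-up runs for illustration only.
baseline = [
    EvalRecord("ex-1", "pass", "", "v1"),
    EvalRecord("ex-2", "retrieval_miss", "doc not in top-k", "v1"),
    EvalRecord("ex-3", "pass", "", "v1"),
]
candidate = [
    EvalRecord("ex-1", "retrieval_miss", "index missing new doc", "v1"),
    EvalRecord("ex-2", "retrieval_miss", "doc not in top-k", "v1"),
    EvalRecord("ex-3", "format_error", "invalid JSON", "v1"),
]

comparable = same_evaluator(baseline, candidate)
deltas = label_deltas(baseline, candidate)
changed = changed_examples(baseline, candidate)
```

Checking the evaluator version first matters because a label shift between runs scored by different evaluator versions may reflect a change in the evaluator, not in the system under test.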

## Source-Mapped Facts

- LangSmith documentation describes evaluators as scoring application outputs over datasets. ([source](https://docs.langchain.com/langsmith/evaluation-concepts))
- Ragas documentation describes available metrics for evaluating RAG and LLM applications. ([source](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/))
- Arize AX documentation describes human review workflows for evaluating model outputs. ([source](https://arize.com/docs/ax/evaluate))

## Further Reading

- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [Ragas Available Metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/)
- [Arize AX Human Review](https://arize.com/docs/ax/evaluate)