LLM Evaluation Assertions and Test Cases

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

LLM evaluation test cases need explicit inputs, expected behavior, assertions, thresholds, and metrics so failures can be reproduced and repaired.

## Core Explanation

An eval case is more than a prompt. It records the variables, expected output or rubric, assertion type, threshold, metric name, and sometimes a custom scoring function. Deterministic assertions catch schema, substring, regex, refusal, and latency failures; model-assisted metrics can grade relevance, faithfulness, factuality, or trajectory behavior.
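The fields listed above can be sketched as a small data structure. This is a minimal illustration in plain Python, not the Promptfoo or Ragas API: the names `EvalCase`, `Assertion`, `run_case`, and the helper checks are all hypothetical, and only deterministic assertions are shown.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Assertion:
    kind: str                       # label for the check, e.g. "contains" (hypothetical naming)
    metric: str                     # metric name reported alongside the result
    check: Callable[[str], float]   # scoring function returning a value in [0, 1]
    threshold: float = 1.0          # minimum score required to pass


@dataclass
class EvalCase:
    name: str
    variables: Dict[str, str]       # values substituted into the prompt template
    assertions: List[Assertion] = field(default_factory=list)


def contains(substring: str) -> Callable[[str], float]:
    """Deterministic substring check."""
    return lambda output: 1.0 if substring in output else 0.0


def matches(pattern: str) -> Callable[[str], float]:
    """Deterministic regex check."""
    compiled = re.compile(pattern)
    return lambda output: 1.0 if compiled.search(output) else 0.0


def is_json(output: str) -> float:
    """Deterministic schema-style check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def run_case(case: EvalCase, output: str) -> List[dict]:
    """Score one model output against every assertion in the case."""
    results = []
    for a in case.assertions:
        score = a.check(output)
        results.append({
            "case": case.name,
            "kind": a.kind,
            "metric": a.metric,
            "score": score,
            "threshold": a.threshold,
            "passed": score >= a.threshold,
        })
    return results


case = EvalCase(
    name="capital-of-france",
    variables={"country": "France"},
    assertions=[
        Assertion("is-json", "schema", is_json),
        Assertion("contains", "answer_present", contains("Paris")),
        Assertion("regex", "answer_key", matches(r'"answer"\s*:')),
    ],
)
```

Calling `run_case(case, output)` with a model output returns one record per assertion, so a failure can be traced to a specific metric and threshold. A model-assisted metric would slot into the same shape by supplying a `check` function that calls a grader, which this sketch leaves out.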

Agents should read eval definitions before changing prompts or tools. A failing assertion usually identifies which contract broke, whereas an aggregate score alone often hides the failing behavior.
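The gap between an aggregate score and per-assertion results can be illustrated with a small sketch. The result records and their field names below are made-up sample data for illustration, not output from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-assertion results from one eval run.
results = [
    {"case": "capital-of-france", "metric": "schema", "score": 1.0, "threshold": 1.0},
    {"case": "capital-of-france", "metric": "answer_present", "score": 1.0, "threshold": 1.0},
    {"case": "refund-policy", "metric": "schema", "score": 0.0, "threshold": 1.0},
    {"case": "refund-policy", "metric": "answer_present", "score": 1.0, "threshold": 1.0},
    {"case": "greeting", "metric": "answer_present", "score": 1.0, "threshold": 1.0},
]


def aggregate_score(results):
    """Single number across all assertions; loses which check failed."""
    return mean(r["score"] for r in results)


def failures_by_metric(results):
    """Group failing cases under the metric whose threshold they missed."""
    failed = defaultdict(list)
    for r in results:
        if r["score"] < r["threshold"]:
            failed[r["metric"]].append(r["case"])
    return dict(failed)


print(aggregate_score(results))
print(failures_by_metric(results))
```

Here the aggregate looks healthy at four passes out of five, while the grouped view points directly at the `schema` contract on the `refund-policy` case, which is the detail needed before editing a prompt or tool.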

## Source-Mapped Facts

- Promptfoo documentation says assertions compare LLM output against expected values or conditions. ([source](https://www.promptfoo.dev/docs/configuration/expected-outputs/))
- Promptfoo documentation says a test case can include an assert property containing an array of assertion objects. ([source](https://www.promptfoo.dev/docs/configuration/expected-outputs/))
- Ragas documentation lists the metrics available for evaluating LLM and RAG systems. ([source](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/))

## Further Reading

- [Promptfoo Assertions and Metrics](https://www.promptfoo.dev/docs/configuration/expected-outputs/)
- [Ragas Available Metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/)