# LLM Evaluation Golden Datasets and Regression Tests
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured


## TL;DR

Golden datasets and regression tests let teams detect when a prompt, model, tool, or retrieval change breaks expected LLM behavior.

## Core Explanation

LLM systems need repeatable evaluations because manual spot checks are unsystematic and can miss regressions. A golden dataset contains representative examples, expected outputs or grading criteria, metadata, and historical scores. Version it together with the prompt, model, retrieval configuration, and tool schema, so that a change in score can be traced to a specific change in the system.
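One way to picture the structure described above is a small record type per example plus a version record that ties the dataset to the system configuration. This is a minimal sketch, not a format prescribed by any of the cited tools: the class names `GoldenExample` and `DatasetVersion`, every field name, and the `fingerprint` helper are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GoldenExample:
    """One golden example (hypothetical schema)."""
    example_id: str
    input_text: str
    expected_output: Optional[str] = None   # exact reference answer, if one exists
    grading_criteria: Optional[str] = None  # rubric to use when there is no exact answer
    tags: Tuple[str, ...] = ()              # free-form metadata, e.g. ("billing", "edge-case")


@dataclass(frozen=True)
class DatasetVersion:
    """Golden examples plus the configuration they were scored against."""
    prompt_version: str
    model_name: str
    retrieval_config: str
    tool_schema_version: str
    examples: Tuple[GoldenExample, ...] = ()

    def fingerprint(self) -> str:
        # Hash the full content so that any change to the examples or to the
        # configuration yields a different identifier.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    version = DatasetVersion(
        prompt_version="prompt-v3",          # assumed naming scheme
        model_name="example-model",          # placeholder, not a real model name
        retrieval_config="top_k=5",
        tool_schema_version="tools-v1",
        examples=(
            GoldenExample("ex-001", "What is the refund window?",
                          grading_criteria="Mentions the refund window and cites the policy."),
        ),
    )
    print(version.fingerprint())
```

Historical scores could then be stored keyed by the fingerprint, so a score is always read against the exact dataset and configuration that produced it.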

Agents should distinguish exploratory evals from release gates. A release gate needs stable examples, a deterministic evaluation policy where possible, and documented thresholds for blocking or accepting a change.
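A release gate of the kind described above can be sketched as a pure function over per-example pass/fail results. This is an illustrative sketch under stated assumptions: the names `GatePolicy`, `GateDecision`, and `release_gate` are hypothetical, and the threshold values are placeholders rather than recommendations. It assumes each example has already been graded to a boolean.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GatePolicy:
    """Documented thresholds for the gate (placeholder values)."""
    min_pass_rate: float = 0.95   # absolute floor for the candidate
    max_regression: float = 0.02  # largest allowed drop relative to the baseline


@dataclass(frozen=True)
class GateDecision:
    blocked: bool
    reasons: List[str]
    newly_failing: List[str]


def pass_rate(results: Dict[str, bool]) -> float:
    if not results:
        raise ValueError("no results to evaluate")
    return sum(results.values()) / len(results)


def release_gate(baseline: Dict[str, bool],
                 candidate: Dict[str, bool],
                 policy: GatePolicy = GatePolicy()) -> GateDecision:
    # A gate needs stable examples: refuse to compare if the candidate run
    # skipped any example that the baseline covered.
    missing = set(baseline) - set(candidate)
    if missing:
        raise ValueError(f"candidate run is missing examples: {sorted(missing)}")

    base_rate = pass_rate(baseline)
    cand_rate = pass_rate(candidate)

    # Examples that passed before and fail now are the regressions to inspect.
    newly_failing = sorted(
        ex_id for ex_id, passed in baseline.items()
        if passed and not candidate[ex_id]
    )

    reasons = []
    if cand_rate < policy.min_pass_rate:
        reasons.append(f"pass rate {cand_rate:.3f} is below floor {policy.min_pass_rate:.3f}")
    if base_rate - cand_rate > policy.max_regression:
        reasons.append(f"pass rate dropped by {base_rate - cand_rate:.3f} versus baseline")

    return GateDecision(blocked=bool(reasons), reasons=reasons, newly_failing=newly_failing)
```

Keeping the thresholds in a single policy object makes them easy to document and review alongside the dataset. An exploratory eval can call the same scoring code without enforcing the decision.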

## Source-Mapped Facts

- OpenAI evaluation documentation describes datasets as collections of examples that can be used in evals. ([source](https://platform.openai.com/docs/guides/evaluation-getting-started))
- Vertex AI documentation describes evaluation datasets for Gen AI evaluation. ([source](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-dataset))
- LangSmith documentation describes pointwise, pairwise (comparative), and summary evaluation types. ([source](https://docs.langchain.com/langsmith/evaluation-types))

## Further Reading

- [OpenAI Evaluation Getting Started](https://platform.openai.com/docs/guides/evaluation-getting-started)
- [Vertex AI Evaluation Dataset](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-dataset)
- [LangSmith Evaluation Types](https://docs.langchain.com/langsmith/evaluation-types)