# LLM Regression Testing
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured


## TL;DR

LLM regression testing reruns representative prompts, tool workflows, or RAG questions after a prompt, model, retrieval, or tool change to catch quality drops before deployment.

## Core Explanation

Traditional unit tests often assert exact values. LLM output can vary in wording from run to run, so exact-match assertions alone tend to be brittle. LLM regression tests therefore usually combine curated datasets, deterministic assertions, rubric-based graders, reference answers, and comparison runs. The intent is to catch a prompt, model, retrieval, or tool change that makes known tasks worse.
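As an illustration of how deterministic assertions and a comparison run fit together, here is a minimal sketch in plain Python. Everything in it is hypothetical: the `Case` schema, the `check` rules, and the two stub functions standing in for a baseline and a candidate model are assumptions made for the example, not the API of LangSmith, OpenAI evals, or Promptfoo. Rubric-based graders and reference answers are left out to keep it short.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Case:
    """One curated regression case (hypothetical schema)."""
    case_id: str
    prompt: str
    must_contain: Tuple[str, ...] = ()    # substrings the answer must include
    must_not_match: Optional[str] = None  # regex the answer must not match


def check(case: Case, output: str) -> bool:
    """Deterministic assertions: required substrings and a forbidden pattern."""
    text = output.lower()
    if any(s.lower() not in text for s in case.must_contain):
        return False
    if case.must_not_match and re.search(case.must_not_match, output, re.IGNORECASE):
        return False
    return True


def run_suite(generate: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Run every case through one system version and record pass/fail."""
    return {c.case_id: check(c, generate(c.prompt)) for c in cases}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Comparison run: cases the baseline passed that the candidate now fails."""
    return sorted(cid for cid, ok in baseline.items()
                  if ok and not candidate.get(cid, False))


CASES = [
    Case("refund-window", "How long do I have to request a refund?",
         must_contain=("30 days",)),
    Case("no-legal-advice", "Can I sue my landlord?",
         must_not_match=r"\byou should sue\b"),
]


# Stub models for illustration only; a real suite would call the deployed system.
def baseline_model(prompt: str) -> str:
    if "refund" in prompt:
        return "You can request a refund within 30 days of purchase."
    return "I can't give legal advice, but a tenants' association may help."


def candidate_model(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are available for a limited time."
    return "I can't give legal advice, but a tenants' association may help."


baseline_results = run_suite(baseline_model, CASES)
candidate_results = run_suite(candidate_model, CASES)
print(regressions(baseline_results, candidate_results))
```

Comparing against a stored baseline, rather than only checking an absolute pass rate, separates newly broken cases from cases that were already failing before the change.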

Regression testing is most useful when the dataset is specific to the product. A general benchmark can show broad capability, but a release gate should include the organization's own workflows, unsupported cases (requests the product should decline or redirect), policy boundaries, and known historical failures.
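One way to turn those dataset categories into a release gate is to tag each case and require a minimum pass rate per tag. The sketch below assumes hypothetical category names and thresholds; the specific numbers are placeholders, and a team would choose its own.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical minimum pass rates per dataset category (placeholder values).
THRESHOLDS: Dict[str, float] = {
    "core_workflow": 0.9,
    "unsupported_case": 0.9,
    "policy_boundary": 1.0,
    "historical_failure": 1.0,
}


def pass_rates(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate per category from (category, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: passed[c] / totals[c] for c in totals}


def gate(results: List[Tuple[str, bool]],
         thresholds: Dict[str, float] = THRESHOLDS) -> Tuple[bool, List[str]]:
    """Return (release_ok, failing_categories).

    A category with no results counts as a 0.0 pass rate, so an empty
    category blocks the release instead of passing silently.
    """
    rates = pass_rates(results)
    failing = sorted(c for c, minimum in thresholds.items()
                     if rates.get(c, 0.0) < minimum)
    return (not failing, failing)


example_results = [
    ("core_workflow", True),
    ("core_workflow", True),
    ("unsupported_case", True),
    ("policy_boundary", True),
    ("historical_failure", True),
    ("historical_failure", False),  # a previously fixed bug has resurfaced
]
print(gate(example_results))
```

Setting the historical-failure and policy categories to 1.0 in this sketch reflects one possible design choice: a known past failure that reappears blocks the release even when the aggregate pass rate looks healthy.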

## Source-Mapped Facts

- LangSmith evaluation documentation says offline evaluation runs on curated datasets during development to compare versions, benchmark performance, and catch regressions. ([source](https://docs.langchain.com/langsmith/evaluation))
- OpenAI evals documentation describes evals as tasks used to measure model behavior and compare performance across models and prompts. ([source](https://developers.openai.com/api/docs/guides/evals))
- Promptfoo documentation describes it as a tool for testing and evaluating LLM outputs. ([source](https://www.promptfoo.dev/docs/intro/))

## Further Reading

- [LangSmith Evaluation](https://docs.langchain.com/langsmith/evaluation)
- [OpenAI Evals](https://developers.openai.com/api/docs/guides/evals)
- [Promptfoo Intro](https://www.promptfoo.dev/docs/intro/)