# LLM Regression Testing

- Status: public
- Confidence: medium (0.725) (verified)
- Last verified: 2026-06-02
- Generation: ai_structured

## TL;DR

LLM regression testing reruns representative prompts, tool workflows, or RAG questions after a change to catch quality drops before deployment.

## Core Explanation

Traditional unit tests often assert exact values. LLM regression tests usually combine curated datasets, deterministic assertions, rubric-based graders, reference answers, and comparison runs. The intent is to catch a prompt, model, retrieval, or tool change that makes known tasks worse.

Regression testing is most useful when the dataset is specific to the product. A general benchmark can show broad capability, but a release gate should include the organization's own workflows, unsupported cases, policy boundaries, and known historical failures.

## Source-Mapped Facts

- LangSmith evaluation documentation says offline evaluation runs on curated datasets during development to compare versions, benchmark performance, and catch regressions. ([source](https://docs.langchain.com/langsmith/evaluation))
- OpenAI evals documentation describes evals as tasks used to measure model behavior and compare performance across models and prompts. ([source](https://developers.openai.com/api/docs/guides/evals))
- Promptfoo documentation describes it as a tool for testing and evaluating LLM outputs. ([source](https://www.promptfoo.dev/docs/intro/))

## Further Reading

- [LangSmith Evaluation](https://docs.langchain.com/langsmith/evaluation)
- [OpenAI Evals](https://developers.openai.com/api/docs/guides/evals)
- [Promptfoo Intro](https://www.promptfoo.dev/docs/intro/)
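
## Illustrative Sketch

The Core Explanation describes combining a curated dataset, deterministic assertions, and a comparison run. The following is a minimal sketch of that idea in plain Python, not the API of LangSmith, OpenAI evals, or Promptfoo. Everything in it is hypothetical: the `Case` schema, the function names, the example cases, and the two stand-in "systems", which return canned strings where a real setup would call the model or pipeline under test. Rubric-based graders are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    """One curated regression case (hypothetical schema)."""
    case_id: str
    prompt: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def passes(case: Case, output: str) -> bool:
    """Deterministic assertion: required substrings present, forbidden ones absent."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def run_suite(cases: list[Case], generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through one system version and record pass/fail per case."""
    return {c.case_id: passes(c, generate(c.prompt)) for c in cases}


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Comparison run: cases that passed on the baseline but fail on the candidate."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )


# Product-specific dataset: a normal workflow and an unsupported case (both invented).
CASES = [
    Case("refund-window", "What is the refund window?", must_contain=["30 days"]),
    Case(
        "unsupported-legal",
        "Can you give me legal advice?",
        must_contain=["cannot"],
        must_not_contain=["you should sue"],
    ),
]


def baseline_system(prompt: str) -> str:
    """Stand-in for the currently deployed version (canned answers, assumption)."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days of purchase."
    return "I cannot provide legal advice."


def candidate_system(prompt: str) -> str:
    """Stand-in for the changed version; the refund answer has drifted."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 14 days of purchase."
    return "I cannot provide legal advice."


baseline_results = run_suite(CASES, baseline_system)
candidate_results = run_suite(CASES, candidate_system)
regressions = find_regressions(baseline_results, candidate_results)
print("Regressed cases:", regressions)
```

A release gate built on this sketch would block deployment whenever `find_regressions` returns a non-empty list. Comparing per-case results, rather than only an aggregate pass rate, keeps a new failure on a known task visible even when other cases improve.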