LLM Evaluation Slice-Based Regression Analysis

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Slice-based regression analysis checks whether an LLM system got worse for a specific task segment even when the aggregate eval score looks stable.

## Core Explanation

Aggregate scores can hide failures. A model can improve on easy examples while regressing on long-context tasks, tool calls, refusals, multilingual prompts, specific customers, or high-value workflows. Slice-based analysis tags each evaluation row with dimensions such as task type, difficulty, source domain, policy category, language, and tool path, then computes the score separately for each tag value so that a drop in one segment stays visible.
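To make the masking effect concrete, here is a minimal sketch in Python. The row layout (`tags`, `score`), the helper names, and the toy scores are all assumptions for illustration and do not come from any particular eval framework. It shows a candidate whose aggregate score matches the baseline while one slice collapses.

```python
from collections import defaultdict


def mean_score(rows):
    """Aggregate score: plain mean over all evaluation rows."""
    return sum(row["score"] for row in rows) / len(rows)


def slice_scores(rows, dimension):
    """Mean score for each value of one slice dimension (e.g. task_type)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["tags"][dimension]].append(row["score"])
    return {value: sum(scores) / len(scores) for value, scores in buckets.items()}


def make_rows(scores_by_task):
    """Build tagged rows from {task_type: [scores]} (hypothetical toy data)."""
    return [
        {"tags": {"task_type": task}, "score": score}
        for task, scores in scores_by_task.items()
        for score in scores
    ]


# Hypothetical results: same number of passes overall, distributed differently.
baseline = make_rows({"short_qa": [1.0, 1.0, 0.0, 0.0], "long_context": [1.0, 1.0]})
candidate = make_rows({"short_qa": [1.0, 1.0, 1.0, 1.0], "long_context": [0.0, 0.0]})

base_slices = slice_scores(baseline, "task_type")
cand_slices = slice_scores(candidate, "task_type")

# Per-slice delta: candidate minus baseline, so negative means regression.
deltas = {name: cand_slices[name] - base_slices[name] for name in base_slices}

print("aggregate:", mean_score(baseline), "->", mean_score(candidate))
print("slice deltas:", deltas)
```

In this toy data both runs pass four of six rows, so the aggregate is unchanged, yet the `long_context` slice falls from a perfect score to zero. Real pipelines would typically also track the row count per slice, since small slices produce noisy deltas.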

Agents and evaluation pipelines should report both the global score and per-slice deltas against the baseline. A release should be blocked when a critical slice regresses beyond an agreed tolerance, even if the total score still passes.
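One way such a gate could look is sketched below. The report shape (`global`, `slices`), the `release_gate` function, the slice names, and every threshold are hypothetical choices for illustration; the actual tolerances are whatever a team has agreed on.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)


def release_gate(baseline, candidate, min_global, critical_tolerances):
    """Block a release on a low global score or a critical-slice regression.

    baseline / candidate: {"global": float, "slices": {slice_name: score}}
    critical_tolerances: {slice_name: largest allowed drop in score}
    """
    reasons = []

    if candidate["global"] < min_global:
        reasons.append(
            f"global score {candidate['global']:.3f} is below {min_global:.3f}"
        )

    for name, tolerance in critical_tolerances.items():
        # A critical slice that was not measured cannot be cleared.
        if name not in baseline["slices"] or name not in candidate["slices"]:
            reasons.append(f"critical slice {name!r} is missing from a report")
            continue
        drop = baseline["slices"][name] - candidate["slices"][name]
        if drop > tolerance:
            reasons.append(
                f"critical slice {name!r} dropped by {drop:.3f} "
                f"(tolerance {tolerance:.3f})"
            )

    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical reports: the global score improves, one critical slice regresses.
baseline_report = {"global": 0.82, "slices": {"tool_calls": 0.80, "multilingual": 0.75}}
candidate_report = {"global": 0.83, "slices": {"tool_calls": 0.70, "multilingual": 0.76}}

result = release_gate(
    baseline_report,
    candidate_report,
    min_global=0.80,
    critical_tolerances={"tool_calls": 0.02, "multilingual": 0.03},
)
print(result.passed)
for reason in result.reasons:
    print(reason)
```

Here the candidate clears the global threshold but is still blocked, because `tool_calls` drops by more than its tolerance. Treating a missing critical slice as a failure is a deliberate design choice in this sketch: a slice that silently disappears from the eval set would otherwise pass unnoticed.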

## Source-Mapped Facts

- OpenAI Evals API documentation describes an evaluation as testing criteria plus a data source configuration that defines the schema of data used in the evaluation. ([source](https://platform.openai.com/docs/api-reference/evals))
- The TensorFlow blog post introducing TensorFlow Model Analysis describes slicing metrics to analyze model performance on more granular segments of an evaluation dataset. ([source](https://blog.tensorflow.org/2018/03/introducing-tensorflow-model-analysis.html))
- OpenAI's SWE-bench Verified announcement says the released human annotations enable slicing the dataset by difficulty. ([source](https://openai.com/index/introducing-swe-bench-verified/))

## Further Reading

- [OpenAI Evals API Reference](https://platform.openai.com/docs/api-reference/evals)
- [Introducing TensorFlow Model Analysis](https://blog.tensorflow.org/2018/03/introducing-tensorflow-model-analysis.html)
- [Introducing SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/)