# LLM Evaluation Statistical Power and Minimum Detectable Effects
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-03
Generation: ai_structured


## TL;DR

LLM eval reports should state whether the sample is large enough to detect the quality change the team actually cares about.

## Core Explanation

A small eval set can catch catastrophic regressions, but it is often underpowered to distinguish a modest real improvement from noise. Statistical power is the probability of detecting a true effect. A power analysis ties it to effect size, sample size, and significance threshold, so fixing any three of the four determines the remaining one.
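
The relationship between effect size, sample size, significance threshold, and power can be sketched with a normal approximation. This is a minimal illustration using only the Python standard library, not a replacement for a proper power module. The function names `approx_power` and `approx_mde` are hypothetical, and the effect size is assumed to be standardized (mean difference divided by its standard deviation).

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def approx_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Two-sided power of a one-sample (or paired) z-test, normal approximation."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)
    # Probability that the test statistic lands in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def approx_mde(n: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate smallest standardized effect detectable at the given n, alpha, power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) / sqrt(n)


if __name__ == "__main__":
    for n in (50, 200, 1000):
        mde = approx_mde(n)
        print(f"n={n:5d}  MDE={mde:.3f}  power at MDE={approx_power(mde, n):.3f}")
```

Under this approximation the minimum detectable effect shrinks with the square root of the sample size, so halving it takes roughly four times as many examples.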

For LLM systems, paired designs are often stronger because the same examples can be run through both a baseline and a candidate. Per-example difficulty then largely cancels out of the comparison, and the test only has to resolve the variance of the per-example differences. Agents reviewing eval results should ask for the minimum detectable effect, sample count, metric definition, confidence interval, paired-test choice, and whether repeated judge calls or stochastic model outputs were averaged or controlled.
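
As a rough sketch of what such a report could contain, the example below runs a paired comparison on synthetic scores. All numbers are simulated for illustration and are not real eval results. The variable names and the 0.05 / 0.8 settings are assumptions. It uses `scipy.stats.ttest_rel` for the paired test, `scipy.stats.ttest_ind` as an unpaired contrast, and `TTestPower.solve_power` to estimate the minimum detectable effect.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

# Synthetic per-example scores (assumption: one score per example per system).
rng = np.random.default_rng(0)
n_examples = 200
baseline = rng.normal(0.70, 0.15, n_examples)
candidate = baseline + rng.normal(0.02, 0.05, n_examples)

# Paired test on the same examples vs. an unpaired test that ignores the pairing.
paired = stats.ttest_rel(candidate, baseline)
unpaired = stats.ttest_ind(candidate, baseline)

# 95% confidence interval for the mean per-example difference.
diff = candidate - baseline
sd_diff = diff.std(ddof=1)
sem = sd_diff / np.sqrt(n_examples)
t_crit = stats.t.ppf(0.975, df=n_examples - 1)
ci_low, ci_high = diff.mean() - t_crit * sem, diff.mean() + t_crit * sem

# Minimum detectable effect in standardized units (mean difference / sd of differences),
# then converted back to the metric's own scale.
mde_d = float(
    TTestPower().solve_power(
        effect_size=None, nobs=n_examples, alpha=0.05, power=0.8,
        alternative="two-sided",
    )
)
mde_raw = mde_d * sd_diff

print(f"paired p={paired.pvalue:.4g}, unpaired p={unpaired.pvalue:.4g}")
print(f"mean diff={diff.mean():.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
print(f"MDE: d={mde_d:.3f}, raw={mde_raw:.4f}")
```

On data like this, where most of the variance comes from example difficulty rather than from the systems, the paired test yields a much smaller p-value than the unpaired one for the same scores.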

## Source-Mapped Facts

- statsmodels documentation says the power module implements power and sample-size calculations for t-tests, normal-based tests, F-tests, and chi-square goodness-of-fit tests. ([source](https://www.statsmodels.org/stable/stats.html))
- statsmodels TTestPower.solve_power can solve for one parameter of the power of a one-sample t-test and can also be used for a paired t-test. ([source](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestPower.solve_power.html))
- SciPy ttest_rel calculates a t-test for two related samples and tests whether their average expected values are identical. ([source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html))
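
The same `TTestPower` class can be turned around to plan an eval before running it. The sketch below assumes a hypothetical target effect of 0.2 standardized units and conventional settings of alpha 0.05 and power 0.8. It leaves `nobs` unset so that `solve_power` returns the number of pairs.

```python
import math

from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Assumed planning inputs; replace with the effect the team actually cares about.
target_effect = 0.2  # standardized: mean paired difference / sd of differences
alpha = 0.05
target_power = 0.8

# Leave nobs as None so it is the parameter being solved for.
n_pairs = float(
    analysis.solve_power(
        effect_size=target_effect, nobs=None, alpha=alpha, power=target_power,
        alternative="two-sided",
    )
)
required_pairs = math.ceil(n_pairs)

# Check the power actually achieved at the rounded-up sample size.
achieved_power = analysis.power(
    effect_size=target_effect, nobs=required_pairs, alpha=alpha,
    alternative="two-sided",
)

print(f"required pairs: {required_pairs}, achieved power: {achieved_power:.3f}")
```

Rounding the solved sample size up rather than to the nearest integer keeps the achieved power at or above the target.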

## Further Reading

- [statsmodels Statistics](https://www.statsmodels.org/stable/stats.html)
- [statsmodels TTestPower.solve_power](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestPower.solve_power.html)
- [SciPy ttest_rel](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)