# LLM Evaluation Statistical Power and Minimum Detectable Effects

Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-03
Generation: ai_structured

## TL;DR

LLM eval reports should state whether the sample is large enough to detect the quality change the team actually cares about.

## Core Explanation

A small eval set can catch catastrophic regressions, but it may be too underpowered to distinguish a real improvement from noise. Statistical power connects effect size, sample size, significance threshold, and the probability of detecting a true effect. For LLM systems, paired designs are often stronger because the same examples can be run through a baseline and candidate.

Agents reviewing eval results should ask for the minimum detectable effect, sample count, metric definition, confidence interval, paired-test choice, and whether repeated judge calls or stochastic model outputs were averaged or controlled.

## Source-Mapped Facts

- statsmodels documentation says the power module implements power and sample-size calculations for t-tests, normal-based tests, F-tests, and chi-square goodness-of-fit tests. ([source](https://www.statsmodels.org/stable/stats.html))
- statsmodels TTestPower.solve_power can solve for one parameter of the power of a one-sample t-test and can also be used for a paired t-test. ([source](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestPower.solve_power.html))
- SciPy ttest_rel calculates a t-test for two related samples and tests whether their average expected values are identical. ([source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html))

## Further Reading

- [statsmodels Statistics](https://www.statsmodels.org/stable/stats.html)
- [statsmodels TTestPower.solve_power](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestPower.solve_power.html)
- [SciPy ttest_rel](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)
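
## Example: Minimum Detectable Effect for a Paired Eval

The relationship between sample size, significance threshold, power, and effect size described in the Core Explanation can be sketched with a normal approximation for a paired design. This is a rough illustration, not the statsmodels method: the helper names `mde_standardized` and `pairs_needed`, the eval size of 200, and the standard deviation of per-example score differences (`sd_diff = 0.45`) are all assumptions made up for this example. The t-based `TTestPower.solve_power` listed in the sources solves the same kind of problem and should give slightly more conservative answers at small sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal, used for the approximation


def mde_standardized(n_pairs, alpha=0.05, power=0.80):
    """Smallest standardized paired effect (mean difference divided by the
    SD of per-example differences) detectable with the requested power,
    under a two-sided normal approximation."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return (z_alpha + z_power) / sqrt(n_pairs)


def pairs_needed(raw_effect, sd_diff, alpha=0.05, power=0.80):
    """Approximate number of paired examples needed to detect `raw_effect`
    (in metric units) given the SD of per-example differences."""
    d = raw_effect / sd_diff  # standardized effect size
    z_sum = _Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)
    return ceil((z_sum / d) ** 2)


# Hypothetical eval: 200 prompts scored for both baseline and candidate.
n_pairs = 200
sd_diff = 0.45  # assumed SD of (candidate - baseline) per-example scores

d_min = mde_standardized(n_pairs)
print(f"standardized MDE at n={n_pairs}: {d_min:.3f}")
print(f"raw MDE in metric units: {d_min * sd_diff:.3f}")
print(f"pairs needed to detect a 0.03 gain: {pairs_needed(0.03, sd_diff)}")
```

Under these assumptions, the minimum detectable effect shrinks only with the square root of the sample count, so halving it takes roughly four times as many paired examples. In practice `sd_diff` would be estimated from a pilot run rather than guessed.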