# Evaluation Sampling and Confidence Intervals

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

An LLM benchmark score is an estimate from a finite sample. Reporting how the sample was drawn and a confidence interval around the score helps teams avoid overreacting to noisy differences.

## Core Explanation

An LLM evaluation score is an estimate based on a finite set of examples. Small score changes can be noise when the sample is small, the task mix changes, or the grader is unstable.
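
To make the effect of sample size concrete, here is a minimal sketch that computes a 95% Wilson score interval for the same observed pass rate at several sample sizes. The function name `wilson_interval` and the example counts are illustrative assumptions, not figures from any real benchmark.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed 80% pass rate at three hypothetical sample sizes.
for n in (50, 200, 2000):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n:5d}  pass_rate=0.80  95% CI=({low:.3f}, {high:.3f})")
```

At 50 examples the interval spans roughly 22 percentage points, so a two-point change between runs cannot be distinguished from noise. At 2000 examples the interval narrows to under 4 points.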

Agents should report the sample definition, metric, number of examples, and uncertainty method alongside every evaluation result. Release decisions are stronger when they rest on confidence intervals or stratified analysis rather than on a single point estimate.
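
One possible shape for such a report is sketched below. The field names, the strata labels, and the ten graded results are all hypothetical. The uncertainty method here is a plain binomial standard error, chosen for brevity rather than as a recommendation.

```python
import math
from collections import defaultdict

# Hypothetical graded results: (stratum label, passed?) for each example.
results = [
    ("math", True), ("math", False), ("math", True), ("math", True),
    ("coding", True), ("coding", False), ("coding", False), ("coding", True),
    ("summarization", True), ("summarization", True),
]


def summarize(rows):
    """Return n, pass rate, and binomial standard error for a list of booleans."""
    n = len(rows)
    rate = sum(rows) / n
    se = math.sqrt(rate * (1 - rate) / n)
    return {"n": n, "pass_rate": round(rate, 3), "std_error": round(se, 3)}


# Group pass/fail outcomes by stratum so each task type is reported separately.
by_stratum = defaultdict(list)
for stratum, passed in results:
    by_stratum[stratum].append(passed)

report = {
    "sample_definition": "hypothetical 10-example smoke set, fixed task mix",
    "metric": "pass rate (binary grader)",
    "uncertainty_method": "binomial standard error (normal approximation)",
    "overall": summarize([passed for _, passed in results]),
    "strata": {name: summarize(rows) for name, rows in by_stratum.items()},
}

for key, value in report.items():
    print(key, "->", value)
```

The per-stratum breakdown shows how an overall score can hide a weak task type. It also shows a limit of the normal approximation: the two-example summarization stratum reports a standard error of zero, which understates the real uncertainty. For small strata, a Wilson or bootstrap interval is a safer choice.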

## Source-Mapped Facts

- SciPy bootstrap documentation describes a function that computes a two-sided bootstrap confidence interval for a statistic. ([source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html))
- statsmodels proportion_confint documentation describes methods for confidence intervals around a binomial proportion. ([source](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportion_confint.html))
- scikit-learn cross-validation documentation says cross-validation evaluates generalization performance by splitting data into training and testing subsets. ([source](https://scikit-learn.org/stable/modules/cross_validation.html))
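
As an illustration of the bootstrap idea referenced above, the following is a hand-rolled percentile bootstrap using only the Python standard library. It is not SciPy's implementation, and `scipy.stats.bootstrap` is the library route. The function name `paired_bootstrap_diff`, its parameters, and the simulated scores are assumptions made for this sketch. It compares two model versions scored on the same examples by resampling the per-example differences.

```python
import random


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a) on paired examples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per example")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high


# Hypothetical pass/fail (1/0) outcomes for two model versions on the same 100 items.
rng = random.Random(42)
model_a = [1 if rng.random() < 0.70 else 0 for _ in range(100)]
model_b = [1 if rng.random() < 0.74 else 0 for _ in range(100)]

point, low, high = paired_bootstrap_diff(model_a, model_b)
print(f"difference={point:+.3f}, 95% CI=({low:+.3f}, {high:+.3f})")
if low <= 0 <= high:
    print("Interval includes zero: the observed change may be noise.")
```

Resampling paired differences, rather than each model's scores independently, keeps example difficulty fixed across the two versions. This usually gives a tighter interval when both models are evaluated on the same items.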

## Further Reading

- [SciPy Bootstrap](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html)
- [statsmodels Proportion Confidence Interval](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportion_confint.html)
- [scikit-learn Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html)