LLM Evaluation A/B Tests and Online Experiments

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Before a model or prompt variant is declared better, an LLM A/B test should report how traffic was assigned to variants, the metric definitions, the sample size, the statistical power, and the multiple-testing policy.

## Core Explanation

Offline evaluations are controlled but incomplete. Online experiments show how a model behaves with real users, traffic, latency, costs, and guardrails. Even so, an online experiment can mislead if the sample is underpowered (too small to detect the effect it targets), the metric is delayed (the outcome arrives after the readout), or many slices are checked without controlling false discoveries.
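To make the "underpowered" risk concrete, here is a minimal sketch of a per-arm sample-size estimate for a binary success metric. It uses the standard normal-approximation formula for comparing two proportions and only the Python standard library. The function name `sample_size_per_arm` and the baseline and lift rates are illustrative assumptions, not values from any real experiment.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    Normal approximation with unpooled variance; a planning estimate,
    not an exact calculation.
    """
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical rates: 10% baseline success, 2-point vs 1-point absolute lift.
n_large_lift = sample_size_per_arm(0.10, 0.12)
n_small_lift = sample_size_per_arm(0.10, 0.11)
print(n_large_lift, n_small_lift)
```

Under these assumptions, halving the detectable lift roughly quadruples the required exposure per arm, which is why an experiment sized for a large effect says little about a small one.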

Agents reviewing an online LLM experiment should collect treatment labels, model and prompt versions, exposure counts, primary metric, guardrail metrics, exclusion rules, statistical test, power assumption, and rollback thresholds.
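The review checklist above can be sketched as a small record type that flags whatever has not been collected yet. The class name `ExperimentReview`, the field names, and the example values are hypothetical; they simply mirror the items listed in the paragraph.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExperimentReview:
    """One field per item a reviewer should collect before a readout."""
    treatment_labels: Optional[list[str]] = None
    model_versions: Optional[dict[str, str]] = None
    prompt_versions: Optional[dict[str, str]] = None
    exposure_counts: Optional[dict[str, int]] = None
    primary_metric: Optional[str] = None
    guardrail_metrics: Optional[list[str]] = None
    exclusion_rules: Optional[list[str]] = None
    statistical_test: Optional[str] = None
    power_assumption: Optional[str] = None
    rollback_thresholds: Optional[dict[str, float]] = None

    def missing_fields(self) -> list[str]:
        # Treat None and empty containers/strings as "not collected".
        # If an item is intentionally empty, record that explicitly,
        # e.g. exclusion_rules=["none"].
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical, partially filled review.
review = ExperimentReview(
    treatment_labels=["control", "variant_b"],
    exposure_counts={"control": 5000, "variant_b": 5000},
    primary_metric="task_success_rate",
)
missing = review.missing_fields()
print(missing)
```

A reviewer could refuse to sign off on a readout while `missing_fields()` is non-empty.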

## Source-Mapped Facts

- statsmodels proportions_ztest documentation describes a test for proportions based on the normal z test. ([source](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html))
- statsmodels TTestIndPower.solve_power documentation describes solving for any one parameter of the power of a two-sample t-test. ([source](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestIndPower.solve_power.html))
- SciPy false_discovery_control documentation says the function adjusts p-values to control the false discovery rate. ([source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.false_discovery_control.html))

## Further Reading

- [statsmodels proportions_ztest](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html)
- [statsmodels TTestIndPower.solve_power](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.TTestIndPower.solve_power.html)
- [SciPy false_discovery_control](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.false_discovery_control.html)