LLM Evaluation Calibration and Thresholds

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Calibration and thresholds turn raw evaluation scores into release decisions that can be audited.

## Core Explanation

LLM evaluations often produce scores, labels, pass rates, or judge outputs. Those numbers are only useful when the team knows which thresholds correspond to "ship," "block," or "send to human review." Calibration asks whether a score behaves like the confidence or risk signal it is meant to be: for example, whether outputs scored around 0.8 turn out to be acceptable about 80% of the time.
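As a rough illustration of the calibration question, the sketch below bins scores and compares each bin's mean score with the observed pass frequency. It assumes scores lie in [0, 1] and outcomes are 0/1 labels from human review. The function names, the bin count, and the two cut-offs in `decide` are hypothetical choices for this example, not values taken from any of the cited tools.

```python
def calibration_table(scores, outcomes, n_bins=5):
    """Bin (score, outcome) pairs into equal-width bins over [0, 1].

    Returns one (count, mean_score, observed_frequency) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for score, outcome in zip(scores, outcomes):
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, outcome))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((len(members), mean_score, observed))
    return rows


def expected_calibration_error(scores, outcomes, n_bins=5):
    """Weighted average gap between mean score and observed frequency."""
    rows = calibration_table(scores, outcomes, n_bins)
    total = sum(count for count, _, _ in rows)
    return sum(count / total * abs(mean - obs) for count, mean, obs in rows)


def decide(score, block_below=0.5, ship_at=0.9):
    """Map a score to a release decision (cut-offs are illustrative only)."""
    if score >= ship_at:
        return "ship"
    if score < block_below:
        return "block"
    return "human_review"
```

A small gap between mean score and observed frequency in every bin suggests the score can be read as a probability. A large gap suggests the thresholds should be set from observed outcomes rather than from the raw score.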

Agents should not treat a single aggregate score as universal truth. They should report the threshold, metric definition, sample size, failure examples, and whether the threshold was chosen before or after seeing the current results, since a threshold picked after the fact can be tuned to make the current run pass.
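One way to keep those fields together is a small report object. The following is a minimal sketch, assuming each result is a `(case_id, passed)` pair. The `EvalReport` class and its field names are hypothetical and not part of any library mentioned here.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    metric: str                     # metric definition, in plain words
    threshold: float                # minimum pass rate required to ship
    threshold_set_before_run: bool  # False if chosen after seeing results
    results: list                   # list of (case_id, passed) pairs

    @property
    def sample_size(self):
        return len(self.results)

    @property
    def pass_rate(self):
        if not self.results:
            return 0.0
        return sum(1 for _, passed in self.results if passed) / len(self.results)

    def failure_examples(self, limit=3):
        """Return up to `limit` case ids that failed, in input order."""
        failed = [case_id for case_id, passed in self.results if not passed]
        return failed[:limit]

    def summary(self):
        """Collect everything a reviewer needs to audit the decision."""
        return {
            "metric": self.metric,
            "threshold": self.threshold,
            "threshold_set_before_run": self.threshold_set_before_run,
            "sample_size": self.sample_size,
            "pass_rate": self.pass_rate,
            "meets_threshold": self.pass_rate >= self.threshold,
            "failure_examples": self.failure_examples(),
        }
```

Reporting `sample_size` next to `pass_rate` matters because the same pass rate carries much less weight on four cases than on four hundred.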

## Source-Mapped Facts

- scikit-learn documentation says well-calibrated classifiers output probabilities that match observed event frequencies. ([source](https://scikit-learn.org/stable/modules/calibration.html))
- Promptfoo documentation says the `eval` command returns exit code 100 when at least one test fails or when the pass rate is below a configured threshold. ([source](https://www.promptfoo.dev/docs/usage/command-line/))
- LangSmith documentation describes evaluators as functions that score application outputs. ([source](https://docs.langchain.com/langsmith/evaluation-concepts))
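To make the last two facts concrete, here is a sketch of an evaluator written as a plain function that scores an output, followed by a pass-rate gate that returns a non-zero exit code on failure. This is not Promptfoo's or LangSmith's implementation or API. `exact_match` and `gate_exit_code` are hypothetical names, and the value 100 is used only to mirror the exit-code behavior described above.

```python
def exact_match(output, expected):
    """Evaluator: score 1.0 if the output matches the expected text, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def gate_exit_code(scores, pass_score=1.0, min_pass_rate=1.0):
    """Return 0 if the pass rate meets `min_pass_rate`, otherwise 100.

    With the default `min_pass_rate` of 1.0, a single failing case is
    enough to produce the non-zero code.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    passed = sum(1 for score in scores if score >= pass_score)
    pass_rate = passed / len(scores)
    return 0 if pass_rate >= min_pass_rate else 100


# Hypothetical test cases: (model output, expected answer).
cases = [("Paris", "Paris"), ("paris ", "Paris"), ("Berlin", "Berlin")]
scores = [exact_match(output, expected) for output, expected in cases]
exit_code = gate_exit_code(scores, min_pass_rate=0.6)
```

In a CI job, a script like this would typically end by passing `exit_code` to `sys.exit` so the pipeline blocks the release when the gate fails.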

## Further Reading

- [scikit-learn Probability Calibration](https://scikit-learn.org/stable/modules/calibration.html)
- [Promptfoo Command Line](https://www.promptfoo.dev/docs/usage/command-line/)
- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)