LLM Evaluation: Braintrust Experiments, Datasets, and Scorers

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

Before treating a pass rate as a regression signal, agents should tie each Braintrust evaluation result to the dataset, experiment, evaluated prompt or workflow, and scorers that produced it.

## Core Explanation

An LLM evaluation result is only useful when the run that produced it can be reproduced. In Braintrust-style workflows, agents should preserve the dataset, experiment name, evaluated prompt or workflow, scorer definitions, judge model, run mode, score outputs, and reviewed examples, so that a later run can be compared against it like for like.
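One way to keep those items together is a small run record. The sketch below is plain Python and does not use the Braintrust SDK; the class name `EvalRunRecord`, its field names, and the `config_fingerprint` helper are all hypothetical, chosen only to mirror the list above.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalRunRecord:
    """Hypothetical record of one evaluation run (illustrative field names)."""

    dataset: str
    experiment_name: str
    target: str               # identifier of the evaluated prompt or workflow
    scorer_definitions: dict  # scorer name -> scorer configuration
    judge_model: str
    run_mode: str             # e.g. "local" or "experiment" (assumed labels)
    scores: dict = field(default_factory=dict)
    reviewed_examples: tuple = ()

    def config_fingerprint(self) -> str:
        """Hash only the inputs that define the run, not its name or results."""
        config = {
            "dataset": self.dataset,
            "target": self.target,
            "scorer_definitions": self.scorer_definitions,
            "judge_model": self.judge_model,
            "run_mode": self.run_mode,
        }
        # sort_keys makes the serialization independent of dict ordering.
        payload = json.dumps(config, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Two runs with the same fingerprint were configured identically, so their scores can be compared directly; a differing fingerprint means something other than the results changed.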

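A failed score can come from the application, the dataset, the scorer, or the judge. If the dataset, scorer, or judge changed between two runs, the score difference may not reflect the application at all, so agents should compare experiment metadata and scorer configuration before changing prompts or retrieval code.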
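That comparison can be sketched as a diff over two run configurations. The dictionary keys and the `NON_APPLICATION_KEYS` grouping below are assumptions made for illustration, not a Braintrust schema.

```python
from typing import Any

# Assumed key names: settings that can move a score without any
# change to the application under test.
NON_APPLICATION_KEYS = ("dataset", "scorer_definitions", "judge_model", "run_mode")


def diff_config(
    baseline: dict[str, Any], candidate: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {key: (baseline_value, candidate_value)} for every differing key."""
    keys = sorted(set(baseline) | set(candidate))
    return {
        key: (baseline.get(key), candidate.get(key))
        for key in keys
        if baseline.get(key) != candidate.get(key)
    }


def confounders(baseline: dict[str, Any], candidate: dict[str, Any]) -> list[str]:
    """List changed settings that are not part of the application itself."""
    changed = diff_config(baseline, candidate)
    return [key for key in NON_APPLICATION_KEYS if key in changed]
```

An empty `confounders` result suggests the score change is more plausibly due to the prompt or workflow; a non-empty one points at the dataset, scorer, or judge as the first thing to review.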

## Source-Mapped Facts

- Braintrust evaluation documentation describes creating experiments by selecting prompts, workflows, or scorers to evaluate. ([source](https://www.braintrust.dev/docs/evaluate/run-evaluations))
- Braintrust evaluation documentation describes selecting an existing dataset from an organization when creating an experiment. ([source](https://www.braintrust.dev/docs/evaluate/run-evaluations))
- Braintrust evaluation documentation says local evaluations can be run without creating an experiment in Braintrust. ([source](https://www.braintrust.dev/docs/evaluate/run-evaluations))
- Braintrust scorer documentation describes LLM-as-a-judge and custom code scorers. ([source](https://www.braintrust.dev/docs/evaluate/write-scorers))
- Braintrust scorer documentation says scorers and classifiers are used to measure output quality. ([source](https://www.braintrust.dev/docs/evaluate/write-scorers))
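To make the ideas of a local run and a custom code scorer concrete, here is a minimal sketch in plain Python. It does not call the Braintrust SDK; `exact_match`, `run_local_eval`, and the `input`/`expected` case keys are hypothetical names, and the 0-to-1 score range is an assumption of this example.

```python
from statistics import mean
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Code scorer: 1.0 when the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_local_eval(
    dataset: list[dict[str, str]],
    task: Callable[[str], str],
    scorers: dict[str, Callable[[str, str], float]],
) -> tuple[list[dict], dict[str, float]]:
    """Run the task on each case, score it, and average each scorer."""
    rows = []
    for case in dataset:
        output = task(case["input"])
        scores = {name: fn(output, case["expected"]) for name, fn in scorers.items()}
        rows.append({"input": case["input"], "output": output, "scores": scores})
    if not rows:
        return rows, {}  # no cases, so there is nothing to average
    summary = {name: mean(row["scores"][name] for row in rows) for name in scorers}
    return rows, summary
```

Keeping the per-case rows alongside the summary lets a reviewer inspect individual failures instead of relying on the average alone.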

## Further Reading

- [Braintrust Create Experiments](https://www.braintrust.dev/docs/evaluate/run-evaluations)
- [Braintrust Scorers](https://www.braintrust.dev/docs/evaluate/write-scorers)