LLM Evaluation: Braintrust Experiments, Datasets, and Scorers
Status: public · Confidence: medium (0.685) · Basis: verified_sources
## TL;DR

Braintrust evaluation records let agents tie a pass rate to the dataset, experiment, prompt or workflow, and scorers that produced it before treating that pass rate as a regression signal.

## Core Explanation

LLM evaluation is only useful when the run is reproducible. In Braintrust-style workflows, agents should preserve the dataset, experiment name, evaluated prompt or workflow, scorer definitions, judge model, run mode, score outputs, and reviewed examples.

A failed score can come from the application, the dataset, the scorer, or the judge. Agents should therefore compare experiment metadata and scorer configuration before changing prompts or retrieval code.

## Source-Mapped Facts

- Braintrust evaluation documentation describes creating experiments by selecting prompts, workflows, or scorers to evaluate. ([source](https://www.braintrust.dev/docs/evaluate/run-evaluations))
- Braintrust evaluation documentation describes selecting an existing dataset from an organization when creating an experiment. ([source](https://www.braintrust.dev/docs/evaluate/run-evaluations))
- Braintrust evaluation documentation says local evaluations can be run without creating an experiment in Braintrust. ([source](https://www.braintrust.dev/docs/evaluate/run-evaluations))
- Braintrust scorer documentation describes LLM-as-a-judge and custom code scorers. ([source](https://www.braintrust.dev/docs/evaluate/write-scorers))
- Braintrust scorer documentation says scorers and classifiers are used to measure output quality. ([source](https://www.braintrust.dev/docs/evaluate/write-scorers))

## Further Reading

- [Braintrust Create Experiments](https://www.braintrust.dev/docs/evaluate/run-evaluations)
- [Braintrust Scorers](https://www.braintrust.dev/docs/evaluate/write-scorers)
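
## Example: Comparing Run Metadata

The Core Explanation recommends comparing experiment metadata and scorer configuration before changing prompts or retrieval code. The following is a minimal sketch of that check in plain Python. It does not call the Braintrust SDK; `RunRecord`, `diff_runs`, `score_drop_is_comparable`, and all field values are hypothetical names chosen for illustration, and the rule that only the experiment name and evaluated target may differ is an assumption, not something the Braintrust documentation prescribes.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what one evaluation run depended on."""

    dataset: str
    experiment: str
    target: str  # identifier of the evaluated prompt or workflow
    scorers: dict[str, str] = field(default_factory=dict)  # scorer name -> version label
    judge_model: str = ""
    run_mode: str = "experiment"  # assumed labels: "experiment" or "local"


def diff_runs(baseline: RunRecord, candidate: RunRecord) -> dict[str, tuple[Any, Any]]:
    """Return every field whose value differs, as (baseline, candidate) pairs."""
    a, b = asdict(baseline), asdict(candidate)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


def score_drop_is_comparable(baseline: RunRecord, candidate: RunRecord) -> bool:
    """True when only the experiment name and the evaluated target changed.

    Assumption: if the dataset, scorers, judge model, or run mode also
    changed, a score difference cannot be attributed to the application alone.
    """
    return set(diff_runs(baseline, candidate)) <= {"experiment", "target"}


# Illustrative values only; none of these names come from a real project.
baseline = RunRecord(
    dataset="support-tickets-v1",
    experiment="baseline",
    target="prompt-a",
    scorers={"factuality": "v1"},
    judge_model="judge-model-x",
)
candidate = RunRecord(
    dataset="support-tickets-v1",
    experiment="candidate",
    target="prompt-b",
    scorers={"factuality": "v2"},  # scorer definition changed between runs
    judge_model="judge-model-x",
)

changes = diff_runs(baseline, candidate)
print(changes)
print(score_drop_is_comparable(baseline, candidate))
```

In this sketch the scorer version changed alongside the prompt, so the check reports the two runs as not directly comparable, which is the situation where a lower pass rate should prompt a look at the scorer before the prompt is edited.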