LLM Evaluation Phoenix Datasets and Experiments
Status: public · Confidence: medium (0.685) · Basis: verified_sources
## TL;DR

Phoenix datasets and experiments give LLM evaluation agents a concrete unit of comparison: examples, task outputs, evaluator scores, repetitions, and metadata slices.

## Core Explanation

LLM application changes are easy to misjudge from a handful of transcripts. Phoenix treats datasets as the test cases that matter and experiments as repeatable runs over those examples. That lets agents compare prompts, models, retrievers, and tool logic with structured scores instead of anecdotal impressions.

Agents should preserve dataset name, dataset version, example IDs, inputs, expected outputs, metadata, application version, evaluator configuration, repetitions, score distribution, and linked traces. Without those fields, a score change cannot be separated from dataset drift or evaluator drift.

## Source-Mapped Facts

- Phoenix documentation says datasets are structured collections of representative examples used to systematically test an application. ([source](https://arize.com/docs/phoenix/datasets-and-experiments))
- Phoenix documentation says each dataset example can capture application input, expected output, and metadata such as tags, error types, or model parameters. ([source](https://arize.com/docs/phoenix/datasets-and-experiments))
- Phoenix experiment documentation says experiments execute a task across all dataset examples and collect evaluation results. ([source](https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments))
- Phoenix experiment documentation describes LLM evaluators and code evaluators as options for scoring task outputs. ([source](https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments))
- Phoenix experiment documentation says repetitions run tasks multiple times to measure variance and consistency. ([source](https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments))

## Further Reading

- [Phoenix Datasets and Experiments](https://arize.com/docs/phoenix/datasets-and-experiments)
- [Phoenix Experiments](https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments)
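
## Illustrative Sketch

The Core Explanation argues that a score change is only interpretable when dataset and evaluator identity are preserved alongside the scores. The sketch below illustrates that idea in plain Python, assuming a run can be reduced to a few identifying fields plus per-example scores across repetitions. It does not use the Phoenix client: `RunRecord`, `drift_reasons`, `summarize`, and `mean_delta` are hypothetical names invented for this example, and the choice to represent evaluator configuration as a single string label is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one experiment run (not a Phoenix type)."""

    dataset_name: str
    dataset_version: str
    app_version: str
    evaluator_config: str  # assumed: a label or hash of evaluator settings
    # example_id -> one score per repetition
    scores: dict[str, list[float]] = field(default_factory=dict)


def drift_reasons(a: RunRecord, b: RunRecord) -> list[str]:
    """List reasons why two runs should not be compared score-to-score."""
    reasons = []
    if (a.dataset_name, a.dataset_version) != (b.dataset_name, b.dataset_version):
        reasons.append("dataset drift")
    if a.evaluator_config != b.evaluator_config:
        reasons.append("evaluator drift")
    if set(a.scores) != set(b.scores):
        reasons.append("example IDs differ")
    return reasons


def summarize(run: RunRecord) -> dict[str, tuple[float, float]]:
    """Per-example (mean, population std dev) across repetitions."""
    return {ex: (mean(s), pstdev(s)) for ex, s in run.scores.items()}


def mean_delta(baseline: RunRecord, candidate: RunRecord) -> float:
    """Average per-example change in mean score, refusing to compare on drift."""
    reasons = drift_reasons(baseline, candidate)
    if reasons:
        raise ValueError("runs are not comparable: " + ", ".join(reasons))
    base, cand = summarize(baseline), summarize(candidate)
    return mean(cand[ex][0] - base[ex][0] for ex in base)
```

Under these assumptions, two runs that share a dataset version and evaluator label can be passed to `mean_delta`, while a run against a changed dataset raises an error instead of silently producing a misleading difference. The per-example standard deviation from `summarize` is one simple way to see the variance that repetitions are meant to expose.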