# LLM Evaluation Run Metadata and Reproducibility
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-03
Generation: ai_structured


## TL;DR

An LLM eval result is weak evidence unless the run metadata records exactly what was tested and how the result can be compared later.

## Core Explanation

Evaluation runs should capture the model identifier, prompt version, tool schema version, dataset version, randomization settings, retrieval snapshot, grader version, metric definitions, and code commit. Without that metadata, anyone investigating a later regression cannot tell whether quality changed because of the model, prompt, dataset, retriever, or evaluator.
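
As a minimal sketch, the fields listed above can be collected into one immutable record that is serialized alongside the results. The class name `EvalRunMetadata`, the helper `changed_fields`, the field names, and every example value are hypothetical, chosen only for illustration; they are not taken from any particular tracking tool.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalRunMetadata:
    """Hypothetical record of everything that defines one evaluation run."""

    run_id: str
    model_id: str
    prompt_version: str
    tool_schema_version: str
    dataset_version: str
    random_seed: int
    temperature: float
    retrieval_snapshot: str
    grader_version: str
    metric_definitions_version: str
    code_commit: str

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


def changed_fields(a: EvalRunMetadata, b: EvalRunMetadata) -> dict[str, tuple]:
    """Return {field: (old, new)} for every field that differs, ignoring run_id."""
    before, after = asdict(a), asdict(b)
    return {
        key: (before[key], after[key])
        for key in before
        if key != "run_id" and before[key] != after[key]
    }


# Illustrative values only.
baseline = EvalRunMetadata(
    run_id="run-001",
    model_id="example-model-v1",
    prompt_version="p-7",
    tool_schema_version="tools-2",
    dataset_version="qa-set-4",
    random_seed=1234,
    temperature=0.0,
    retrieval_snapshot="index-snapshot-a",
    grader_version="g-2",
    metric_definitions_version="metrics-1",
    code_commit="abc1234",
)
candidate = replace(baseline, run_id="run-002", prompt_version="p-8", grader_version="g-3")

print(changed_fields(baseline, candidate))
# {'prompt_version': ('p-7', 'p-8'), 'grader_version': ('g-2', 'g-3')}
```

Here two things changed between the runs, so a score difference cannot be attributed to the prompt alone. Comparing metadata before comparing scores makes that confound visible.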

Agents should preserve run IDs and artifact links in issue comments, CI summaries, and experiment dashboards, so that a reported score can always be traced back to the run that produced it. For online evals, they should also record traffic assignment and guardrail metrics; for offline evals, they should record input hashes and expected-answer versions.
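
For the offline case, one possible way to record input hashes is to hash a canonical JSON form of each example and combine those into a dataset fingerprint. This is a sketch under stated assumptions: the function names `hash_input` and `dataset_fingerprint`, the choice of SHA-256, the canonicalization rules, and the `expected_answer_version` label are all illustrative, not a prescribed scheme.

```python
from __future__ import annotations

import hashlib
import json


def hash_input(example: dict) -> str:
    """Hash one eval input in a canonical form, so key order does not matter."""
    canonical = json.dumps(
        example, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_fingerprint(examples: list[dict]) -> str:
    """Combine per-example hashes into one digest, independent of example order."""
    digest = hashlib.sha256()
    for item_hash in sorted(hash_input(example) for example in examples):
        digest.update(item_hash.encode("ascii"))
    return digest.hexdigest()


# Illustrative examples only.
examples = [
    {"id": "q1", "question": "What is the capital of France?"},
    {"id": "q2", "question": "What is 2 + 2?"},
]

offline_record = {
    "dataset_fingerprint": dataset_fingerprint(examples),
    "input_hashes": {example["id"]: hash_input(example) for example in examples},
    "expected_answer_version": "answers-v3",  # hypothetical version label
}
print(json.dumps(offline_record, indent=2))
```

If a later run reports a different fingerprint, the per-example hashes show which inputs were edited, added, or removed, which separates dataset drift from a genuine change in model quality.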

## Source-Mapped Facts

- MLflow Tracking documentation says its tracking APIs provide functions to track runs and log parameters and metrics. ([source](https://mlflow.github.io/mlflow-website/docs/latest/ml/tracking/))
- Weights & Biases documentation describes workflows for creating experiments, configuring experiments, logging experiment data, and viewing experiment results. ([source](https://docs.wandb.ai/guides/track/))
- DVC experiment documentation says ML experiments can be saved as they run or after they complete, and that DVC can track and compare parameters, metrics, and plots. ([source](https://dvc.org/doc/user-guide/experiment-management))

## Further Reading

- [MLflow Experiment Tracking](https://mlflow.github.io/mlflow-website/docs/latest/ml/tracking/)
- [Weights & Biases Experiments](https://docs.wandb.ai/guides/track/)
- [DVC Experiments](https://dvc.org/doc/user-guide/experiment-management)