LLM Evaluation: Langfuse Datasets, Experiments, and Scores

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Before treating a Langfuse score as a regression signal, an agent should trace it back to the dataset, dataset run, task, trace, and experiment that produced it.

## Core Explanation

LLM evaluation systems often mix offline datasets with production traces. In Langfuse, agents should record whether a score came from a dataset experiment, a trace, a session, an observation, or a human annotation workflow, because scores from different sources are not directly comparable even when they share a name.

Useful evidence includes dataset name and version, dataset item ID, experiment or run name, task code version, score name, score value, evaluator implementation, trace ID, prompt version, model version, and whether the result was produced offline or online.
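The evidence fields listed above can be kept together in one record so that gaps are easy to spot. The following is a minimal sketch, not a Langfuse type: `EvalEvidence`, `OFFLINE_REQUIRED`, `ONLINE_REQUIRED`, and `missing_fields` are hypothetical names, and the choice of which fields count as required is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvalEvidence:
    """Hypothetical provenance record for one score; not a Langfuse type."""
    score_name: str
    score_value: float
    origin: str  # "offline" (dataset experiment) or "online" (production trace)
    dataset_name: Optional[str] = None
    dataset_version: Optional[str] = None
    dataset_item_id: Optional[str] = None
    run_name: Optional[str] = None
    task_code_version: Optional[str] = None
    evaluator: Optional[str] = None
    trace_id: Optional[str] = None
    prompt_version: Optional[str] = None
    model_version: Optional[str] = None


# Assumption: the minimum provenance needed before comparing scores.
OFFLINE_REQUIRED = ("dataset_name", "dataset_item_id", "run_name", "evaluator")
ONLINE_REQUIRED = ("trace_id", "evaluator")


def missing_fields(ev: EvalEvidence) -> List[str]:
    """Return the required provenance fields that are still unset."""
    required = OFFLINE_REQUIRED if ev.origin == "offline" else ONLINE_REQUIRED
    return [name for name in required if getattr(ev, name) is None]


ev = EvalEvidence(
    score_name="accuracy",
    score_value=0.8,
    origin="offline",
    dataset_name="qa-set",
    run_name="run-1",
)
print(missing_fields(ev))  # ['dataset_item_id', 'evaluator']
```

An agent could refuse to compare two scores until `missing_fields` returns an empty list for both records.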

## Source-Mapped Facts

- Langfuse evaluation documentation describes datasets, dataset items, tasks, scores, and experiments as core building blocks for evaluation. ([source](https://langfuse.com/docs/evaluation/concepts))
- Langfuse evaluation documentation says evaluation methods can use dataset items and task outputs to produce scores based on user-defined criteria. ([source](https://langfuse.com/docs/evaluation/concepts))
- Langfuse evaluation documentation describes scores as the universal data object for storing evaluation results. ([source](https://langfuse.com/docs/evaluation/scores/overview))
- Langfuse scores documentation says scores can be attached to traces, sessions, observations, or dataset runs. ([source](https://langfuse.com/docs/evaluation/scores/overview))
- Langfuse experiments documentation says scores from run-level evaluators on Langfuse datasets are attached to the full dataset run for tracking overall experiment performance. ([source](https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk))
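
Because scores can be attached to different objects, it may help to group them by attachment target before aggregating. The sketch below works on plain dictionaries and does not call the Langfuse SDK; the keys `observation_id`, `dataset_run_id`, `session_id`, `trace_id`, `name`, and `value` are assumed field names for illustration, as are the helper names `attachment_target` and `mean_by_target`.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def attachment_target(score: dict) -> str:
    """Classify a score by the most specific ID it carries (assumed keys)."""
    if score.get("observation_id"):
        return "observation"
    if score.get("dataset_run_id"):
        return "dataset_run"
    if score.get("session_id"):
        return "session"
    if score.get("trace_id"):
        return "trace"
    raise ValueError("score has no attachment target")


def mean_by_target(scores: List[dict]) -> Dict[Tuple[str, str], float]:
    """Average values per (target, score name), never across targets."""
    buckets = defaultdict(list)
    for s in scores:
        buckets[(attachment_target(s), s["name"])].append(s["value"])
    return {key: mean(vals) for key, vals in buckets.items()}


scores = [
    {"name": "accuracy", "value": 1.0, "trace_id": "t1"},
    {"name": "accuracy", "value": 0.0, "trace_id": "t2"},
    {"name": "accuracy", "value": 0.75, "dataset_run_id": "r1"},
]
print(mean_by_target(scores))
```

Keeping trace-level and run-level averages in separate buckets avoids blending an aggregate experiment metric with per-trace results that happen to share a score name.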

## Further Reading

- [Langfuse Evaluation Concepts](https://langfuse.com/docs/evaluation/concepts)
- [Langfuse Scores Overview](https://langfuse.com/docs/evaluation/scores/overview)
- [Langfuse Experiments via SDK](https://langfuse.com/docs/evaluation/experiments/experiments-via-sdk)