LLM Evaluation Inter-Annotator Agreement

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Inter-annotator agreement (IAA) measures how consistently human graders apply a rubric, which tells LLM evaluation teams how far their human labels can be trusted.

## Core Explanation

Human labels are often treated as gold data, but they are only reliable when annotators interpret the task the same way. Agreement metrics can expose ambiguous rubrics, unstable categories, and individual examples that need adjudication.
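As an illustration of what an agreement metric computes, here is a minimal sketch of Cohen's kappa for two annotators who labeled the same items. It compares the observed agreement rate with the agreement expected by chance from each annotator's label frequencies. The function name `cohen_kappa` and the sample labels are assumptions made for this example and do not come from any of the tools cited in this note.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same category independently,
    # estimated from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is
        # undefined (0/0), so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail judgments on six model responses.
annotator_a = ["pass", "pass", "fail", "fail", "pass", "fail"]
annotator_b = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(round(cohen_kappa(annotator_a, annotator_b), 3))  # 0.333
```

In this sample the annotators agree on four of six items (about 0.67), but half of that agreement is expected by chance, so kappa is only about 0.33. This is why raw percent agreement alone can overstate rubric consistency.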

Agents should use agreement as a quality signal, not a replacement for expert review. Low agreement suggests the evaluation dataset or rubric needs repair before scores are used for release decisions.
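One way to act on this is to route contested items to an expert before their labels feed into any score. The sketch below flags items where the most common label falls short of a chosen majority share. The function name, the data layout (item id mapped to a list of labels), and the `0.75` threshold are all hypothetical choices for illustration; a real project would set its threshold from its own rubric and risk tolerance.

```python
from collections import Counter


def items_needing_adjudication(annotations, min_majority=0.75):
    """Return ids of items whose top label has less than `min_majority` share.

    `annotations` maps an item id to the list of labels it received.
    The threshold is an assumed example value, not a recommended standard.
    """
    flagged = []
    for item_id, labels in annotations.items():
        if len(labels) < 2:
            # A single label cannot show disagreement; skip it here.
            continue
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count / len(labels) < min_majority:
            flagged.append(item_id)
    return flagged


# Hypothetical labels from three graders per item.
sample = {
    "q1": ["good", "good", "good"],
    "q2": ["good", "bad", "good"],
    "q3": ["bad", "good", "neutral"],
}
print(items_needing_adjudication(sample))  # ['q2', 'q3']
```

The flagged list is a queue for expert review, not an automatic relabeling step, which keeps agreement in its role as a quality signal.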

## Source-Mapped Facts

- Label Studio statistics documentation includes inter-annotator agreement metrics for annotation projects. ([source](https://labelstud.io/guide/stats.html))
- Argilla annotation metrics documentation includes Fleiss' kappa and Krippendorff's alpha metrics. ([source](https://docs.v1.argilla.io/en/v2.2.0/reference/python/python_annotation_metrics.html))
- Prodigy API documentation includes an inter-annotator agreement component for comparing annotations. ([source](https://prodi.gy/docs/api-components#iaa))
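For readers who want to see what a multi-annotator metric such as Fleiss' kappa computes, here is a minimal standalone sketch. It is not the implementation used by Argilla, Label Studio, or Prodigy; the function name `fleiss_kappa` and the input layout are assumptions for this example, and it requires every item to have the same number of ratings.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items rated by the same number of annotators each.

    `ratings` is a list of per-item label lists, e.g. [["a", "a", "b"], ...].
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_items = len(ratings)
    category_totals = Counter()
    per_item_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        squared = sum(c * c for c in counts.values())
        per_item_agreement += (squared - n_raters) / (n_raters * (n_raters - 1))

    observed = per_item_agreement / n_items
    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())

    if expected == 1.0:
        # Only one category was ever used; kappa is undefined (0/0),
        # so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: four items, three annotators each.
example_ratings = [
    ["a", "a", "a"],
    ["a", "a", "b"],
    ["b", "b", "b"],
    ["a", "b", "b"],
]
print(round(fleiss_kappa(example_ratings), 3))  # 0.333
```

Krippendorff's alpha, also listed above, additionally handles missing ratings and non-nominal label scales, so it is often preferred when annotators do not all rate every item.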

## Further Reading

- [Label Studio Statistics](https://labelstud.io/guide/stats.html)
- [Argilla Annotation Metrics](https://docs.v1.argilla.io/en/v2.2.0/reference/python/python_annotation_metrics.html)
- [Prodigy Inter-Annotator Agreement](https://prodi.gy/docs/api-components#iaa)