Retrieval Evaluation with nDCG and MRR

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

nDCG and MRR help agents evaluate whether retrieval ranks useful evidence near the top of the results, not just whether relevant evidence appears somewhere in them.

## Core Explanation

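The quality of retrieval-augmented generation (RAG) depends heavily on the rank order of retrieved documents. nDCG (normalized discounted cumulative gain) handles graded relevance: it rewards highly relevant documents near the top and discounts them the further down they appear. MRR (mean reciprocal rank) looks only at how early the first relevant result appears and ignores any relevant results after it.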
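
As a rough illustration of the two metrics, the sketch below computes nDCG@k and MRR in plain Python. The function names are hypothetical, the gain is linear (the label itself, not `2**label - 1`), and the ideal ranking is taken from the labels in the list that is passed in, so it assumes every judged document for the query is present in that list.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded labels (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same labels sorted best-first.

    Assumption: `relevances` holds the labels of all judged documents
    for the query, in the order the retriever ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances, threshold=1):
    """1 / rank of the first label at or above `threshold`, or 0.0 if none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs, threshold=1):
    """Average reciprocal rank over a list of per-query label lists."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, threshold) for r in runs) / len(runs)

# Two made-up queries: graded labels in ranked order (0 = not relevant).
runs = [[3, 2, 0, 1], [0, 0, 2, 3]]
print([round(ndcg_at_k(r, k=4), 3) for r in runs])
print(round(mean_reciprocal_rank(runs), 3))
```

In the second query the relevant documents appear only at ranks 3 and 4, so both its nDCG and its reciprocal rank fall well below those of the first query, even though both lists contain the same number of relevant documents.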

Agents should not use these metrics without a judged query set, meaning a fixed set of queries with relevance labels for the candidate documents. A score can look good for the wrong reasons when labels are stale, queries are too narrow, or the evaluation excludes hard failure cases. The evaluation report should therefore include the query set, relevance labels, cutoff depth, and baseline.
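
One way to keep those report fields together is a small record type. This is only a sketch: the `EvalReport` class, its field names, and the example values are all hypothetical placeholders, not results from a real evaluation.

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    """Minimal container for the context a retrieval score needs."""
    query_set: str          # identifier of the judged query set
    label_version: str      # which relevance labels were used
    cutoff: int             # cutoff depth k
    baseline_score: float   # score of the system being compared against
    candidate_score: float  # score of the system under evaluation
    metric: str = "nDCG"

    def delta(self):
        """Candidate score minus baseline score."""
        return self.candidate_score - self.baseline_score

    def summary(self):
        """One-line summary that always carries the evaluation context."""
        return (
            f"{self.metric}@{self.cutoff} on {self.query_set} "
            f"(labels {self.label_version}): "
            f"{self.candidate_score:.3f} vs baseline "
            f"{self.baseline_score:.3f} ({self.delta():+.3f})"
        )

# Placeholder values for illustration only.
report = EvalReport(
    query_set="example-queries",
    label_version="example-labels",
    cutoff=10,
    baseline_score=0.60,
    candidate_score=0.65,
)
print(report.summary())
```

Making the query set, labels, cutoff, and baseline required fields means a score cannot be reported without the context needed to interpret it.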

## Source-Mapped Facts

- scikit-learn documentation describes ndcg_score as computing normalized discounted cumulative gain. ([source](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html))
- Elasticsearch documentation describes the ranking evaluation API as an API for evaluating search quality. ([source](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html))
- TorchMetrics documentation describes RetrievalMRR as computing mean reciprocal rank for retrieval tasks. ([source](https://torchmetrics.readthedocs.io/en/v0.11.0/retrieval/mrr.html))
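
For the Elasticsearch entry above, the following sketch builds a request body for the ranking evaluation API as a plain Python dictionary, without sending it anywhere. The helper name, index name, field name, query text, and document IDs are hypothetical, and the body shape (`requests`, `ratings`, and a `dcg` metric with `normalize` set to true) reflects one reading of the linked documentation, so it should be checked against the Elasticsearch version in use.

```python
import json

def build_rank_eval_body(judgments, index, field="body", k=10):
    """Build a rank-evaluation request body from judged queries.

    judgments: {query_text: {doc_id: rating}}, hypothetical input format.
    index, field: assumed index name and text field to match against.
    """
    requests = []
    for i, (query_text, ratings) in enumerate(judgments.items()):
        requests.append({
            "id": f"query_{i}",
            # The search request whose ranking is being evaluated.
            "request": {"query": {"match": {field: query_text}}},
            # Graded relevance labels for documents judged for this query.
            "ratings": [
                {"_index": index, "_id": doc_id, "rating": rating}
                for doc_id, rating in ratings.items()
            ],
        })
    # A normalized DCG metric at cutoff k, i.e. nDCG@k.
    return {"requests": requests, "metric": {"dcg": {"k": k, "normalize": True}}}

# Made-up judgments for a single query.
body = build_rank_eval_body(
    {"reset password": {"doc-1": 3, "doc-7": 0}},
    index="kb-articles",
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the index's `_rank_eval` endpoint. Replacing the `dcg` entry with a `mean_reciprocal_rank` metric would request MRR instead, again subject to checking the documentation.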

## Further Reading

- [scikit-learn ndcg_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html)
- [Elasticsearch Ranking Evaluation API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html)
- [TorchMetrics Retrieval MRR](https://torchmetrics.readthedocs.io/en/v0.11.0/retrieval/mrr.html)