Retrieval Relevance Judgments and Qrels

Status: public · Confidence: medium (0.83) · Basis: verified_sources

## TL;DR

Qrels are query-document relevance labels; they let RAG and search teams measure retriever quality before judging generated answers.

## Core Explanation

Retrieval evaluation starts with a question: for this query, which documents should count as relevant? A qrels file records that judgment so systems can be compared with metrics such as recall, nDCG, MRR, precision, and average precision.
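As a rough illustration of how a qrels file feeds those metrics, the sketch below parses lines in the four-column TREC layout (`query_id iteration doc_id relevance`) and computes recall@k and reciprocal rank for a ranked result list. The sample data and the function names `parse_qrels`, `recall_at_k`, and `reciprocal_rank` are illustrative assumptions, not part of trec_eval or any other library.

```python
from collections import defaultdict

# Hypothetical sample data: qrels lines in the TREC layout
# "query_id iteration doc_id relevance", plus one ranked run per query.
SAMPLE_QRELS = [
    "q1 0 doc_a 1",
    "q1 0 doc_b 0",
    "q1 0 doc_c 2",
    "q2 0 doc_d 1",
]
SAMPLE_RUN = {
    "q1": ["doc_b", "doc_c", "doc_x"],
    "q2": ["doc_y", "doc_z"],
}


def parse_qrels(lines):
    """Return {query_id: {doc_id: relevance}} from TREC-style qrels lines."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


def recall_at_k(ranked_doc_ids, judged, k):
    """Fraction of relevant documents (relevance > 0) found in the top k."""
    relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_doc_ids[:k])) / len(relevant)


def reciprocal_rank(ranked_doc_ids, judged):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if judged.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    qrels = parse_qrels(SAMPLE_QRELS)
    for query_id, ranking in SAMPLE_RUN.items():
        judged = qrels.get(query_id, {})
        print(query_id, recall_at_k(ranking, judged, 3), reciprocal_rank(ranking, judged))
```

Averaging the reciprocal rank over all queries gives MRR. Documents missing from the qrels are treated here as non-relevant, which is a common convention but still an assumption worth stating when reporting results.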

For agents, qrels are useful because they separate retrieval failure from generation failure. If the relevant evidence was never retrieved, changing the prompt alone cannot fix the answer. If the evidence was retrieved but not used, the problem lies in context packing, attribution, or answer synthesis.
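One way to apply this split is a small triage step per query. The sketch below assumes that, alongside the retrieved document IDs, the pipeline can report which documents the answer actually cited; that `cited_doc_ids` signal, the function name `triage`, and the label strings are hypothetical and would need to match whatever attribution data a given system exposes.

```python
def triage(retrieved_doc_ids, judged, cited_doc_ids):
    """Classify one query's outcome using its qrels.

    judged: {doc_id: relevance} for this query (relevance > 0 means relevant).
    cited_doc_ids: documents the generated answer drew on (assumed signal).
    """
    relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
    if not relevant:
        return "unjudged"  # no relevant labels, so nothing can be concluded
    retrieved_relevant = relevant & set(retrieved_doc_ids)
    if not retrieved_relevant:
        return "retrieval_failure"  # the evidence never reached the generator
    if not retrieved_relevant & set(cited_doc_ids):
        return "generation_failure"  # evidence was retrieved but not used
    return "evidence_used"


if __name__ == "__main__":
    judged = {"doc_a": 1, "doc_b": 0}
    print(triage(["doc_x", "doc_y"], judged, ["doc_x"]))  # retrieval_failure
    print(triage(["doc_a", "doc_x"], judged, ["doc_x"]))  # generation_failure
    print(triage(["doc_a", "doc_x"], judged, ["doc_a"]))  # evidence_used
```

An `evidence_used` label only says that relevant evidence was retrieved and cited; whether the answer itself is correct still needs a separate judgment.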

## Source-Mapped Facts

- NIST describes TREC as providing the infrastructure for large-scale evaluation of text retrieval methodologies. ([source](https://trec.nist.gov/overview.html))
- The trec_eval package is distributed by NIST for evaluating ad hoc retrieval results using TREC-style relevance judgments. ([source](https://trec.nist.gov/trec_eval/))
- The BEIR paper presents a heterogeneous benchmark for zero-shot evaluation of information retrieval models. ([source](https://arxiv.org/abs/2104.08663))

## Further Reading

- [NIST TREC Overview](https://trec.nist.gov/overview.html)
- [NIST trec_eval](https://trec.nist.gov/trec_eval/)
- [BEIR Benchmark Paper](https://arxiv.org/abs/2104.08663)