# LLM Evaluation: Hallucination and Factuality Benchmarks

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR

Hallucination and factuality benchmarks test whether LLM outputs are supported by evidence, or are merely fluent but unsupported.

## Core Explanation

Factuality can be evaluated with question-answering, claim-verification, or hallucination-detection benchmarks. For agent systems, the important distinction is which failure occurred: the model fabricated content, misread retrieved evidence, cited the wrong source, or answered a question for which the evidence was insufficient.
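The four failure modes above can be made explicit in an eval record so that each failing case is counted under exactly one label. The following is a minimal sketch, not a standard taxonomy: `FailureMode`, `JudgedClaim`, `classify`, and every field name are hypothetical, and the annotations are assumed to come from a human or model grader.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class FailureMode(Enum):
    # Hypothetical labels mirroring the four failure modes in the text.
    FABRICATED = "fabricated"
    MISREAD_EVIDENCE = "misread_evidence"
    WRONG_CITATION = "wrong_citation"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"


@dataclass
class JudgedClaim:
    # Grader annotations for one claim the model asserted (all assumed fields).
    evidence_sufficient: bool                        # could the question be answered from the evidence?
    supporting_sources: Set[str] = field(default_factory=set)  # ids of sources that support the claim
    contradicted_by_retrieved: bool = False          # a retrieved passage contradicts the claim
    cited_source: Optional[str] = None               # source id the model cited, if any


def classify(claim: JudgedClaim) -> Optional[FailureMode]:
    """Return the failure mode for a claim, or None if the claim passes."""
    if not claim.evidence_sufficient:
        # The model answered even though the evidence could not settle the question.
        return FailureMode.INSUFFICIENT_EVIDENCE
    if not claim.supporting_sources:
        # Nothing supports the claim: either the evidence was misread or the claim was invented.
        if claim.contradicted_by_retrieved:
            return FailureMode.MISREAD_EVIDENCE
        return FailureMode.FABRICATED
    if claim.cited_source not in claim.supporting_sources:
        # The claim is supported, but not by the source the model pointed to.
        return FailureMode.WRONG_CITATION
    return None
```

The order of the checks is a design choice: a question without enough evidence is labeled first, so that an unanswerable case is never counted as a fabrication or citation error.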

Benchmarks provide reusable test cases, but production systems still need local evals tied to the corpus, citation contract, and freshness boundaries that users rely on.
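A local eval of this kind can start as a small contract check run over every answer. The sketch below assumes a citation contract of "every answer cites at least one document from the local corpus" and a freshness boundary expressed as a cutoff date. `CorpusDoc`, `Answer`, `check_answer`, and the violation strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class CorpusDoc:
    doc_id: str
    last_updated: date  # used to enforce the freshness boundary


@dataclass
class Answer:
    text: str
    cited_doc_ids: List[str]


def check_answer(answer: Answer, corpus: Dict[str, CorpusDoc],
                 freshness_cutoff: date) -> List[str]:
    """Return a list of contract violations; an empty list means the answer passes."""
    problems = []
    if not answer.cited_doc_ids:
        # Assumed contract: an answer without any citation fails.
        problems.append("no_citation")
    for doc_id in answer.cited_doc_ids:
        doc = corpus.get(doc_id)
        if doc is None:
            # The cited id is not in the local corpus.
            problems.append(f"unknown_source:{doc_id}")
        elif doc.last_updated < freshness_cutoff:
            # The cited document is older than the freshness boundary.
            problems.append(f"stale_source:{doc_id}")
    return problems
```

A check like this only verifies that citations exist, resolve, and are fresh. Whether the cited passage actually supports the answer still needs a separate grader.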

## Source-Mapped Facts

- The TruthfulQA paper introduces a benchmark for measuring whether a language model is truthful in generating answers to questions. ([source](https://aclanthology.org/2022.acl-long.229/))
- The FEVER paper introduces a dataset for fact extraction and verification against textual sources. ([source](https://aclanthology.org/N18-1074/))
- The HaluEval paper introduces a large-scale benchmark for evaluating hallucination in large language models. ([source](https://aclanthology.org/2023.emnlp-main.397/))

## Further Reading

- [TruthfulQA Paper](https://aclanthology.org/2022.acl-long.229/)
- [FEVER Paper](https://aclanthology.org/N18-1074/)
- [HaluEval Paper](https://aclanthology.org/2023.emnlp-main.397/)