# LLM Evaluation Hallucination and Factuality Benchmarks

Status: public
Confidence: medium (0.82) (verified)
Last verified: 2026-06-02
Generation: ai_structured

## TL;DR

Hallucination and factuality benchmarks test whether the claims in an LLM's output are supported by evidence, or are fluent but unsupported.

## Core Explanation

Factuality evaluation can use question-answering, claim verification, or hallucination detection benchmarks. For agent systems, the important split is by failure mode: whether the model fabricated content, misread retrieved evidence, cited the wrong source, or answered a question for which there was not enough evidence. Benchmarks provide reusable test cases, but production systems still need local evals tied to the corpus, citation contract, and freshness boundaries that users rely on.

## Source-Mapped Facts

- The TruthfulQA paper introduces a benchmark for measuring whether a language model is truthful in generating answers to questions. ([source](https://aclanthology.org/2022.acl-long.229/))
- The FEVER paper introduces a dataset for fact extraction and verification against textual sources. ([source](https://aclanthology.org/N18-1074/))
- The HaluEval paper introduces a large-scale benchmark for evaluating hallucination in large language models. ([source](https://aclanthology.org/2023.emnlp-main.397/))

## Further Reading

- [TruthfulQA Paper](https://aclanthology.org/2022.acl-long.229/)
- [FEVER Paper](https://aclanthology.org/N18-1074/)
- [HaluEval Paper](https://aclanthology.org/2023.emnlp-main.397/)
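
## Example: Tallying Failure Modes in a Local Eval

The four-way split described in the Core Explanation (fabricated content, misread evidence, wrong citation, answering without enough evidence) could be tracked in a local eval roughly as follows. This is a minimal sketch, not a method taken from any of the cited benchmarks. The names `Failure`, `LabeledAnswer`, `classify_failure`, and `failure_counts` are hypothetical, and the boolean labels are assumed to come from human annotation or a separate judging step that is not shown here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Failure(Enum):
    """Failure modes from the Core Explanation, plus a 'no failure' value."""
    NONE = "none"
    INSUFFICIENT_EVIDENCE = "answered_without_enough_evidence"
    MISREAD_EVIDENCE = "misread_evidence"
    FABRICATED = "fabricated_content"
    WRONG_CITATION = "wrong_citation"


@dataclass(frozen=True)
class LabeledAnswer:
    """One model answer with labels supplied by an annotator (assumed inputs)."""
    evidence_sufficient: bool          # did the corpus contain enough evidence?
    contradicts_evidence: bool         # does the answer conflict with retrieved text?
    claim_supported: bool              # is the claim backed by any retrieved passage?
    cited_source: Optional[str]        # source id the model cited
    supporting_source: Optional[str]   # source id that actually supports the claim


def classify_failure(a: LabeledAnswer) -> Failure:
    """Map annotator labels to a single failure mode.

    The check order is an assumption of this sketch: earlier checks win
    when more than one label indicates a problem.
    """
    if not a.evidence_sufficient:
        return Failure.INSUFFICIENT_EVIDENCE
    if a.contradicts_evidence:
        return Failure.MISREAD_EVIDENCE
    if not a.claim_supported:
        return Failure.FABRICATED
    if a.cited_source != a.supporting_source:
        return Failure.WRONG_CITATION
    return Failure.NONE


def failure_counts(answers: Iterable[LabeledAnswer]) -> Counter:
    """Count how often each failure mode occurs across an eval run."""
    return Counter(classify_failure(a) for a in answers)


if __name__ == "__main__":
    # Illustrative, made-up labels; not results from any benchmark.
    sample = [
        LabeledAnswer(True, False, True, "doc-1", "doc-1"),
        LabeledAnswer(True, False, True, "doc-2", "doc-1"),
        LabeledAnswer(False, False, False, None, None),
    ]
    for failure, count in failure_counts(sample).items():
        print(f"{failure.value}: {count}")
```

Reporting a count per failure mode, rather than a single accuracy number, keeps the distinction between retrieval problems (not enough evidence) and generation problems (fabrication, misreading, mis-citation) visible in the results.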