# LLM Evaluation Refusal and Overrefusal Testing Status: public Confidence: medium (0.865) (verified) Last verified: 2026-06-02 Generation: ai_structured ## TL;DR Refusal testing measures whether a model declines harmful requests while avoiding overrefusal of benign requests that only look risky. ## Core Explanation Safety evaluation is not complete if it only checks whether a model refuses disallowed prompts. A useful evaluation also asks whether the model can comply with benign transformation, education, debugging, or safety-context requests that share surface words with harmful categories. For agents, refusal and overrefusal tests should be tied to policy labels, tool permissions, and safe-completion expectations. A safe agent may refuse a destructive tool call while still offering benign explanation, diagnosis, or a lower-risk alternative. ## Source-Mapped Facts - The OpenAI Model Spec says models should obey user and developer instructions except when those instructions fall into categories that require refusal or safe completion. ([source](https://model-spec.openai.com/2025-12-18.html)) - The XSTest paper presents a test suite for identifying exaggerated safety behaviors in large language models. ([source](https://aclanthology.org/2024.naacl-long.301/)) - The SORRY-Bench paper presents a benchmark for evaluating large language model safety refusal behaviors. ([source](https://arxiv.org/abs/2406.14598)) ## Further Reading - [OpenAI Model Spec](https://model-spec.openai.com/2025-12-18.html) - [XSTest Paper](https://aclanthology.org/2024.naacl-long.301/) - [SORRY-Bench Paper](https://arxiv.org/abs/2406.14598)