LLM Evaluation Refusal and Overrefusal Testing
Status: public · Confidence: medium (0.865) · Basis: verified_sources
## TL;DR Refusal testing measures whether a model declines harmful requests while avoiding overrefusal of benign requests that only look risky. ## Core Explanation Safety evaluation is not complete if it only checks whether a model refuses disallowed prompts. A useful evaluation also asks whether the model can comply with benign transformation, education, debugging, or safety-context requests that share surface words with harmful categories. For agents, refusal and overrefusal tests should be tied to policy labels, tool permissions, and safe-completion expectations. A safe agent may refuse a destructive tool call while still offering benign explanation, diagnosis, or a lower-risk alternative. ## Source-Mapped Facts - The OpenAI Model Spec says models should obey user and developer instructions except when those instructions fall into categories that require refusal or safe completion. ([source](https://model-spec.openai.com/2025-12-18.html)) - The XSTest paper presents a test suite for identifying exaggerated safety behaviors in large language models. ([source](https://aclanthology.org/2024.naacl-long.301/)) - The SORRY-Bench paper presents a benchmark for evaluating large language model safety refusal behaviors. ([source](https://arxiv.org/abs/2406.14598)) ## Further Reading - [OpenAI Model Spec](https://model-spec.openai.com/2025-12-18.html) - [XSTest Paper](https://aclanthology.org/2024.naacl-long.301/) - [SORRY-Bench Paper](https://arxiv.org/abs/2406.14598)