LLM Evaluation Refusal and Overrefusal Testing

Status: public · Confidence: medium (0.865) · Basis: verified_sources

## TL;DR

Refusal testing measures whether a model declines harmful requests while avoiding overrefusal of benign requests that only look risky.

## Core Explanation

Safety evaluation is not complete if it only checks whether a model refuses disallowed prompts. A useful evaluation also asks whether the model can comply with benign transformation, education, debugging, or safety-context requests that share surface words with harmful categories.

For agents, refusal and overrefusal tests should be tied to policy labels, tool permissions, and safe-completion expectations. A safe agent may refuse a destructive tool call while still offering benign explanation, diagnosis, or a lower-risk alternative.

## Source-Mapped Facts

- The OpenAI Model Spec says models should obey user and developer instructions except when those instructions fall into categories that require refusal or safe completion. ([source](https://model-spec.openai.com/2025-12-18.html))
- The XSTest paper presents a test suite for identifying exaggerated safety behaviors in large language models. ([source](https://aclanthology.org/2024.naacl-long.301/))
- The SORRY-Bench paper presents a benchmark for evaluating large language model safety refusal behaviors. ([source](https://arxiv.org/abs/2406.14598))

## Further Reading

- [OpenAI Model Spec](https://model-spec.openai.com/2025-12-18.html)
- [XSTest Paper](https://aclanthology.org/2024.naacl-long.301/)
- [SORRY-Bench Paper](https://arxiv.org/abs/2406.14598)