# LLM Evaluation: Exact Match, Fuzzy Match, and Code Graders

- Status: public
- Confidence: medium (0.685) (verified)
- Last verified: 2026-06-03
- Generation: ai_structured

## TL;DR

LLM evals need grader types that match the task: exact string checks for narrow answers, fuzzy checks for acceptable variants, and code graders for executable behavior.

## Core Explanation

Evaluation rows often look similar but need different scoring rules. A date, enum, or API field name may require exact or normalized matching. A short free-text answer may need fuzzy matching or text similarity. A generated function may need a code grader that runs tests inside a controlled environment.

Agents should record the grader type, reference answer, normalization rules, threshold, sandbox permissions, expected output schema, and failure examples. A score without grader metadata is hard to reproduce and can hide whether the model failed at reasoning, formatting, parsing, or execution.

## Source-Mapped Facts

- OpenAI graders documentation says graders compare reference answers with model-generated answers and return a grade from 0 to 1. ([source](https://platform.openai.com/docs/guides/graders/))
- OpenAI graders documentation lists grader types including string check, text similarity, score model, and Python code execution. ([source](https://platform.openai.com/docs/guides/graders/))
- The OpenAI Evals template documentation includes basic templates such as match, includes, fuzzy match, and JSON match. ([source](https://github.com/openai/evals/blob/main/docs/eval-templates.md))

## Further Reading

- [OpenAI Graders](https://platform.openai.com/docs/guides/graders/)
- [OpenAI Evals Templates](https://github.com/openai/evals/blob/main/docs/eval-templates.md)
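
## Example Sketch

The three grader styles described in the Core Explanation can be sketched in plain Python. This is a minimal illustration, not the OpenAI graders API or the OpenAI Evals templates. The function names (`normalize`, `exact_match`, `fuzzy_match`, `code_grade`), the normalization rule, the 0.8 threshold, and the use of `difflib` as a stand-in for text similarity are all assumptions made for the example. The code grader runs tests in a child process with a timeout, which only approximates a controlled environment and should not be treated as a real sandbox.

```python
import difflib
import subprocess
import sys
import tempfile
from pathlib import Path


def normalize(text: str) -> str:
    """Assumed normalization rule: trim, lowercase, collapse whitespace."""
    return " ".join(text.strip().lower().split())


def exact_match(reference: str, answer: str, normalized: bool = True) -> float:
    """Return 1.0 when the answer equals the reference, else 0.0."""
    if normalized:
        reference, answer = normalize(reference), normalize(answer)
    return 1.0 if reference == answer else 0.0


def fuzzy_match(reference: str, answer: str, threshold: float = 0.8) -> float:
    """Return 1.0 when the similarity ratio meets the threshold, else 0.0.

    difflib's ratio is a stand-in for whatever text-similarity
    metric a real grader would use.
    """
    matcher = difflib.SequenceMatcher(None, normalize(reference), normalize(answer))
    return 1.0 if matcher.ratio() >= threshold else 0.0


def code_grade(candidate_source: str, test_source: str, timeout_s: float = 10.0) -> float:
    """Run candidate code followed by assert-based tests in a child process.

    Returns 1.0 if the script exits with code 0, else 0.0.
    A subprocess with a timeout is NOT an isolation boundary; untrusted
    code needs a proper sandbox with restricted permissions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_source + "\n\n" + test_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    # Narrow answer: a date compared after normalization.
    print(exact_match("2026-06-03", " 2026-06-03 "))
    # Short free-text answer: small surface differences are tolerated.
    print(fuzzy_match("the capital is Paris", "The capital is paris."))
    # Generated function: graded by whether its tests pass.
    print(code_grade("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))
```

Each function returns a grade in the 0 to 1 range so results from different grader types can be stored in the same column. In practice the grader type, normalization rule, threshold, and timeout used for a row would be recorded next to the score so the result can be reproduced.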