LLM Evaluation: Exact Match, Fuzzy Match, and Code Graders

Status: public · Confidence: medium (0.685) · Basis: verified_sources

## TL;DR

LLM evals need grader types that match the task: exact string checks for narrow answers, fuzzy checks for acceptable variants, and code graders for executable behavior.

## Core Explanation

Evaluation rows often look similar but need different scoring rules. A date, enum, or API field name may require exact or normalized matching. A short free-text answer may need fuzzy matching or text similarity. A generated function may need a code grader that runs tests inside a controlled environment.
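The three scoring rules above can be sketched with the Python standard library alone. This is a minimal illustration, not any framework's implementation: the function names, the normalization steps (NFKC, case folding, whitespace collapsing), the `0.8` threshold, and the use of `difflib` as the similarity measure are all assumptions made for the example. The code grader's "controlled environment" here is only a temporary directory, an isolated child interpreter, and a timeout, which is not a security boundary.

```python
import difflib
import subprocess
import sys
import tempfile
import unicodedata
from pathlib import Path


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFKC, case folding, collapsed whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def exact_match(answer: str, reference: str, normalized: bool = True) -> float:
    """Return 1.0 when the strings are equal (optionally after normalizing)."""
    if normalized:
        answer, reference = normalize(answer), normalize(reference)
    return 1.0 if answer == reference else 0.0


def fuzzy_match(answer: str, reference: str, threshold: float = 0.8) -> float:
    """Return 1.0 when the difflib similarity ratio reaches the threshold."""
    ratio = difflib.SequenceMatcher(
        None, normalize(answer), normalize(reference)
    ).ratio()
    return 1.0 if ratio >= threshold else 0.0


def code_grade(candidate: str, tests: str, timeout_s: float = 5.0) -> float:
    """Run candidate code plus assert-based tests in a child interpreter.

    A temp directory and a timeout are NOT a sandbox; untrusted code needs
    container-level or VM-level isolation in a real pipeline.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "graded.py"
        script.write_text(candidate + "\n\n" + tests, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # treat a hang as a failed grade
    return 1.0 if result.returncode == 0 else 0.0
```

With these definitions, `exact_match(" 2024-01-05 ", "2024-01-05")` passes only because normalization strips the padding, while `exact_match("ACTIVE", "active", normalized=False)` fails. That difference is why the normalization rules belong in the grader record alongside the score.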

Agents should record the grader type, reference answer, normalization rules, threshold, sandbox permissions, expected output schema, and failure examples. A score without grader metadata is hard to reproduce and can hide whether the model failed at reasoning, formatting, parsing, or execution.
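One way to keep that metadata next to each score is a small serializable record. The sketch below is hypothetical: the class name `GraderRecord`, the field types, and the JSON layout are assumptions chosen for illustration, with one field per item in the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class GraderRecord:
    """Hypothetical record of how a score was produced."""

    grader_type: str                      # e.g. "exact", "fuzzy", "code"
    reference_answer: str
    normalization_rules: list[str] = field(default_factory=list)
    threshold: Optional[float] = None     # only meaningful for fuzzy graders
    sandbox_permissions: list[str] = field(default_factory=list)
    expected_output_schema: Optional[dict] = None
    failure_examples: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so records diff cleanly between runs."""
        return json.dumps(asdict(self), sort_keys=True)


record = GraderRecord(
    grader_type="fuzzy",
    reference_answer="Paris",
    normalization_rules=["nfkc", "casefold", "collapse_whitespace"],
    threshold=0.8,
    failure_examples=["Lyon"],
)
print(record.to_json())
```

Storing the record with every result makes it possible to tell a threshold change from a model regression when two runs disagree.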

## Source-Mapped Facts

- The OpenAI graders documentation says graders compare reference answers with model-generated answers and return a grade from 0 to 1. ([source](https://platform.openai.com/docs/guides/graders/))
- The OpenAI graders documentation lists grader types including string check, text similarity, score model, and Python code execution. ([source](https://platform.openai.com/docs/guides/graders/))
- The OpenAI Evals template documentation includes basic templates such as match, includes, fuzzy match, and JSON match. ([source](https://github.com/openai/evals/blob/main/docs/eval-templates.md))

## Further Reading

- [OpenAI Graders](https://platform.openai.com/docs/guides/graders/)
- [OpenAI Evals Templates](https://github.com/openai/evals/blob/main/docs/eval-templates.md)