# Evaluation Rubrics and Grader Design

Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured

## TL;DR

Evaluation rubrics define what good output means. Grader design turns that rubric into human, code, model-graded, or pairwise measurements that can be run repeatedly.

## Core Explanation

LLM evaluation often fails when the rubric is vague or the grader measures what is easy instead of what matters. Strong rubrics separate factual correctness, task completion, citation quality, safety, formatting, and user-impact criteria. Strong graders include reference examples, deterministic checks where possible, model-graded judgments only where needed, and human review for calibration.

## Source-Mapped Facts

- OpenAI graders documentation says graders compare reference answers to model-generated answers and return a grade in the range from 0 to 1. ([source](https://platform.openai.com/docs/guides/graders/))
- LangSmith evaluation documentation says evaluators score application performance and can be attached to tracing projects or datasets. ([source](https://docs.langchain.com/langsmith/evaluation-concepts))
- LangSmith evaluation documentation lists human, code, LLM-as-judge, and pairwise approaches as supported evaluation techniques. ([source](https://docs.langchain.com/langsmith/evaluation-concepts))

## Further Reading

- [OpenAI graders](https://platform.openai.com/docs/guides/graders/)
- [LangSmith evaluation concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [HELM latest results](https://crfm.stanford.edu/helm/latest/)
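
## Illustrative Sketch

The idea of separating rubric criteria and preferring deterministic checks can be sketched in plain Python. This is a minimal illustration, not the OpenAI or LangSmith API: the names `Criterion`, `grade`, and `RUBRIC`, the three checks, the weights, and the `example.com` URL are all hypothetical. It covers only code-based checks and returns a weighted score in the range 0 to 1; model-graded judgments and human calibration are out of scope here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name, a weight, and a deterministic check."""
    name: str
    weight: float
    # check(output, reference) returns a score between 0.0 and 1.0
    check: Callable[[str, str], float]


def contains_reference(output: str, reference: str) -> float:
    # Crude factual-correctness proxy: the reference answer appears in the output.
    return 1.0 if reference.strip().lower() in output.lower() else 0.0


def has_citation(output: str, _reference: str) -> float:
    # Citation proxy: at least one markdown link with an http(s) URL.
    return 1.0 if re.search(r"\[[^\]]+\]\(https?://[^)\s]+\)", output) else 0.0


def within_length(max_words: int) -> Callable[[str, str], float]:
    # Formatting proxy: the output stays under a word budget.
    def check(output: str, _reference: str) -> float:
        return 1.0 if len(output.split()) <= max_words else 0.0
    return check


def grade(output: str, reference: str,
          rubric: List[Criterion]) -> Tuple[float, Dict[str, float]]:
    """Return (weighted score in [0, 1], per-criterion scores)."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    # Clamp each check so a misbehaving check cannot push the total out of range.
    detail = {c.name: min(1.0, max(0.0, c.check(output, reference)))
              for c in rubric}
    score = sum(c.weight * detail[c.name] for c in rubric) / total_weight
    return score, detail


# Hypothetical rubric; the weights are illustrative, not recommended values.
RUBRIC = [
    Criterion("factual_correctness", 0.5, contains_reference),
    Criterion("citation_quality", 0.25, has_citation),
    Criterion("formatting", 0.25, within_length(50)),
]

if __name__ == "__main__":
    answer = "The capital of France is Paris ([source](https://example.com/france))."
    print(grade(answer, "Paris", RUBRIC))
```

Returning the per-criterion scores alongside the total keeps the criteria separate, so a low overall score can be traced to the criterion that failed. A model-graded or human check could be added as another `Criterion` with the same signature, assuming it is normalized to the same 0 to 1 range.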