LLM Evaluation Judge Prompt Rubrics and Scorecards

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Judge prompts, rubrics, and scorecards turn subjective LLM quality judgments into repeatable evaluation artifacts that can be reviewed and versioned.

## Core Explanation

An LLM-as-judge setup is only as useful as its rubric. The judge needs explicit criteria, allowed score labels, examples of borderline cases, evidence requirements, and a method for handling invalid or ungrounded responses.
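
One way to make those rubric parts concrete is to keep them in a small data structure and validate every judge reply against it. The sketch below is illustrative only: `Rubric`, `build_judge_prompt`, and `parse_verdict` are hypothetical names, the JSON reply format is an assumption, and the grounding check (evidence must be an exact quote from the answer) is a deliberately simple stand-in for whatever evidence rule a real rubric defines. It does not call any model or vendor API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric container; field names are illustrative."""
    criterion: str
    labels: tuple[str, ...]
    borderline_examples: tuple[tuple[str, str], ...] = ()  # (answer, label)
    require_evidence: bool = True


def build_judge_prompt(rubric: Rubric, question: str, answer: str) -> str:
    """Render the rubric into a prompt that asks for a JSON verdict."""
    examples = "\n".join(
        f"- Answer: {text!r} -> {label}"
        for text, label in rubric.borderline_examples
    )
    return (
        f"Criterion: {rubric.criterion}\n"
        f"Allowed labels: {', '.join(rubric.labels)}\n"
        f"Borderline examples:\n{examples or '- (none)'}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON: {"label": "<label>", '
        '"evidence": "<exact quote from the answer>"}'
    )


def parse_verdict(rubric: Rubric, answer: str, raw: str) -> dict:
    """Validate a raw judge reply instead of trusting it blindly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"label": "invalid", "reason": "not valid JSON"}
    if not isinstance(data, dict) or data.get("label") not in rubric.labels:
        return {"label": "invalid", "reason": "label outside the allowed set"}
    evidence = data.get("evidence")
    if rubric.require_evidence:
        if not isinstance(evidence, str) or not evidence.strip():
            return {"label": "invalid", "reason": "missing evidence"}
        # Assumed grounding rule: the evidence must appear verbatim in the answer.
        if evidence not in answer:
            return {"label": "ungrounded", "reason": "evidence is not a quote"}
    return {"label": data["label"], "evidence": evidence}
```

Keeping `invalid` and `ungrounded` as separate outcomes, rather than mapping them to the lowest score, lets a scorecard report judge failures apart from genuine low-quality answers.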

Agents should inspect the judge prompt, dataset rows, grader type, scoring scale, evaluator version, calibration set, disagreement rate, and human override process before treating a scorecard as a production gate.
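
The checklist above can be partly automated. The following is a minimal sketch, assuming the judge and human reviewers have labeled the same calibration items: it computes the disagreement rate as the fraction of items where the two labels differ, then lists reasons a scorecard should not yet gate production. `ScorecardMeta`, `gate_blockers`, and the thresholds (10% disagreement, 50 calibration items) are hypothetical placeholders, not values taken from any of the cited tools.

```python
from dataclasses import dataclass


def disagreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of calibration items where judge and human labels differ."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must cover the same items")
    if not judge_labels:
        raise ValueError("calibration set is empty")
    mismatches = sum(j != h for j, h in zip(judge_labels, human_labels))
    return mismatches / len(judge_labels)


@dataclass
class ScorecardMeta:
    """Hypothetical metadata recorded next to a scorecard."""
    judge_prompt_version: str
    evaluator_version: str
    grader_type: str
    scoring_scale: tuple[str, ...]
    calibration_size: int
    disagreement: float
    human_override_documented: bool


def gate_blockers(
    meta: ScorecardMeta,
    max_disagreement: float = 0.10,  # assumed threshold, tune per project
    min_calibration: int = 50,       # assumed threshold, tune per project
) -> list[str]:
    """Return reasons the scorecard should not gate production yet."""
    blockers = []
    if not meta.judge_prompt_version or not meta.evaluator_version:
        blockers.append("judge prompt or evaluator is unversioned")
    if len(meta.scoring_scale) < 2:
        blockers.append("scoring scale has fewer than two labels")
    if meta.calibration_size < min_calibration:
        blockers.append("calibration set is too small")
    if meta.disagreement > max_disagreement:
        blockers.append("judge disagrees with humans too often")
    if not meta.human_override_documented:
        blockers.append("no documented human override process")
    return blockers
```

An empty list from `gate_blockers` means only that these mechanical checks passed; reviewing the judge prompt and the dataset rows themselves remains a manual step.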

## Source-Mapped Facts

- OpenAI Evals documentation describes testing_criteria as defining how to decide whether model output satisfies requirements for each dataset item. ([source](https://developers.openai.com/api/docs/guides/evals))
- LangSmith documentation describes evaluation as a way to assess application performance using datasets and evaluators. ([source](https://docs.langchain.com/langsmith/evaluation))
- Braintrust documentation describes evaluations as running an AI application against test data and scoring the results with scorers. ([source](https://www.braintrust.dev/docs/evaluate))

## Further Reading

- [OpenAI Working with Evals](https://developers.openai.com/api/docs/guides/evals)
- [OpenAI Graders](https://developers.openai.com/api/docs/guides/graders)
- [Braintrust Evaluate Systematically](https://www.braintrust.dev/docs/evaluate)