# LLM Evaluation Metric Templates and Scorecards

Status: public · Confidence: medium (0.89) · Basis: verified_sources

## TL;DR

Metric templates and scorecards let agents evaluate LLM systems along several failure modes at once, instead of relying on a single aggregate score that can hide which dimension regressed.

## Core Explanation

LLM applications can fail through incorrect answers, poor grounding, irrelevant retrieval, missing context, unsafe behavior, or format drift. A scorecard reports each of these dimensions separately, so an agent can identify which part of the system regressed.
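As a minimal sketch of that idea, the snippet below models a scorecard as one score per failure mode and compares two runs dimension by dimension. The dimension names, the `Scorecard` class, the `regressed_dimensions` helper, the 0-to-1 score scale, and the `tolerance` value are all illustrative assumptions, not part of any cited framework.

```python
from dataclasses import dataclass

# Hypothetical dimension names, one per failure mode listed above.
DIMENSIONS = (
    "correctness",
    "grounding",
    "retrieval_relevance",
    "context_coverage",
    "safety",
    "format_adherence",
)


@dataclass(frozen=True)
class Scorecard:
    """One score per dimension; assumed scale is 0 to 1, higher is better."""
    scores: dict[str, float]

    def mean(self) -> float:
        # The single aggregate number that a scorecard is meant to supplement.
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def regressed_dimensions(
    baseline: Scorecard, candidate: Scorecard, tolerance: float = 0.02
) -> list[str]:
    """Return dimensions where the candidate fell more than `tolerance` below baseline."""
    return [
        d for d in DIMENSIONS
        if baseline.scores[d] - candidate.scores[d] > tolerance
    ]


# Illustrative numbers only: correctness improved, grounding got worse.
baseline = Scorecard({d: 0.90 for d in DIMENSIONS})
candidate = Scorecard({**baseline.scores, "correctness": 1.00, "grounding": 0.80})

print(round(baseline.mean(), 3), round(candidate.mean(), 3))
print(regressed_dimensions(baseline, candidate))
```

In this made-up example the two means are practically identical, yet the per-dimension comparison flags `grounding`. An aggregate score alone would lose exactly this kind of signal.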

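Agents should record the dataset, metric template, grader, model version, and threshold used for each evaluation run. With these recorded, results remain comparable over time, and a change in score can be attributed to the system under test and not to a silently changed dataset or grader.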
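One way to keep these fields together is a small immutable run record, sketched below. The `EvalRun` class, its field values, and the `comparable` rule (same dataset, metric template, grader, and threshold, with only the model version allowed to differ) are assumptions made for illustration; a real pipeline may define comparability differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EvalRun:
    """Metadata and result for a single evaluation run (hypothetical schema)."""
    dataset: str
    metric_template: str
    grader: str
    model_version: str
    threshold: float
    score: float

    def passed(self) -> bool:
        # A run passes when its score meets the recorded threshold.
        return self.score >= self.threshold


def comparable(a: EvalRun, b: EvalRun) -> bool:
    """Assumed rule: runs are comparable when only the model version may differ."""
    return (a.dataset, a.metric_template, a.grader, a.threshold) == (
        b.dataset, b.metric_template, b.grader, b.threshold
    )


# All identifiers below are placeholders, not real datasets, graders, or models.
run_a = EvalRun(
    dataset="support-qa-v3",
    metric_template="groundedness-rubric-v1",
    grader="judge-model-2024-06",
    model_version="assistant-1.4",
    threshold=0.85,
    score=0.88,
)
run_b = replace(run_a, model_version="assistant-1.5", score=0.83)
run_c = replace(run_b, grader="judge-model-2024-09")

print(comparable(run_a, run_b))  # same setup, new model version
print(comparable(run_a, run_c))  # grader changed, so scores are not like-for-like
print(run_a.passed(), run_b.passed())
```

Freezing the dataclass keeps a stored run from being edited after the fact, which is a design choice of this sketch and not a requirement from the sources.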

## Source-Mapped Facts

- Google Vertex AI documentation describes adaptive and static rubric metrics for Gen AI evaluation. ([source](https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/determine-eval))
- The HELM (Holistic Evaluation of Language Models) paper adopts a multi-metric approach for language model evaluation. ([source](https://doi.org/10.1111/nyas.15007))
- Google Vertex AI documentation provides details for managed rubric-based metrics offered by the Gen AI evaluation service. ([source](https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/rubric-metric-details))

## Further Reading

- [Vertex AI Define Evaluation Metrics](https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/determine-eval)
- [Holistic Evaluation of Language Models](https://doi.org/10.1111/nyas.15007)
- [Vertex AI Managed Rubric Metrics](https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/rubric-metric-details)