# LLM Evaluation Production Canaries

- Status: public
- Confidence: medium (0.725) (verified)
- Last verified: 2026-06-02
- Generation: ai_structured

## TL;DR

LLM production canaries combine staged rollout with live evaluation signals before a new prompt, model, or agent policy reaches all traffic.

## Core Explanation

Offline evaluation catches many regressions, but production traffic can expose different distributions, latency patterns, and user behaviors. A canary rollout gives a new LLM system a controlled slice of traffic while monitoring quality, safety, cost, and operational metrics.

Agents should treat canary results as release evidence only when the rollout percentage, evaluation dataset or traffic slice, metrics, and rollback thresholds are explicit. Without those details, "canary passed" is too vague for a production decision.

## Source-Mapped Facts

- LangSmith documentation describes `evaluate` as a way to run an application or model on a dataset and collect feedback. ([source](https://docs.langchain.com/langsmith/evaluation))
- Arize AX documentation describes human review workflows for evaluating model outputs. ([source](https://arize.com/docs/ax/evaluate))
- Argo Rollouts documentation describes canary deployment as a strategy that gradually shifts traffic to a new version. ([source](https://argoproj.github.io/argo-rollouts/features/canary/))

## Further Reading

- [LangSmith Evaluation](https://docs.langchain.com/langsmith/evaluation)
- [Arize AX Evaluate](https://arize.com/docs/ax/evaluate)
- [Argo Rollouts Canary](https://argoproj.github.io/argo-rollouts/features/canary/)
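
## Illustrative Sketch

The Core Explanation says a canary result counts as release evidence only when the rollout percentage, the traffic slice, the metrics, and the rollback thresholds are explicit. One way to make that rule concrete is a small gate that refuses to decide when any of those details is missing. The sketch below is only an illustration under stated assumptions: `Threshold`, `CanaryPlan`, `evaluate_canary`, the metric names, and the numeric limits are all hypothetical and do not come from LangSmith, Arize AX, or Argo Rollouts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Rollback rule for one metric (hypothetical structure)."""
    metric: str
    limit: float
    higher_is_better: bool

    def violated_by(self, value: float) -> bool:
        # Quality-style metrics fail when they fall below the limit;
        # cost- or latency-style metrics fail when they rise above it.
        return value < self.limit if self.higher_is_better else value > self.limit


@dataclass(frozen=True)
class CanaryPlan:
    """The details that must be explicit before a result counts as evidence."""
    rollout_percent: float            # share of traffic sent to the candidate
    traffic_slice: str                # dataset or traffic slice being observed
    thresholds: tuple[Threshold, ...]

    def is_explicit(self) -> bool:
        return (
            0 < self.rollout_percent <= 100
            and bool(self.traffic_slice.strip())
            and len(self.thresholds) > 0
        )


def evaluate_canary(plan: CanaryPlan, observed: dict[str, float]) -> tuple[str, list[str]]:
    """Return (decision, reasons); decision is 'promote', 'rollback',
    or 'insufficient_evidence'."""
    if not plan.is_explicit():
        return "insufficient_evidence", [
            "plan lacks a rollout percentage, traffic slice, or thresholds"
        ]

    missing = [t.metric for t in plan.thresholds if t.metric not in observed]
    if missing:
        return "insufficient_evidence", [f"{m}: no observation" for m in missing]

    violations = [
        f"{t.metric}: observed {observed[t.metric]} violates limit {t.limit}"
        for t in plan.thresholds
        if t.violated_by(observed[t.metric])
    ]
    return ("rollback", violations) if violations else ("promote", [])


# Example values are invented for illustration only.
plan = CanaryPlan(
    rollout_percent=5.0,
    traffic_slice="5% of support-chat sessions",
    thresholds=(
        Threshold("quality_score", 0.80, higher_is_better=True),
        Threshold("p95_latency_s", 4.0, higher_is_better=False),
    ),
)
decision, reasons = evaluate_canary(
    plan, {"quality_score": 0.76, "p95_latency_s": 3.1}
)
print(decision, reasons)
```

Returning `insufficient_evidence` as its own outcome, rather than folding it into a pass or a fail, keeps an underspecified "canary passed" from being treated as a production decision. In a real rollout the observed values would come from whatever evaluation and monitoring tooling is in use.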