# Constitutional AI

Confidence: high

Last verified: 2026-05-22

Generation: human_only

## TL;DR

Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful and harmless using explicit written principles (a "constitution") rather than implicit human preferences. First published in December 2022 (arXiv:2212.08073) and detailed in May 2023, CAI trains models to critique and revise their own responses against constitutional principles, then uses AI-generated feedback instead of human feedback for reinforcement learning. Anthropic reports that this yields a Pareto improvement, with models that are both more helpful and more harmless, whereas traditional RLHF typically trades one for the other. Claude is the first model trained entirely with CAI, meaning its harmlessness training uses AI supervision rather than human harmlessness labels.

## Core Explanation

Traditional RLHF relies on human labelers to rank model outputs, which has fundamental limitations:

- **Scalability**: Human labeling is expensive and slow
- **Consistency**: Different labelers apply different standards
- **Harm exposure**: Humans must read toxic outputs to label them
- **Opacity**: The resulting values are implicit in the data, not explicitly defined

Constitutional AI replaces the human feedback loop with a constitution: a set of explicit principles that an AI uses to evaluate its own outputs. The training process has two stages:

### Stage 1: Supervised Learning (SL)

The model is trained to **critique and revise its own responses** using randomly sampled constitutional principles. For each training example, the model generates a response, critiques it against a specific principle, and produces a revised version. The revised responses then serve as supervised fine-tuning targets. A small set of human-written examples bootstraps this process; after that, the model self-supervises.

### Stage 2: Reinforcement Learning (RL)

Instead of human preference labels, **AI feedback** is used. For each prompt, the model generates two responses. An AI evaluator (using the same constitution) judges which is better.
This AI preference data trains a reward model, which guides PPO-based RL fine-tuning. The result: the model internalizes the constitution's values through reinforcement, without humans ever having to see harmful outputs.

## The Constitution's Six Sources

Anthropic's constitution draws from six categories of principles (updated January 2026):

1. **UN Universal Declaration of Human Rights**: Freedom, equality, dignity, anti-discrimination
2. **Apple's Terms of Service (inspired)**: Accuracy, honesty, avoiding harmful/deceptive content
3. **Non-Western Perspectives**: Principles designed to avoid offense to non-Western, non-industrialized, and non-capitalist audiences
4. **DeepMind Sparrow Rules**: Avoiding stereotypes, threats, harassment, medical/legal advice, conspiracy theories
5. **Anthropic Research Set 1**: Child-safety, politeness, respect, non-judgmental tone
6. **Anthropic Research Set 2 (Existential Risk)**: Humility, human control, avoiding self-preservation or power-seeking behavior

## CAI vs. RLHF

| Dimension | RLHF | Constitutional AI |
| ------------------------ | ----------------------------------------------------------- | -------------------------------------------------- |
| Feedback source | Human labelers | AI evaluator (guided by constitution) |
| Values | Implicit in human preference data | Explicit in written principles |
| Scalability | Limited by human labeling capacity | Highly scalable (AI supervises AI) |
| Harm exposure | Humans exposed to toxic content during labeling | No human exposure to harmful outputs |
| Helpfulness/harmlessness | Trade-off (helpfulness decreases as harmlessness increases) | Pareto improvement (both increase simultaneously) |
| Transparency | Difficult to audit implicit preferences | Principles can be inspected, debated, and modified |
| Behavior adjustment | Requires new human labeling campaign | Add or modify a constitutional principle |

## Advantages and Limitations

**Advantages**:

- **Scalable oversight**: AI supervision replaces human supervision (Anthropic calls this "a success story for scalable oversight")
- **Auditable values**: Anyone can read the constitution and understand what the model was trained to value
- **Rapid correction**: If Claude exhibits unwanted behavior, Anthropic can write a new principle and retrain
- **Proportional response**: Principles include "proportionality" language to prevent excessive moralizing

**Limitations**:

- **Constitution authoring is itself value-laden**: Who writes the principles?
  Anthropic acknowledges this and has stated goals toward more democratic constitution design
- **Overly specific principles can reduce generalization**: Longer, more detailed principles may produce worse results than concise, general ones
- **AI evaluator bias**: The evaluating model may have its own blind spots or biases

## Current and Future Directions

- **Claude** is trained entirely with CAI: all harmlessness improvements come from AI supervision, not human harmlessness data
- Anthropic is exploring **democratic constitution design**, where diverse stakeholders contribute to the principles
- Future plans include **customizable constitutions** for specific use cases (e.g., a medical AI with a Hippocratic oath-inspired constitution)
- Anthropic states that the current constitution is not final and treats it as a living document

## Further Reading

- [Claude's Constitution (Anthropic)](https://www.anthropic.com/news/claudes-constitution): Official description of CAI principles
- [CAI Paper (arXiv 2212.08073)](https://arxiv.org/abs/2212.08073): Original research paper by Bai et al.
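
## Illustrative Sketch

The two training stages described under Core Explanation can be sketched in a few lines of Python. This is a minimal illustration and not Anthropic's implementation: the `Model` type, the prompt templates, the two entries in `PRINCIPLES`, and every function name are hypothetical, and the pairwise logistic loss in `preference_loss` is a commonly used reward-model objective assumed here for illustration, since the text above does not specify one.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stand-in for a language model: prompt text in, generated text out.
Model = Callable[[str], str]

# Illustrative principles only; these are not quotations from any real constitution.
PRINCIPLES = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and accurate.",
]


def critique_and_revise(model: Model, prompt: str, rng: random.Random) -> Tuple[str, str]:
    """Stage 1 (SL): draft, critique against one sampled principle, then revise."""
    principle = rng.choice(PRINCIPLES)
    draft = model(prompt)
    critique = model(f"Critique this response using the principle '{principle}':\n{draft}")
    revision = model(
        f"Rewrite the response to address the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pair becomes a supervised fine-tuning example.
    return prompt, revision


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def label_preference(model: Model, judge: Model, prompt: str, rng: random.Random) -> PreferencePair:
    """Stage 2 (RL): sample two responses and let an AI evaluator pick the better one."""
    principle = rng.choice(PRINCIPLES)
    a, b = model(prompt), model(prompt)
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\nAnswer A or B:"
    )
    if verdict.strip().upper().startswith("A"):
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise logistic loss, -log(sigmoid(r_chosen - r_rejected)); an assumed objective."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stubs so the sketch runs without a real model.
    toy_model: Model = lambda p: "revised answer" if p.startswith("Rewrite") else "draft answer"
    toy_judge: Model = lambda p: "A"
    print(critique_and_revise(toy_model, "How do I stay safe online?", rng))
    print(label_preference(toy_model, toy_judge, "How do I stay safe online?", rng))
```

In a real pipeline the stubs would be replaced by calls to an actual model, the Stage 1 pairs would feed supervised fine-tuning, and the Stage 2 pairs would train the reward model that guides RL. Sampling one principle per example mirrors the "randomly sampled constitutional principles" wording above.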