Constitutional AI
Status: public · Confidence: medium (0.74) · Basis: verified_sources
## TL;DR

Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful and harmless using explicit written principles (a "constitution") rather than relying only on implicit human preferences. First published in December 2022 (arXiv:2212.08073), CAI trains models to critique and revise their own outputs against constitutional principles, then uses AI-generated feedback during reinforcement learning.

## Core Explanation

Traditional RLHF relies on human labelers to rank model outputs, which has several limitations:

- **Scalability**: Human labeling is expensive and slow
- **Consistency**: Different labelers apply different standards
- **Harm exposure**: Humans must read toxic outputs to label them
- **Opacity**: The resulting values are implicit in the data, not explicitly defined

Constitutional AI replaces the human harmlessness feedback loop with a constitution: a set of explicit principles that an AI uses to evaluate its own outputs. In the original paper, helpfulness preferences still come from human labels; only the harmlessness labels are AI-generated. The training process has two stages:

### Stage 1: Supervised Learning (SL)

The model is trained to **critique and revise its own responses** using randomly sampled constitutional principles. For each training example, the model generates a response, critiques it against a specific principle, and produces a revised version. A small set of human-written examples bootstraps this process; after that, the model self-supervises, and the revised responses become the supervised fine-tuning data.

### Stage 2: Reinforcement Learning (RL)

Instead of human preference labels, **AI feedback** is used. For each prompt, the model generates two responses. An AI evaluator (using the same constitution) judges which is better. This AI preference data trains a reward model, which guides PPO-based RL fine-tuning.

The result: the model internalizes the constitution's values through reinforcement, without human labelers having to review harmful outputs to produce harmlessness labels.

## The Constitution's Six Sources

Anthropic's constitution draws from six categories of principles (updated January 2026):

1. **UN Universal Declaration of Human Rights**: Freedom, equality, dignity, anti-discrimination
2. **Apple's Terms of Service (inspired)**: Accuracy, honesty, avoiding harmful/deceptive content
3. **Non-Western Perspectives**: Principles designed to avoid offense to non-Western, non-industrialized, and non-capitalist audiences
4. **DeepMind Sparrow Rules**: Avoiding stereotypes, threats, harassment, medical/legal advice, conspiracy theories
5. **Anthropic Research Set 1**: Child safety, politeness, respect, non-judgmental tone
6. **Anthropic Research Set 2 (Existential Risk)**: Humility, human control, avoiding self-preservation or power-seeking behavior

## CAI vs. RLHF

| Dimension | RLHF | Constitutional AI |
| --- | --- | --- |
| Feedback source | Human labelers | AI evaluator (guided by constitution) |
| Values | Implicit in human preference data | Explicit in written principles |
| Scalability | Limited by human labeling capacity | Highly scalable (AI supervises AI) |
| Harm exposure | Humans exposed to toxic content during labeling | Humans do not need to label harmful outputs |
| Helpfulness/harmlessness | Trade-off (helpfulness decreases as harmlessness increases) | Pareto improvement reported in the original paper (both increase together) |
| Transparency | Difficult to audit implicit preferences | Principles can be inspected, debated, and modified |
| Behavior adjustment | Requires new human labeling campaign | Add or modify a constitutional principle, then retrain |

## Advantages and Limitations

**Advantages**:

- **Scalable oversight**: AI supervision replaces human supervision (Anthropic calls this "a success story for scalable oversight")
- **Auditable values**: Anyone can read the constitution and understand what the model was trained to value
- **Rapid correction**: If Claude exhibits unwanted behavior, Anthropic can write a new principle and retrain
- **Proportional response**: Principles include "proportionality" language to prevent excessive moralizing

**Limitations**:

- **Constitution authoring is itself value-laden**: Who writes the principles? Anthropic acknowledges this and has stated goals toward more democratic constitution design
- **Overly specific principles can reduce generalization**: Longer, more detailed principles may produce worse results than concise, general ones
- **AI evaluator bias**: The evaluating model may have its own blind spots or biases

## Current and Future Directions

- **Claude** is trained with CAI: all harmlessness improvements come from AI supervision, not human harmlessness data
- Anthropic is exploring **democratic constitution design**, where diverse stakeholders contribute to the principles
- Future plans include **customizable constitutions** for specific use cases (e.g., a medical AI with a Hippocratic oath-inspired constitution)
- Anthropic explicitly describes the current constitution as not final and treats it as a living document

## Further Reading

- [Claude's Constitution (Anthropic)](https://www.anthropic.com/news/claudes-constitution): Official description of CAI principles
- [CAI Paper (arXiv 2212.08073)](https://arxiv.org/abs/2212.08073): Original research paper by Bai et al.
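
## Illustrative Sketch

The two stages described under Core Explanation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Model`, `toy_model`, `critique_and_revise`, `label_preference`, the prompt templates, and the two entries in `PRINCIPLES` are all hypothetical names and paraphrases introduced here. The sketch only produces the training data for each stage; the fine-tuning, reward model training, and PPO steps are omitted.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: prompt text in, completion out.
Model = Callable[[str], str]

# Illustrative paraphrases only, not quotations from the actual constitution.
PRINCIPLES: List[str] = [
    "Choose the response that is least harmful.",
    "Choose the response that is most respectful of human dignity.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rng: random.Random, rounds: int = 1) -> Dict[str, str]:
    """Stage 1 sketch: draft, critique against a sampled principle, revise."""
    initial = model(prompt)
    revision = initial
    for _ in range(rounds):
        principle = rng.choice(principles)  # randomly sampled principle
        critique = model(
            f"CRITIQUE against principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {revision}"
        )
        revision = model(
            f"REVISE using critique: {critique}\n"
            f"Prompt: {prompt}\nResponse: {revision}"
        )
    # The (prompt, revision) pair would become supervised fine-tuning data.
    return {"prompt": prompt, "initial": initial, "revision": revision}


def label_preference(evaluator: Model, prompt: str, response_a: str,
                     response_b: str, principle: str) -> Dict[str, str]:
    """Stage 2 sketch: an AI evaluator picks the better of two responses."""
    verdict = evaluator(
        f"COMPARE under principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\nAnswer A or B."
    )
    if verdict.strip().upper().startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        chosen, rejected = response_a, response_b
    # Records like this would train the reward model used during RL.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_model(text: str) -> str:
    """Deterministic toy so the sketch runs without a real model."""
    if text.startswith("CRITIQUE"):
        return "The response could be more respectful."
    if text.startswith("REVISE"):
        return "Here is a more respectful answer."
    if text.startswith("COMPARE"):
        return "B"
    return "Initial draft answer."


if __name__ == "__main__":
    rng = random.Random(0)
    print(critique_and_revise(toy_model, "Describe my neighbor.", PRINCIPLES, rng))
    print(label_preference(toy_model, "Describe my neighbor.",
                           "draft one", "draft two", PRINCIPLES[0]))
```

In a real pipeline, `toy_model` would be replaced by calls to an actual model. Both functions take the principle as an explicit argument, which mirrors the point made in the comparison table: changing behavior means editing the principle text rather than collecting new human labels.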