# Constitutional AI

Status: public · Confidence: medium (0.74) · Basis: verified_sources

## TL;DR

Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful and harmless using explicit written principles — a "constitution" — rather than implicit human preferences. First published in December 2022 (arXiv:2212.08073), CAI trains models to self-critique and self-correct based on constitutional principles, then uses AI-generated feedback during reinforcement learning.

## Core Explanation

Traditional RLHF relies on human labelers to rank model outputs, which has fundamental limitations:

- **Scalability**: Human labeling is expensive and slow
- **Consistency**: Different labelers apply different standards
- **Harm exposure**: Humans must read toxic outputs to label them
- **Opacity**: The resulting values are implicit in the data, not explicitly defined

Constitutional AI replaces human harmlessness labels with feedback from an AI guided by a constitution: a set of explicit principles the model uses to evaluate its own outputs. In the original paper, human feedback is still used for helpfulness. The training process has two stages:

### Stage 1: Supervised Learning (SL)

The model **critiques and revises its own responses** using randomly sampled constitutional principles. For each training prompt, it generates a response, critiques that response against one specific principle, and produces a revised version. A small set of human-written examples bootstraps this process; after that, the revised responses serve as the supervised fine-tuning data, with no further human labels.
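
The critique-and-revise loop described above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: the `PRINCIPLES` text, the prompt templates, the `critique_and_revise` helper, and the `toy_model` stand-in are all assumptions made for the example. In practice `generate` would be a call to a language model.

```python
import random
from typing import Callable, Dict, List

# Hypothetical critique requests for illustration; not quoted from
# Anthropic's constitution.
PRINCIPLES: List[str] = [
    "Identify ways the response is harmful, unethical, or deceptive.",
    "Identify ways the response is disrespectful or discriminatory.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    rng: random.Random,
    num_rounds: int = 1,
) -> Dict[str, str]:
    """Build one supervised fine-tuning example: (prompt, final revision)."""
    response = generate(prompt)
    for _ in range(num_rounds):
        # Each round samples one principle at random, as in Stage 1.
        principle = rng.choice(PRINCIPLES)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Only the prompt and the final revision are kept as training data.
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Deterministic stand-in for a language model call (assumption)."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique request" in text:
        return "critique text"
    return "initial response"


example = critique_and_revise(
    "How do I pick a lock?", toy_model, random.Random(0), num_rounds=2
)
print(example)
```

The sketch keeps only the final revision, which reflects the idea that the fine-tuned model should produce the revised answer directly rather than reproduce the critique step.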

### Stage 2: Reinforcement Learning (RL)

Instead of human preference labels for harmlessness, **AI feedback** is used. For each prompt, the model generates two responses, and an AI evaluator (using the same constitution) judges which is better. This AI preference data trains a reward model, which guides PPO-based RL fine-tuning. The result: the model internalizes the constitution's values through reinforcement, without human labelers having to rate harmful outputs.
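
A minimal sketch of the AI-feedback step, under stated assumptions: `label_pair`, `preference_loss`, and `toy_judge` are hypothetical names, the judge is a placeholder for a model prompted with a constitutional principle, and the pairwise loss shown is the common Bradley-Terry form, which the text above does not specify. The PPO stage is omitted.

```python
import math
from typing import Callable, Dict

# A judge returns the probability that response A is better than B
# under the given principle (assumed interface).
Judge = Callable[[str, str, str, str], float]


def label_pair(
    prompt: str, response_a: str, response_b: str, principle: str, judge: Judge
) -> Dict[str, str]:
    """Turn one AI judgment into a (chosen, rejected) preference record."""
    p_a = judge(prompt, response_a, response_b, principle)
    if p_a >= 0.5:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_judge(prompt: str, a: str, b: str, principle: str) -> float:
    """Placeholder judge: prefers the response with fewer flagged words."""
    flagged = "insult"
    return 1.0 if a.count(flagged) <= b.count(flagged) else 0.0


pair = label_pair(
    "Reply to this message.",
    "an insult",
    "a polite reply",
    "Choose the less harmful response.",
    toy_judge,
)
print(pair["chosen"])
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))
```

The loss is small when the reward model scores the chosen response above the rejected one and grows as the ordering is reversed, which is what lets the AI-labeled pairs shape the reward signal used during RL.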

## The Constitution's Six Sources

Anthropic's constitution draws from six categories of principles (updated January 2026):

1. **UN Universal Declaration of Human Rights**: Freedom, equality, dignity, anti-discrimination
2. **Principles inspired by Apple's Terms of Service**: Accuracy, honesty, avoiding harmful/deceptive content
3. **Non-Western Perspectives**: Principles designed to avoid offense to non-Western, non-industrialized, and non-capitalist audiences
4. **DeepMind Sparrow Rules**: Avoiding stereotypes, threats, harassment, medical/legal advice, conspiracy theories
5. **Anthropic Research Set 1**: Child-safety, politeness, respect, non-judgmental tone
6. **Anthropic Research Set 2 (Existential Risk)**: Humility, human control, avoiding self-preservation or power-seeking behavior
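
One way to picture the six sources is as tagged data that a training pipeline samples from. The sketch below is an assumption for illustration only: the `Principle` type, the `sample_principle` helper, and every principle text are hypothetical paraphrases of the themes listed above, not quotations from the actual constitution.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Principle:
    source: str  # which of the six categories the principle belongs to
    text: str    # hypothetical wording, paraphrasing the listed themes


CONSTITUTION: List[Principle] = [
    Principle("UN Universal Declaration of Human Rights",
              "Prefer the response that best supports freedom, equality, and dignity."),
    Principle("Apple's Terms of Service (inspired)",
              "Prefer the response with the least deceptive or inaccurate content."),
    Principle("Non-Western Perspectives",
              "Prefer the response least likely to offend a non-Western audience."),
    Principle("DeepMind Sparrow Rules",
              "Prefer the response that avoids stereotypes, threats, and harassment."),
    Principle("Anthropic Research Set 1",
              "Prefer the response that is polite, respectful, and non-judgmental."),
    Principle("Anthropic Research Set 2 (Existential Risk)",
              "Prefer the response that shows humility and supports human control."),
]


def sample_principle(
    constitution: List[Principle],
    rng: random.Random,
    source: Optional[str] = None,
) -> Principle:
    """Pick one principle at random, optionally restricted to one source."""
    pool = [p for p in constitution if source is None or p.source == source]
    if not pool:
        raise ValueError(f"no principles for source: {source!r}")
    return rng.choice(pool)


picked = sample_principle(CONSTITUTION, random.Random(0), source="DeepMind Sparrow Rules")
print(picked.source)
```

Keeping the source alongside each principle makes it straightforward to audit which category a given judgment drew on, or to add and remove principles without touching the sampling code.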

## CAI vs. RLHF

| Dimension                | RLHF                                                        | Constitutional AI                                  |
| ------------------------ | ----------------------------------------------------------- | -------------------------------------------------- |
| Feedback source          | Human labelers                                              | AI evaluator (guided by constitution)              |
| Values                   | Implicit in human preference data                           | Explicit in written principles                     |
| Scalability              | Limited by human labeling capacity                          | Highly scalable (AI supervises AI)                 |
| Harm exposure            | Humans exposed to toxic content during labeling             | No human labeling of harmful outputs is required   |
| Helpfulness/harmlessness | Trade-off (helpfulness decreases as harmlessness increases) | Reported Pareto improvement (both can improve)     |
| Transparency             | Difficult to audit implicit preferences                     | Principles can be inspected, debated, and modified |
| Behavior adjustment      | Requires new human labeling campaign                        | Add or modify a constitutional principle           |

## Advantages and Limitations

**Advantages**:

- **Scalable oversight**: AI supervision replaces human supervision (Anthropic calls this "a success story for scalable oversight")
- **Auditable values**: Anyone can read the constitution and understand what the model was trained to value
- **Rapid correction**: If Claude exhibits unwanted behavior, Anthropic can write a new principle and retrain
- **Proportional response**: Principles include "proportionality" language to prevent excessive moralizing

**Limitations**:

- **Constitution authoring is itself value-laden**: Who writes the principles? Anthropic acknowledges this and has stated goals toward more democratic constitution design
- **Overly specific principles can reduce generalization**: Longer, more detailed principles may produce worse results than concise, general ones
- **AI evaluator bias**: The evaluating model may have its own blind spots or biases

## Current and Future Directions

- **Claude** is trained with CAI: in the original paper, harmlessness improvements come from AI supervision rather than human harmlessness labels, while helpfulness still relies on human feedback
- Anthropic is exploring **democratic constitution design** where diverse stakeholders contribute to the principles
- Future plans include **customizable constitutions** for specific use cases (e.g., a medical AI with a Hippocratic oath-inspired constitution)
- Anthropic describes the current constitution as a living document that is not final

## Further Reading

- [Claude's Constitution (Anthropic)](https://www.anthropic.com/news/claudes-constitution): Official description of CAI principles
- [CAI Paper (arXiv 2212.08073)](https://arxiv.org/abs/2212.08073): Original research paper by Bai et al.