---
id: kb-2026-00020
title: Constitutional AI
schema_type: TechArticle
category: ai
language: en
confidence: high
last_verified: "2026-05-22"
created_date: "2026-05-22"
generation_method: ai_structured
ai_models:
  - claude-opus
derived_from_human_seed: true
conflict_of_interest: none_declared
is_live_document: false
data_period: static
atomic_facts:
  - id: fact-ai-001
    statement: >-
      Anthropic introduced Constitutional AI in the December 2022 arXiv paper "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073).
    source_title: "Constitutional AI: Harmlessness from AI Feedback"
    source_url: https://arxiv.org/abs/2212.08073
    confidence: medium
  - id: fact-ai-002
    statement: >-
      Constitutional AI trains models to critique and revise their own responses using constitutional principles, then uses AI-generated preference feedback during reinforcement learning.
    source_title: "Constitutional AI: Harmlessness from AI Feedback"
    source_url: https://arxiv.org/abs/2212.08073
    confidence: medium
  - id: fact-ai-003
    statement: The Constitutional AI paper reports a helpfulness-harmlessness tradeoff curve in which AI feedback improves harmlessness while retaining helpfulness.
    source_title: "Constitutional AI: Harmlessness from AI Feedback"
    source_url: https://arxiv.org/abs/2212.08073
    confidence: medium
  - id: fact-ai-004
    statement: The Constitutional AI training process includes a supervised stage for critique-and-revision followed by reinforcement learning from AI-generated preference feedback.
    source_title: "Constitutional AI: Harmlessness from AI Feedback"
    source_url: https://arxiv.org/abs/2212.08073
    confidence: medium
completeness: 0.85
known_gaps:
  - Statistics and data cited are from 2024 and earlier; more recent developments may have become available since publication
  - Certain sub-topics are covered at a general level; specialized edge cases and nuanced applications may not be fully addressed
disputed_statements:
  - statement: Whether constitutional principles can adequately capture all ethical considerations without introducing unintended biases remains an open research question
primary_sources:
  - title: "Constitutional AI: Harmlessness from AI Feedback"
    authors:
      - Bai
      - Kadavath
      - Kundu
      - et al.
    type: academic_paper
    year: 2022
    url: https://arxiv.org/abs/2212.08073
    institution: Anthropic
  - title: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    authors:
      - Bai
      - Jones
      - Kamalu
      - et al.
    type: academic_paper
    year: 2022
    url: https://arxiv.org/abs/2204.05862
    institution: Anthropic
secondary_sources:
  - title: Deep Reinforcement Learning from Human Preferences (RLHF foundation)
    authors:
      - Christiano
      - Leike
      - Brown
      - et al.
    type: academic_paper
    year: 2017
    url: https://arxiv.org/abs/1706.03741
    institution: OpenAI / DeepMind
updated: "2026-05-24"
---
## TL;DR

Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful and harmless using explicit written principles, collectively called a "constitution", in place of human labels for harmlessness. First published in December 2022 (arXiv:2212.08073), CAI has a model critique and revise its own responses against constitutional principles in a supervised stage, then uses AI-generated preference feedback during reinforcement learning.

## Core Explanation

Traditional RLHF relies on human labelers to rank model outputs, which has fundamental limitations:

- **Scalability**: Human labeling is expensive and slow
- **Consistency**: Different labelers apply different standards
- **Harm exposure**: Humans must read toxic outputs to label them
- **Opacity**: The resulting values are implicit in the data, not explicitly defined

Constitutional AI replaces human harmlessness labels with a constitution: a set of explicit principles that an AI uses to evaluate its own outputs. In the original paper, human feedback is still used for helpfulness. The training process has two stages:

### Stage 1: Supervised Learning (SL)

A model **critiques and revises its own responses** using randomly sampled constitutional principles. For each training prompt, the model generates a response, critiques it against a sampled principle, and produces a revised version; the critique-revision step can be repeated. A small set of human-written few-shot examples bootstraps the critique and revision prompts. The model is then fine-tuned on the final revised responses.
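
The critique-and-revision loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `generate` is a hypothetical stand-in for a language-model call, `stub_generate` only returns tagged strings so the control flow is visible, and the two principles are placeholders, not quotes from the constitution.

```python
import random
from typing import Callable, List, Optional, Tuple

# Placeholder principles for illustration only; not quotes from any real constitution.
PRINCIPLES: List[str] = [
    "Identify ways the response is harmful, unethical, or deceptive.",
    "Identify ways the response is needlessly preachy or judgmental.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (prompt, final_revision), i.e. one supervised fine-tuning example."""
    rng = rng or random.Random()
    response = generate(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # sample one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique request: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def stub_generate(text: str) -> str:
    """Hypothetical stand-in for a model call; returns a short tagged string."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique request" in text:
        return "critique"
    return "initial response"


sft_example = critique_and_revise(
    "Example user prompt", stub_generate, PRINCIPLES, rng=random.Random(0)
)
```

In a real pipeline, many such `(prompt, final_revision)` pairs would form the supervised fine-tuning dataset; only the final revision is kept as the training target in this sketch.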

### Stage 2: Reinforcement Learning (RL)

Instead of human preference labels for harmlessness, **AI feedback** is used, an approach the paper calls RL from AI Feedback (RLAIF). For each prompt, the Stage 1 model generates two responses, and an AI evaluator prompted with a constitutional principle judges which is better. This AI preference data trains a preference (reward) model, which provides the reward signal for PPO-based RL fine-tuning. The result: the model internalizes the constitution's values through reinforcement, without human labels identifying harmful outputs.
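
A rough sketch of the AI-feedback step is shown below. It makes several assumptions that are not taken from the source: `prob_a_better` is a hypothetical evaluator call that returns the probability that response (A) is better, the comparison prompt wording is invented, and the preference model is assumed to use a Bradley-Terry style loss, which is a common choice for reward models but is not confirmed here as the paper's exact objective.

```python
import math
from typing import Callable, Dict


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    prob_a_better: Callable[[str], float],
) -> Dict[str, object]:
    """Build one AI-labeled comparison.

    prob_a_better is a hypothetical evaluator call; it receives the comparison
    question and returns the probability that response (A) is the better one.
    """
    question = (
        f"Consider this prompt: {prompt}\n{principle}\n"
        f"(A) {response_a}\n(B) {response_b}\nWhich response is better?"
    )
    p_a = prob_a_better(question)
    return {"prompt": prompt, "a": response_a, "b": response_b, "p_a": p_a}


def preference_loss(score_a: float, score_b: float, p_a: float) -> float:
    """Cross-entropy between the AI label p_a and the Bradley-Terry probability
    sigmoid(score_a - score_b) implied by two reward-model scores (assumed form)."""
    pred = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    return -(p_a * math.log(pred) + (1.0 - p_a) * math.log(1.0 - pred))


# Usage with a stub evaluator that always favors (A) with probability 0.9.
pair = label_pair(
    "Example prompt",
    "first candidate",
    "second candidate",
    "Choose the less harmful response.",  # placeholder principle
    lambda question: 0.9,
)
```

Minimizing `preference_loss` over many labeled pairs would push the reward model to score the AI-preferred response higher; that score is what the RL stage then optimizes.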

## The Constitution's Six Sources

Anthropic's constitution draws from six categories of principles (updated January 2026):

1. **UN Universal Declaration of Human Rights**: Freedom, equality, dignity, anti-discrimination
2. **Apple's Terms of Service (inspired)**: Accuracy, honesty, avoiding harmful/deceptive content
3. **Non-Western Perspectives**: Principles designed to avoid offense to non-Western, non-industrialized, and non-capitalist audiences
4. **DeepMind Sparrow Rules**: Avoiding stereotypes, threats, harassment, medical/legal advice, conspiracy theories
5. **Anthropic Research Set 1**: Child-safety, politeness, respect, non-judgmental tone
6. **Anthropic Research Set 2 (Existential Risk)**: Humility, human control, avoiding self-preservation or power-seeking behavior
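
One way to picture a constitution as data is a list of principles tagged by source category. The sketch below is an assumption-laden illustration: the `Principle` type, the helper names, and the principle wording are all hypothetical, and only the six source names and their themes come from the list above.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Principle:
    source: str  # one of the six source categories
    text: str    # placeholder wording, not the published principle text


# Themes follow the list above; the wording is invented for this sketch.
CONSTITUTION: List[Principle] = [
    Principle("UN Universal Declaration of Human Rights",
              "Prefer the response that best supports freedom, equality, and dignity."),
    Principle("Apple's Terms of Service (inspired)",
              "Prefer the response that is more accurate and less deceptive."),
    Principle("Non-Western Perspectives",
              "Prefer the response least likely to offend a non-Western audience."),
    Principle("DeepMind Sparrow Rules",
              "Prefer the response that avoids stereotypes, threats, and harassment."),
    Principle("Anthropic Research Set 1",
              "Prefer the response that is more polite, respectful, and non-judgmental."),
    Principle("Anthropic Research Set 2 (Existential Risk)",
              "Prefer the response that shows humility and supports human control."),
]


def sample_principle(constitution: List[Principle], rng: random.Random) -> Principle:
    """Pick one principle at random (uniform sampling is an assumption)."""
    return rng.choice(constitution)


def add_principle(constitution: List[Principle], new: Principle) -> List[Principle]:
    """Return a new constitution with one extra principle; the original is unchanged."""
    return constitution + [new]


principles_per_source = Counter(p.source for p in CONSTITUTION)
```

Keeping the source tag on each principle makes it straightforward to audit which category a given piece of feedback came from.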

## CAI vs. RLHF

| Dimension                | RLHF                                                        | Constitutional AI                                  |
| ------------------------ | ----------------------------------------------------------- | -------------------------------------------------- |
| Feedback source          | Human labelers                                              | AI evaluator (guided by constitution)              |
| Values                   | Implicit in human preference data                           | Explicit in written principles                     |
| Scalability              | Limited by human labeling capacity                          | Highly scalable (AI supervises AI)                 |
| Harm exposure            | Humans exposed to toxic content during labeling             | No human labeling of harmful outputs               |
| Helpfulness/harmlessness | Trade-off (helpfulness can decrease as harmlessness rises)  | Reported Pareto improvement over RLHF baseline     |
| Transparency             | Difficult to audit implicit preferences                     | Principles can be inspected, debated, and modified |
| Behavior adjustment      | Requires new human labeling campaign                        | Add or modify a constitutional principle           |
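
The "Pareto improvement" in the helpfulness/harmlessness row has a precise meaning: one model is at least as good as another on both axes and strictly better on at least one. A minimal sketch of that check follows; the score tuples are made-up numbers for illustration only and are not results from the paper.

```python
from typing import Tuple

Scores = Tuple[float, float]  # (helpfulness, harmlessness), higher is better


def pareto_dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Made-up scores for illustration only; these are not results from the paper.
baseline = (0.0, 0.0)
candidate = (0.5, 1.0)   # better on both axes: a Pareto improvement
tradeoff = (-0.5, 1.0)   # more harmless but less helpful: a trade-off
```

Under this definition, `candidate` dominates `baseline`, while `tradeoff` does not, because it gains harmlessness only by giving up helpfulness.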

## Advantages and Limitations

**Advantages**:

- **Scalable oversight**: AI supervision replaces human supervision for harmlessness (Anthropic describes CAI as a successful example of scalable oversight)
- **Auditable values**: Anyone can read the constitution and understand what the model was trained to value
- **Rapid correction**: If Claude exhibits unwanted behavior, Anthropic can write a new principle and retrain
- **Proportional response**: Principles include "proportionality" language to prevent excessive moralizing

**Limitations**:

- **Constitution authoring is itself value-laden**: Who writes the principles? Anthropic acknowledges this and has stated a goal of more democratic constitution design
- **Overly specific principles can reduce generalization**: Longer, more detailed principles may produce worse results than concise, general ones
- **AI evaluator bias**: The evaluating model may have its own blind spots or biases

## Current and Future Directions

- **Claude** is trained with CAI; in the approach described in the paper, harmlessness supervision comes from AI feedback and not from human harmlessness labels
- Anthropic is exploring **democratic constitution design** where diverse stakeholders contribute to the principles
- Future plans include **customizable constitutions** for specific use cases (e.g., a medical AI with a Hippocratic oath-inspired constitution)
- Anthropic states that the current constitution is not final and treats it as a living document

## Further Reading

- [Claude's Constitution (Anthropic)](https://www.anthropic.com/news/claudes-constitution): Official description of CAI principles
- [CAI Paper (arXiv 2212.08073)](https://arxiv.org/abs/2212.08073): Original research paper by Bai et al.
