# Post-Training Alignment: RLHF, DPO, and Constitutional AI

Status: public · Confidence: medium (0.82) · Basis: verified_sources

## TL;DR

Post-training alignment adapts a pretrained language model so that it follows instructions, prefers helpful answers, and produces unsafe output less often. RLHF, DPO, and Constitutional AI are three important approaches, but none of them removes the need for evaluation.

## Core Explanation

RLHF usually proceeds in three stages: supervised fine-tuning, training a reward model on human preference comparisons, and optimizing the model against that reward model with a reinforcement-learning method such as PPO. DPO shortens this pipeline by skipping the separate reward model and the RL loop, and instead optimizes the model directly on pairs of preferred and dispreferred responses. Constitutional AI gives the model a set of written principles: the model critiques and revises its own outputs against them, and AI feedback is then used as part of the harmlessness training process.
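To make the difference between the reward-model step and the DPO step concrete, here is a minimal sketch of the two per-pair losses in their commonly published forms: a pairwise logistic (Bradley-Terry style) loss for the reward model, and the DPO loss on log-probability ratios against a frozen reference model. The function names, the scalar inputs, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library; real training code would operate on batched tensors of summed token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss for a reward model (hypothetical helper).

    Pushes the scalar reward of the preferred response above the
    reward of the dispreferred one.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for a single preference pair (hypothetical helper).

    Each *_logp argument is the log-probability a model assigns to the
    whole response. No reward model is involved: the log-ratio against
    the reference model plays the role of an implicit reward.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Both losses have the same logistic shape. The difference is what is being compared: explicit reward scores in the first case, and policy-versus-reference log-ratios in the second, which is why DPO can drop the reward model and the RL loop.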

The practical lesson is that alignment is a post-training layer, not a safety guarantee. It shifts model behavior toward the collected preferences, written principles, and evaluation targets it was trained on, so the result has to be measured continuously rather than assumed.

## Related Articles

- [RLHF: Reinforcement Learning from Human Feedback](../rlhf.md)
- [AI Red Teaming: Security Testing for Language Models](../ai-red-teaming-and-safety.md)
- [AI Training Data Curation: Quality at Scale](../ai-training-data-curation.md)