# Post-Training Alignment: DPO, GRPO, and Beyond

## TL;DR
Post-training alignment turns a pretrained base model into one that follows instructions and behaves safely. DPO simplified RLHF by removing the separately trained reward model; GRPO, which drops PPO's critic, was the RL algorithm behind DeepSeek-R1's reasoning results. The alignment pipeline (SFT→DPO/RL→reasoning) is now widely used across the industry.

## Core Explanation
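The alignment workflow has three stages: (1) Supervised Fine-Tuning (SFT) on instruction-following demonstrations. (2) Preference alignment, using PPO-RLHF (a four-model pipeline: policy, reference model, reward model, and critic), DPO (direct optimization on preference pairs, no separate reward model), or GRPO (group-based RL that replaces the critic with a baseline computed from a group of sampled responses). (3) Optional reasoning training. KTO and ORPO are related variants of DPO: KTO learns from unpaired binary feedback, where each response is labeled only as desirable or undesirable, while ORPO still uses preference pairs but drops the reference model and folds preference optimization into the SFT stage.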
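To make "group-based RL, no critic" concrete, the sketch below shows the group-relative advantage that GRPO uses in place of a learned value function: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the group's mean and standard deviation. This is a minimal illustration, not a full training loop. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for this example; real implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration. The group mean acts as the
    baseline that a critic would otherwise have to estimate.
    """
    mean = statistics.fmean(rewards)
    # Population standard deviation is an assumption here; some
    # implementations use the sample standard deviation instead.
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group average get a positive advantage and are reinforced; those below it get a negative one. When all responses in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal.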

## Detailed Analysis
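PPO-RLHF was the original approach (InstructGPT, ChatGPT). Its complexity (training a reward model, running policy optimization against it, and constraining the policy with a KL-divergence penalty toward a reference model) motivated simpler alternatives. DPO treats alignment as a binary classification problem on preference pairs: it raises the policy's likelihood of the preferred response relative to the rejected one, measured against a frozen reference model. Constitutional AI (Anthropic) uses principle-based self-critique to generate training data, which reduces the need for human-labeled feedback.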
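The classification view of DPO can be written as a per-pair loss: the negative log-sigmoid of a scaled margin between the policy-versus-reference log-ratios of the chosen and rejected responses. The sketch below computes that loss for a single pair in plain Python. It assumes the four inputs are sequence-level log-probabilities that have already been summed over tokens; the function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere in the computation: the log-ratio against the reference model plays the role of an implicit reward, and `beta` controls how strongly the policy is kept close to the reference.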

## Further Reading
- Hugging Face Alignment Handbook
- OpenAI: InstructGPT Paper
- Lilian Weng: Learning from Human Preferences