# Reinforcement Learning from Human Feedback (RLHF)

Status: public
Confidence: medium (0.855) (verified)
Last verified: 2026-05-30
Generation: ai_structured

## TL;DR

Reinforcement Learning from Human Feedback, or RLHF, is a training pattern that uses human preference judgments to shape model behavior. In language-model training, the familiar form is supervised fine-tuning, reward-model training from comparisons, and policy optimization against that reward model.

## Core Claims

- RLHF starts from preference data rather than only from demonstrations. Human comparisons are used to train a reward model that predicts which outputs people prefer.
- InstructGPT made this pattern central for instruction-following language models. Its pipeline combined supervised fine-tuning, a reward model trained on ranked outputs, and PPO-based reinforcement learning.
- DPO is a useful boundary case: it also uses preference data, but optimizes a policy through a direct objective rather than the same explicit reward-model plus reinforcement-learning loop used in standard RLHF.

## Citation Boundaries

Use this article for stable RLHF concepts. Do not use it for claims about the private alignment methods of a current commercial model, or for claims that RLHF fully solves safety, truthfulness, or reward hacking.

## Further Reading

- [Deep Reinforcement Learning from Human Preferences](https://arxiv.org/abs/1706.03741)
- [Training Language Models to Follow Instructions with Human Feedback](https://arxiv.org/abs/2203.02155)
- [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)
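
## Illustrative Sketch

The contrast drawn in Core Claims between a reward model trained on comparisons and DPO's direct objective can be made concrete with the per-example losses. The following is a minimal sketch, not a training implementation: it assumes a single preference pair, scalar reward scores, and summed sequence log-probabilities that are already computed. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for a scalar."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for one human comparison.

    r_chosen and r_rejected are the reward model's scalar scores for the
    preferred and the dispreferred output. The loss shrinks as the model
    scores the preferred output higher than the other one.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, an assumption of this sketch
) -> float:
    """DPO loss for one preference pair.

    Inputs are log-probabilities of each full output under the policy being
    trained and under a frozen reference model. No separate reward model is
    used: the log-ratio against the reference plays that role.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(reward_model_loss(r_chosen=1.5, r_rejected=0.2))
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Both losses have the same shape, a negative log-sigmoid of a margin between the preferred and dispreferred output. The difference is where the margin comes from: a learned scalar reward in the first case, and the policy's own log-probabilities relative to a reference model in the second. In the standard pipeline, the trained reward model is then used in a separate reinforcement-learning step, which this sketch does not cover.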