# Reinforcement Learning from Human Feedback (RLHF)
Status: public
Confidence: medium (0.855) (verified)
Last verified: 2026-05-30
Generation: ai_structured


## TL;DR

Reinforcement Learning from Human Feedback, or RLHF, is a training pattern that uses human preference judgments to shape model behavior. In language-model training, the familiar form has three stages: supervised fine-tuning on demonstrations, reward-model training from human comparisons, and policy optimization against that reward model.

## Core Claims

RLHF starts from preference data rather than only from demonstrations. Human comparisons are used to train a reward model that predicts which outputs people prefer.
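
To make the reward-model step concrete, here is a minimal sketch of a pairwise preference loss. It assumes a Bradley-Terry model of comparisons, which is a common choice but not the only one; the function name `pairwise_preference_loss` and the use of plain scalar rewards are illustrative assumptions, not part of any particular library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison under an assumed Bradley-Terry model.

    The model says P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    The loss is the negative log of that probability.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # Branch on the sign so that exp() never receives a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar scores a reward model might assign to two outputs.
agrees_with_label = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_label = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The loss is small when the reward model scores the human-preferred output higher and grows as the ranking is reversed, so minimizing it over many comparisons pushes the model toward predicting what people prefer.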

InstructGPT made this pattern central for instruction-following language models. Its pipeline combined supervised fine-tuning, a reward model trained on ranked outputs, and reinforcement learning with Proximal Policy Optimization (PPO).
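
One detail of the reinforcement-learning stage is easier to see in code: the policy is usually not rewarded on the reward-model score alone, but on that score minus a penalty for drifting away from a reference model. The sketch below is a simplified, sequence-level version of that idea. The InstructGPT paper describes a per-token KL penalty against the supervised model; the summed form, the function name `kl_shaped_reward`, and the default `beta` are assumptions made for illustration.

```python
from typing import Sequence


def kl_shaped_reward(
    reward_model_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus a scaled log-ratio penalty (illustrative).

    policy_logprobs and reference_logprobs hold per-token log-probabilities
    of the same sampled response under the two models.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    # Summed log(pi_policy / pi_reference) over the sampled tokens; this is a
    # single-sample estimate of the KL divergence from the reference model.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * log_ratio


# Hypothetical numbers: the policy assigns more probability to its own
# sample than the reference does, so the shaped reward is reduced.
shaped = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.5)
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward gains; the value that works in practice depends on the setup and is not specified here.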

DPO is a useful boundary case: it also learns from preference data, but it optimizes the policy directly with a classification-style loss on preference pairs, without training an explicit reward model or running the reinforcement-learning loop that standard RLHF uses.
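
The contrast with the reward-model pipeline shows up in the shape of the objective. Below is a minimal sketch of the DPO loss for a single preference pair, following the form given in the DPO paper. The function name `dpo_loss`, the scalar sequence log-probabilities, and the default `beta` are illustrative assumptions; a real implementation would compute these log-probabilities from model outputs and average over a batch.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair, given sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(logit)), written to avoid overflow in exp().
    if logit > 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


# Hypothetical log-probabilities. At initialization the policy equals the
# reference, so the logit is zero and the loss is log(2).
at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
after_update = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No separate reward model or sampling loop appears here: the loss depends only on log-probabilities from the policy and a reference model, and it decreases as the policy raises the chosen response relative to the rejected one.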

## Citation Boundaries

Use this article for stable RLHF concepts. Do not use it for claims about the private alignment methods of a current commercial model, or for claims that RLHF fully solves safety, truthfulness, or reward hacking.

## Further Reading

- [Deep Reinforcement Learning from Human Preferences](https://arxiv.org/abs/1706.03741)
- [Training Language Models to Follow Instructions with Human Feedback](https://arxiv.org/abs/2203.02155)
- [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)