# AnchorFact Evidence Pack: RLHF

Generated: 2026-06-07T05:31:15.214Z
Provenance: https://anchorfact.org/provenance.json
Results: 4

Citation contract: cite only public claims; include confidence, AnchorFact claim URL, and original source URL.

## Reinforcement Learning from Human Feedback (RLHF)

- Article: https://anchorfact.org/ai/rlhf/
- Confidence: medium
- Matched keywords: rlhf

### Claims
- Deep reinforcement learning from human preferences trains a reward predictor from human comparisons and uses that learned reward to train an agent. [AnchorFact: Reinforcement Learning from Human Feedback (RLHF); medium confidence; source: Deep Reinforcement Learning from Human Preferences (https://arxiv.org/abs/1706.03741)](https://anchorfact.org/fact/fact-ai-rlhf-1)
- InstructGPT used supervised fine-tuning, a reward model trained from human preference comparisons, and reinforcement learning with PPO to train instruction-following language models. [AnchorFact: Reinforcement Learning from Human Feedback (RLHF); medium confidence; source: Training Language Models to Follow Instructions with Human Feedback (https://arxiv.org/abs/2203.02155)](https://anchorfact.org/fact/fact-ai-rlhf-2)
- Direct Preference Optimization is an alternative to standard RLHF that optimizes the policy directly on preference data, without fitting a separate, explicit reward model. [AnchorFact: Reinforcement Learning from Human Feedback (RLHF); medium confidence; source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)](https://anchorfact.org/fact/fact-ai-rlhf-3)
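
The reward-predictor claims above can be illustrated with a pairwise comparison loss. This is a minimal sketch, assuming a Bradley-Terry style formulation in which the reward model assigns one scalar score to each response; the function name and the scalar inputs are hypothetical and stand in for the outputs of a real reward network.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Assumes a Bradley-Terry style model:
        P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected)
    Both rewards are hypothetical scalar scores from a reward model.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the preferred response is scored further above the other.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss over many human comparisons pushes the score of preferred responses above the score of rejected ones; the learned score can then serve as the reward signal for policy training.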

### Sources
- Training Language Models to Follow Instructions with Human Feedback (tier A, academic_paper) - https://arxiv.org/abs/2203.02155
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (tier A, conference_paper) - https://arxiv.org/abs/2305.18290
- Deep Reinforcement Learning from Human Preferences (tier S, academic_paper) - https://arxiv.org/abs/1706.03741

## Post-Training Alignment: RLHF, DPO, and Constitutional AI

- Article: https://anchorfact.org/ai/post-training-alignment/
- Confidence: medium
- Matched keywords: rlhf

### Claims
- InstructGPT used human demonstrations and preference comparisons to train instruction-following models, including a reward model and PPO-based policy optimization. [AnchorFact: Post-Training Alignment: RLHF, DPO, and Constitutional AI; medium confidence; source: Training Language Models to Follow Instructions with Human Feedback (https://arxiv.org/abs/2203.02155)](https://anchorfact.org/fact/af-post-training-alignment-1)
- Direct Preference Optimization formulates preference learning as a direct objective on preferred and dispreferred responses, avoiding an explicit reward-model training step. [AnchorFact: Post-Training Alignment: RLHF, DPO, and Constitutional AI; medium confidence; source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)](https://anchorfact.org/fact/af-post-training-alignment-2)
- Constitutional AI uses written principles to guide model self-critique and revision, then uses AI-generated preference feedback for harmlessness training. [AnchorFact: Post-Training Alignment: RLHF, DPO, and Constitutional AI; medium confidence; source: Constitutional AI: Harmlessness from AI Feedback (https://arxiv.org/abs/2212.08073)](https://anchorfact.org/fact/af-post-training-alignment-3)
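
The "direct objective on preferred and dispreferred responses" in the DPO claim above can be sketched for a single preference pair. This is an illustrative sketch rather than a reference implementation: the function name, the argument names, and the default `beta=0.1` are assumptions, and the inputs are assumed to be summed token log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    loss = -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    where each logratio is log pi_policy(y | x) - log pi_ref(y | x).
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) == softplus(-logits), computed in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy equals the reference model, the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

No reward model appears in the computation: the log-ratio between policy and reference plays the role of an implicit reward, which is why no separate reward-model training step is needed.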

### Sources
- Training Language Models to Follow Instructions with Human Feedback (tier A, academic_paper) - https://arxiv.org/abs/2203.02155
- Constitutional AI: Harmlessness from AI Feedback (tier A, academic_paper) - https://arxiv.org/abs/2212.08073
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (tier A, conference_paper) - https://arxiv.org/abs/2305.18290

## Reinforcement Learning: From Q-Learning to RLHF

- Article: https://anchorfact.org/ai/reinforcement-learning-from-q-learning-to-rlhf/
- Confidence: medium
- Matched keywords: rlhf

### Claims
- Deep Q-Networks combine Q-learning with deep neural networks and were evaluated on Atari games from high-dimensional sensory input. [AnchorFact: Reinforcement Learning: From Q-Learning to RLHF; medium confidence; source: Playing Atari with Deep Reinforcement Learning (https://arxiv.org/abs/1312.5602)](https://anchorfact.org/fact/af-rl-qlearning-rlhf-1)
- Proximal Policy Optimization is presented as a policy-gradient method that alternates between sampling data through interaction with the environment and optimizing a clipped surrogate objective. [AnchorFact: Reinforcement Learning: From Q-Learning to RLHF; medium confidence; source: Proximal Policy Optimization Algorithms (https://arxiv.org/abs/1707.06347)](https://anchorfact.org/fact/af-rl-qlearning-rlhf-2)
- The InstructGPT paper describes collecting human preference comparisons between model outputs and training a reward model on those comparisons. [AnchorFact: Reinforcement Learning: From Q-Learning to RLHF; medium confidence; source: Training Language Models to Follow Instructions with Human Feedback (https://arxiv.org/abs/2203.02155)](https://anchorfact.org/fact/af-rl-qlearning-rlhf-3)
- The InstructGPT paper describes using reinforcement learning against the learned reward model to fine-tune language models. [AnchorFact: Reinforcement Learning: From Q-Learning to RLHF; medium confidence; source: Training Language Models to Follow Instructions with Human Feedback (https://arxiv.org/abs/2203.02155)](https://anchorfact.org/fact/af-rl-qlearning-rlhf-4)
- For AI agents in games or coding systems, reinforcement-learning claims should identify the environment, reward signal, policy constraints, and evaluation protocol. [AnchorFact: Reinforcement Learning: From Q-Learning to RLHF; medium confidence; source: Proximal Policy Optimization Algorithms (https://arxiv.org/abs/1707.06347)](https://anchorfact.org/fact/af-rl-qlearning-rlhf-5)
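
The clipped surrogate objective named in the PPO claim above can be shown for a single sampled action. This is a minimal sketch, assuming the probability ratio and the advantage estimate have already been computed; the function name and the default `clip_eps=0.2` are illustrative choices, not values taken from this pack.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for one sample (to be maximized).

    ratio     : pi_new(a | s) / pi_old(a | s) for the sampled action
    advantage : estimated advantage of that action
    clip_eps  : assumed clipping range around 1.0
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # gain is capped once the ratio exceeds 1 + clip_eps
print(clipped_surrogate(0.5, -1.0))  # the pessimistic (lower) value is kept
```

In a full training loop this quantity is averaged over a batch of sampled transitions and maximized for several epochs before new data is collected.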

### Sources
- Training Language Models to Follow Instructions with Human Feedback (tier A, academic_paper) - https://arxiv.org/abs/2203.02155
- Proximal Policy Optimization Algorithms (tier A, academic_paper) - https://arxiv.org/abs/1707.06347
- Playing Atari with Deep Reinforcement Learning (tier A, academic_paper) - https://arxiv.org/abs/1312.5602

## Instruction Tuning

- Article: https://anchorfact.org/ai/instruction-tuning/
- Confidence: medium
- Matched keywords: none

### Claims
- FLAN fine-tuned a pretrained language model on many tasks expressed through natural-language instructions. [AnchorFact: Instruction Tuning; medium confidence; source: Finetuned Language Models Are Zero-Shot Learners (https://arxiv.org/abs/2109.01652)](https://anchorfact.org/fact/fact-instruction-tuning-001)
- The FLAN ablations report that the number of fine-tuning datasets, model scale, and natural-language instructions are key factors in instruction-tuning success. [AnchorFact: Instruction Tuning; medium confidence; source: Finetuned Language Models Are Zero-Shot Learners (https://arxiv.org/abs/2109.01652)](https://anchorfact.org/fact/fact-instruction-tuning-002)
- Scaling Instruction-Finetuned Language Models studies instruction finetuning by scaling task count, model size, and chain-of-thought data. [AnchorFact: Instruction Tuning; medium confidence; source: Scaling Instruction-Finetuned Language Models (https://arxiv.org/abs/2210.11416)](https://anchorfact.org/fact/fact-instruction-tuning-003)
- InstructGPT collected labeler-written demonstrations and used them to fine-tune GPT-3 with supervised learning before RLHF. [AnchorFact: Instruction Tuning; medium confidence; source: Training Language Models to Follow Instructions with Human Feedback (https://arxiv.org/abs/2203.02155)](https://anchorfact.org/fact/fact-instruction-tuning-004)
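
Expressing a task "through natural-language instructions", as the FLAN claims above describe, amounts to rendering each labeled example as a prompt and a target completion. The sketch below is purely illustrative: the template, the field names, and the sample data are hypothetical and are not the templates used by FLAN or InstructGPT.

```python
def to_instruction_example(instruction: str, input_text: str, target: str) -> dict:
    """Render one labeled example as a prompt/completion pair for supervised fine-tuning.

    The layout (instruction, blank line, input, blank line, "Answer:") is a
    hypothetical template chosen for illustration only.
    """
    prompt = f"{instruction}\n\n{input_text}\n\nAnswer:"
    return {"prompt": prompt, "completion": f" {target}"}


example = to_instruction_example(
    instruction="Decide whether the review is positive or negative.",
    input_text="The battery died after two days.",
    target="negative",
)
print(example["prompt"] + example["completion"])
```

Applying several differently worded templates to many datasets yields the multi-task mixture that instruction tuning fine-tunes on.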

### Sources
- Training Language Models to Follow Instructions with Human Feedback (tier A, academic_paper) - https://arxiv.org/abs/2203.02155
- Finetuned Language Models Are Zero-Shot Learners (tier A, academic_paper) - https://arxiv.org/abs/2109.01652
- Scaling Instruction-Finetuned Language Models (tier A, academic_paper) - https://arxiv.org/abs/2210.11416