## TL;DR
Deep reinforcement learning has grown from DQN into several distinct algorithm families: PPO is a widely used default for on-policy training, including continuous control; SAC is an off-policy method valued for sample efficiency and entropy-driven exploration; Dreamer learns a world model; and Decision Transformer reframes RL as sequence modeling.

## Core Explanation
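The RL loop: the agent observes state s, takes action a, receives reward r, and transitions to state s'. The goal is to maximize expected cumulative (usually discounted) reward. There are three main algorithm families:

- Value-based (DQN): learn Q(s,a) and act greedily with respect to it, typically with ε-greedy exploration during training.
- Policy-based (REINFORCE): directly optimize the policy π(a|s) by gradient ascent on expected return.
- Actor-critic (PPO, SAC): learn a policy (the actor) together with a value function (the critic).

PPO keeps updates stable by clipping the importance-sampling ratio between the new and old policies. SAC adds an entropy bonus to the objective, which encourages exploration. Dreamer learns a world model and trains its policy on trajectories imagined in the model's latent space.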
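
To make the PPO clipping and the entropy bonus concrete, here is a minimal pure-Python sketch. The function names `ppo_clip_objective` and `policy_entropy` are hypothetical, and `clip_eps=0.2` is only an assumed, commonly used value; real implementations work on batched tensors and add value-function and other loss terms.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Importance-sampling ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution.

    Entropy-regularized methods such as SAC add a scaled version of this
    term to the objective so the policy keeps exploring.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

With a positive advantage, a ratio of 1.5 contributes as if it were 1.2, so the update gains nothing from moving the policy further in that direction. A uniform policy maximizes `policy_entropy`, while a deterministic one scores zero.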

## Detailed Analysis
Offline RL trains from a fixed dataset without further environment interaction. Decision Transformer treats trajectories as sequences and applies causally masked self-attention: conditioned on the return-to-go, past states, and past actions, it predicts the next action. CQL (Conservative Q-Learning) counters Q-value overestimation on out-of-distribution actions by pushing down Q-values for actions the dataset does not contain.

Model-based RL (Dreamer, MuZero) learns a model of the environment dynamics and plans or trains in its latent space, which can substantially improve sample efficiency.
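
Two of the offline RL ideas above are small enough to sketch in pure Python: the return-to-go that Decision Transformer conditions on, and a CQL-style conservative penalty for a single state with discrete actions. The names `returns_to_go`, `interleave_tokens`, and `cql_penalty` are hypothetical. The token layout is a simplified illustration, and the penalty follows the common log-sum-exp form of CQL without its weighting coefficient or the Bellman error term.

```python
import math

def returns_to_go(rewards):
    """Undiscounted return-to-go: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg

def interleave_tokens(rewards, states, actions):
    """Lay out a trajectory as (return-to-go, state, action) triples,
    the order in which a Decision Transformer style model reads it."""
    tokens = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens

def cql_penalty(q_values, data_action):
    """CQL-style regularizer for one state with discrete actions:
    logsumexp_a Q(s, a) - Q(s, a_data).

    Minimizing it lowers Q-values across all actions while raising the
    Q-value of the action actually present in the dataset.
    """
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]
```

For rewards `[1.0, 0.0, 2.0]`, the returns-to-go are `[3.0, 2.0, 2.0]`: each entry is the total reward still to be collected from that step onward. The penalty is never negative, and it grows when an action outside the dataset has an inflated Q-value.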

## Further Reading
- Spinning Up in Deep RL (OpenAI)
- Stable-Baselines3 Library
- RLHF in Practice