Diffusion Models — AnchorFact

## TL;DR

Diffusion models are generative models that create data (images, video, audio, 3D) by learning to reverse a gradual noise-adding process. Starting from pure random noise, they iteratively denoise toward a coherent output. Popularized by DDPM (Ho et al., 2020, UC Berkeley, 15,000+ citations on Google Scholar as of May 2026) and made practical by Stable Diffusion / Latent Diffusion Models (Rombach et al., 2022), diffusion has surpassed GANs as the dominant paradigm for high-quality image generation and has expanded to video (Sora), audio (AudioLDM), 3D (DreamFusion), and biology (AlphaFold 3).

## Core Explanation

Diffusion models work in two phases:

### Forward Process (Training)

Starting from a real data point x₀ (an image), Gaussian noise is gradually added over T timesteps according to a predefined variance schedule β₁, ..., β_T:

```
q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I)
```

After T steps (typically T=1000), x_T is pure isotropic Gaussian noise. The model learns to predict the noise that was added at each step ε_θ(x_t, t) — essentially learning to reverse the destruction process:

```
L = E_{x₀,ε,t}[||ε - ε_θ(x_t, t)||²]
```

This is remarkably simple: the training objective is just a noise prediction regression. No adversarial dynamics, no mode collapse — just predicting which Gaussian noise was added at each step.

### Reverse Process (Generation)

Starting from pure noise x_T ~ N(0, I), the model iteratively removes predicted noise:

```
x_{t-1} = (1/√(1-β_t)) · (x_t - (β_t/√(1-ᾱ_t))·ε_θ(x_t, t)) + σ_t·z
```

Over T steps (or fewer using accelerated samplers like DDIM), this converges to a realistic data point. The process is a Markov chain: each step depends only on the previous state and the current timestep embedding.

## Detailed Analysis

### Why Diffusion Superseded GANs

| Property | GANs | Diffusion Models |
|----------|:----:|:----------------:|
| Training stability | Unstable (adversarial game) | Stable (simple regression loss) |
| Mode coverage | Prone to mode collapse | Naturally covers full data distribution |
| Sample diversity | Can be limited | High diversity |
| Quality at scale | Excellent (StyleGAN) | Excellent, scales predictably |
| Training/inference speed | Fast inference, slow training | Fast training, slow inference (mitigated by DDIM, LCM) |
| Controllability | Limited (requires conditional GAN) | Excellent (cross-attention conditioning) |

The training stability advantage is decisive: GANs require careful balancing of generator and discriminator — if either becomes too strong, training collapses. Diffusion models avoid this entirely by framing generation as a denoising problem with a simple regression objective.

### Latent Diffusion Models (Stable Diffusion)

The original DDPM operates in pixel space, which is computationally prohibitive for high resolution. For a 1024×1024×3 image:
- Pixel-space diffusion: 3.1M dimensions per step × 1000 steps = enormous
- Latent diffusion (LDM): compress to 64×64×4 latent space via VAE → 16K dimensions

This 200× compression reduces training time from GPU-weeks to GPU-hours while improving quality. The VAE encoder E maps images to latent codes; diffusion happens entirely in latent space; the VAE decoder D reconstructs the final image. Text conditioning is added via cross-attention with CLIP text embeddings at each U-Net layer.

Stable Diffusion was trained on **LAION-5B**, a dataset of 5.85 billion image-text pairs scraped from the web, and released as open-source in August 2022 by Stability AI, RunwayML, and the CompVis group at LMU Munich. It democratized image generation: previous systems (DALL·E 2) required API access; Stable Diffusion ran on a consumer GPU.

### Major Implementations Timeline

| System | Developer | Release | Key Innovation |
|--------|-----------|:------:|---------------|
| DDPM | Ho et al. (UC Berkeley) | Jun 2020 | Proved diffusion can generate high-quality images |
| DALL·E 2 | OpenAI | Apr 2022 | CLIP-guided diffusion + inpainting |
| Stable Diffusion | Stability AI / CompVis | Aug 2022 | Latent diffusion, open-source, consumer GPU |
| Midjourney | Midjourney Inc. | Jul 2022 | Proprietary aesthetic tuning, Discord-native |
| Imagen | Google | May 2022 | Large frozen T5 text encoders for conditioning |
| DALL·E 3 | OpenAI | Oct 2023 | Captioner-to-image pipeline (GPT-4V integrated) |
| SDXL | Stability AI | Jul 2023 | 2.6B params, improved composition |
| Sora | OpenAI | Feb 2024 | Video diffusion via spacetime patches |
| Stable Diffusion 3 | Stability AI | Feb 2024 | MMDiT architecture (joint image-text transformer) |
| Flux | Black Forest Labs | Aug 2024 | Flow matching, 12B params, state-of-art quality |

### Beyond Images

| Domain | Representative Work | Key Approach |
|--------|-------------------|-------------|
| **Video** | Sora (OpenAI, 2024), Runway Gen-3 | Treats video as 3D grid of spacetime patches |
| **Audio** | AudioLDM (2023), MusicGen (Meta, 2023) | Latent diffusion on mel-spectrograms; text-to-music |
| **3D** | DreamFusion (Google, 2022), Zero123 (2023) | Score Distillation Sampling; multi-view diffusion |
| **Biology** | AlphaFold 3 (DeepMind, 2024) | Diffusion module for protein-ligand structure prediction |
| **Motion** | MDM (2023), MotionDiffuse (2023) | Human motion generation for animation and robotics |

## Further Reading

- [DDPM Paper](https://arxiv.org/abs/2006.11239): Original diffusion models paper (15K+ citations)
- [Stable Diffusion / LDM](https://arxiv.org/abs/2112.10752): Latent diffusion for efficient generation
- [Score-Based SDE](https://arxiv.org/abs/2011.13456): Unified framework for diffusion and score-based models