Diffusion Models: DDPM, Stable Diffusion, and Score-Based Generative Modeling

Status: public · Confidence: medium (0.8) · Basis: verified_sources

## TL;DR

Diffusion models generate data by learning how to reverse a noise-adding process. DDPMs made the denoising formulation influential, score-based SDEs gave a continuous-time view, and latent diffusion moved diffusion into a compressed latent space that made high-resolution conditioned generation more practical.

## Core Explanation

The core idea is to corrupt training examples with noise and train a model to move in the reverse direction. During generation, the model starts from noise and repeatedly denoises toward a sample. This framing is central to modern image and video generation systems, although video generation adds temporal-consistency and motion constraints beyond the image-only foundations covered here.
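The corrupt-then-reverse loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the linear variance schedule and noise-prediction objective used in the DDPM paper, and the function names (`make_schedule`, `add_noise`, `noise_prediction_loss`, `sample`) and the `predict_noise` callable standing in for a trained network are hypothetical.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; alpha_bar[t] is the cumulative product of (1 - beta)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Forward (noising) step in closed form: sample x_t given x_0."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def sample(predict_noise, shape, betas, alpha_bar, rng):
    """Reverse process: start from pure noise and denoise step by step.

    predict_noise(x, t) is a hypothetical stand-in for a trained network.
    """
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps_pred = predict_noise(x, t)
        # Posterior mean: remove the predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(1.0 - betas[t])
        # No fresh noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x
```

In training, a network would receive `xt` and `t` from `add_noise` and be optimized to make `noise_prediction_loss` small. In generation, that same network would be passed to `sample` as `predict_noise`.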

Stable Diffusion-style systems are commonly explained through latent diffusion: instead of running the denoising process directly in pixel space, the model works in a learned latent representation and can be conditioned by text through cross-attention.
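As a rough illustration of text conditioning through cross-attention, the sketch below lets latent tokens act as queries while text embeddings supply keys and values. It is a single-head toy version under assumed shapes, not the architecture of any particular released model. The function names and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical and randomly initialized here rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latents attend over text tokens."""
    q = latent_tokens @ w_q   # (num_latents, d)
    k = text_tokens @ w_k     # (num_text, d)
    v = text_tokens @ w_v     # (num_text, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_latents, num_text)
    return weights @ v, weights

# Toy usage with made-up sizes: 16 latent positions, 5 text tokens.
rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((16, 8))
text_tokens = rng.standard_normal((5, 12))
w_q = rng.standard_normal((8, 4))
w_k = rng.standard_normal((12, 4))
w_v = rng.standard_normal((12, 4))
out, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (16, 4) (16, 5)
```

Each row of `weights` is a distribution over text tokens, so every latent position receives its own mixture of the text information.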

## Source-Mapped Facts

- Denoising Diffusion Probabilistic Models train a model to reverse a gradual noising process. ([source](https://arxiv.org/abs/2006.11239))
- The DDPM paper connects diffusion probabilistic models with denoising score matching and Langevin dynamics. ([source](https://arxiv.org/abs/2006.11239))
- Score-Based Generative Modeling through Stochastic Differential Equations presents an SDE that transforms a data distribution to a prior by injecting noise and a reverse-time SDE that removes noise. ([source](https://arxiv.org/abs/2011.13456))
- High-Resolution Image Synthesis with Latent Diffusion Models performs diffusion in the latent space of pretrained autoencoders. ([source](https://arxiv.org/abs/2112.10752))
- The latent diffusion paper uses cross-attention layers to condition diffusion models on inputs such as text or bounding boxes. ([source](https://arxiv.org/abs/2112.10752))
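For readers who want the continuous-time statement behind the SDE bullet above, the forward and reverse-time equations are commonly written as follows. The symbols are notation assumed here for illustration: $f$ is the drift, $g$ the diffusion coefficient, $w$ and $\bar{w}$ are forward and reverse-time Wiener processes, and $p_t$ is the marginal density at time $t$.

$$
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w
$$

$$
\mathrm{d}x = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w}
$$

The reverse-time equation depends on the data only through the score $\nabla_x \log p_t(x)$, which is the quantity a score-based model is trained to approximate.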

## Why This Matters for Video Generation

Video generation systems build on the same denoising intuition but must also preserve motion, identity, scene consistency, and timing. For practitioners, the operational lesson is to separate the foundation from the product surface: diffusion explains the generative mechanism, while video quality depends on temporal architecture, conditioning, data, inference strategy, and evaluation.

## Further Reading

- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
- [Score-Based Generative Modeling through Stochastic Differential Equations](https://arxiv.org/abs/2011.13456)
- [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)

## Related Articles

- [Diffusion Models in Depth: From DDPM to Stable Diffusion](../latent-diffusion-models.md)
- [AI Art and Creativity: Generative Models and Authorship](../ai-art-and-creativity.md)
- [3D Human Modeling: Parametric Body Models, Mesh Recovery, and Digital Avatars](../3d-human-modeling.md)