DDPM from Scratch

Train a tiny diffusion model on MNIST, end to end.

Hard · ~15 min read · lesson 4 of 6

You have been collecting ingredients. Three lessons of pantry-stocking. Denoising gave you the raw idea — a staircase of Gaussian corruption, run in reverse. The forward process gave you the closed-form jump — the marinade, timed by a variance schedule, that takes a clean image and turns it into a specific flavor of noise. The UNet gave you the stove — an hourglass of convolutions that eats a noisy image and a timestep and spits out the noise it thinks is hiding inside.

This lesson is the whole kitchen, running at once. One recipe. Every ingredient lined up on the counter, every tool plugged in, the timer set. A DDPM — denoising diffusion probabilistic model — is what you cook when you stop treating those pieces as separate dishes and finally assemble them in the order the recipe demands. Training is a loop. Generation is a loop. Both loops are short, and the surprise — the reason every modern image model (Stable Diffusion, Imagen, DALL·E 3) is a diffusion model in disguise — is that these short loops are enough to turn a bin of pure noise into a recognizable handwritten digit. You will cook one here, end to end, and something will come out of the oven that you can actually taste.

  ┌──────────────────────── training ────────────────────────┐
  │                                                          │
  │   x₀  ── sample batch ─┐                                  │
  │                        │                                  │
  │   t   ∼ U(1, T)  ──────┤                                  │
  │   ε   ∼ N(0, I)  ──────┤                                  │
  │                        ▼                                  │
  │        x_t = √ᾱ_t · x₀  +  √(1-ᾱ_t) · ε                    │
  │                        │                                  │
  │                        ▼                                  │
  │              U-Net(x_t, t)  →  ε̂                          │
  │                        │                                  │
  │                        ▼                                  │
  │          loss = ‖ε - ε̂‖²   →  backward → step             │
  │                                                          │
  └──────────────────────────────────────────────────────────┘

  ┌─────────────────────── generation ───────────────────────┐
  │                                                          │
  │     x_T ∼ N(0, I)   ← pure noise, no image yet           │
  │                                                          │
  │     for t = T, T-1, … , 1:                               │
  │         ε̂ = U-Net(x_t, t)                                │
  │         x_{t-1} = mean(x_t, ε̂)  +  σ_t · z               │
  │                                                          │
  │     return x_0   ← a digit                              │
  │                                                          │
  └──────────────────────────────────────────────────────────┘
the whole recipe, end to end — the order is the dish

Two assembly tricks make this cookbook actually cook. First, the closed-form jump x_t = √ᾱ_t · x₀ + √(1-ᾱ_t) · ε means we never simulate a chain of T corruption steps during training — we jump to a random timestep in one matmul and train on that single pair. It is the recipe's “marinate overnight in one shot” cheat. Second, the regression target is the noise we added, not the clean image. The UNet learns a noise prediction function, and everything else — the sampler, the variance, the mean — falls out of rearranging that noise estimate through the forward-process algebra. One ingredient predicted; everything else derived.
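The jump is exact, not an approximation: iterating the chain t times lands on exactly the distribution N(√ᾱ_t·x₀, (1-ᾱ_t)·I). A standalone numpy sanity check (the values here are arbitrary, not from the lesson's code):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = np.cumprod(alphas)

x0 = 1.7               # a single scalar "pixel", value arbitrary
t = 120                # jump target

# iterated chain: apply t sequential corruption steps to many copies of x0
n = 200_000
x = np.full(n, x0)
for s in range(t):
    x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(n)

# closed-form jump: mean and std straight from the cumulative product
mean_jump = np.sqrt(abar[t - 1]) * x0
std_jump = np.sqrt(1 - abar[t - 1])

assert abs(x.mean() - mean_jump) < 0.01    # same mean
assert abs(x.std() - std_jump) < 0.01      # same spread
```

The empirical mean and spread of the simulated chain match the closed-form coefficients to Monte Carlo precision, which is the entire "marinate overnight in one shot" cheat.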

training objective — the entire loss, one line
β_1, β_2, … , β_T  ∈ (0, 1)                      # variance schedule (linear: 1e-4 → 0.02)
α_t     = 1 - β_t
ᾱ_t     = ∏_{s=1..t} α_s                         # cumulative product, precompute once

x_0     ∼  p_data                                # sample a clean image
t       ∼  U{1, …, T}                            # sample a timestep
ε       ∼  N(0, I)                               # sample target noise

x_t     =  √ᾱ_t · x_0  +  √(1 - ᾱ_t) · ε         # forward jump, no simulation
ε̂       =  u_net(x_t, t; θ)                      # U-Net prediction

L_simple(θ)  =  E_{x_0, t, ε} ‖ ε - ε̂ ‖²          # just MSE on the noise

That is it. One expectation, one MSE. Ho et al. 2020 showed this “simple” loss is a reweighted variational lower bound, and that the reweighting is actually better for sample quality than the theoretically correct weights. So the loss nearly every diffusion paper since 2020 builds on is three lines of PyTorch. The hard work — the ingredient prep — was the UNet and the schedule. You already did it. The thermometer is cheap.
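To pin down the "three lines" claim, here is a minimal self-contained sketch: the lambda is a stand-in for a real U-Net, and the random batch stands in for MNIST (both are placeholders, not the lesson's model).

```python
import torch
import torch.nn.functional as F

T = 1000
abar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)  # ᾱ_t, precomputed
model = lambda x, t: torch.zeros_like(x)      # stand-in for the U-Net

x0  = torch.randn(8, 1, 28, 28)               # stand-in batch of clean images
t   = torch.randint(0, T, (8,))               # sample a timestep per image
eps = torch.randn_like(x0)                    # sample the target noise
a   = abar[t].view(-1, 1, 1, 1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps    # forward jump
loss = F.mse_loss(model(x_t, t), eps)         # L_simple
```

With a real network, `loss.backward()` and an optimizer step complete one training iteration.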

one DDPM training iteration — end-to-end
[interactive widget — T = 100, one SGD step, frame by frame: 1. x₀ (clean sample) + 2. ε ∼ N(0, I) → 3. x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε → 4. ε̂ = U-Net(x_t, t) → loss = ‖ε̂ − ε‖². Shown state: t = 18, ᾱ_t = 0.964, loss = 0.463.]
forward fixes what you feed the model · backward is one SGD step on ‖ε̂ − ε‖² · repeat millions of times

Watch what the kitchen is actually doing. Each step: grab a digit from the pantry, pick a random t, add exactly the amount of noise the schedule says corresponds to that t, ask the UNet to recover the noise pattern, take the squared error, backprop. The network never sees a “full chain” of T=1000 denoisings during training — it sees thousands of independent one-shot noise-prediction problems, and the chain only assembles itself at generation time. That decoupling is what makes DDPM training tractable, and it is the single most important thing to notice about the recipe: the UNet is not learning a sequence, it is learning a one-ingredient skill that the sampler reuses a thousand times in a row.

The loss curve looks boring. It drops fast in the first few hundred steps (the network learns “output something that averages to zero”) and then flattens into a slow, noisy decline for the rest of the run. Do not be fooled: sample quality keeps improving long after the loss curve has visually plateaued, because what is improving is the network's accuracy on the hard timesteps — the mid-range t where the image is mostly-but-not-quite noise and the signal is subtle.

Noise prediction (personified)
I am the regression target, and I am deliberately boring. I am not a clean digit. I am not even close to one. I am a fresh Gaussian sample the trainer rolled right before adding me to x_0. The UNet's job is to look at the contaminated image and guess me back — not the digit underneath, just me. Why? Once the UNet can do that, rearranging x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε gives you the digit for free. I am the supervision signal that makes everything else closed-form.
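That rearrangement really is free. A throwaway numpy check, with ᾱ_t fixed at an arbitrary 0.3 for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
abar_t = 0.3                            # any ᾱ_t in (0, 1); arbitrary choice
x0 = rng.standard_normal((28, 28))
eps = rng.standard_normal((28, 28))
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps

# invert the forward jump: with a perfect ε̂, x_0 comes back exactly
x0_hat = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
assert np.allclose(x0_hat, x0)
```

A trained UNet's ε̂ is imperfect, so the recovered x̂₀ is an estimate, but the algebra is the same.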

sampling — the reverse process, step by step
x_T  ∼  N(0, I)                                  # start from pure noise

for t = T, T-1, … , 1:
    ε̂        =  u_net(x_t, t; θ)                  # predict the noise that is "in" x_t

    μ_θ(x_t, t)  =  1/√α_t · ( x_t  -  β_t / √(1-ᾱ_t) · ε̂ )
    #                    ↑ subtract a scaled noise estimate; scale back up by 1/√α_t

    σ_t²         =  β_t                            # DDPM variance (β̃_t also common)

    z  ∼  N(0, I)  if t > 1  else  0
    x_{t-1}  =  μ_θ(x_t, t)  +  σ_t · z           # one denoising step

return x_0                                        # a sample from the model

Plating. The reverse walk is where the meal leaves the kitchen. At each step we ask the UNet “how much noise is in this?”, subtract a scaled estimate, rescale, and add a dollop of fresh noise for stochasticity. The mean formula looks mysterious but is just algebra: it is the posterior mean of x_{t-1} given x_t and our best guess of x_0, worked out under the Gaussian forward model. You memorize the formula; the paper derives it in two pages of completing-the-square. Every ingredient from the pantry earns its seat right here — the schedule sets β_t and ᾱ_t, the UNet delivers ε̂, the algebra folds them into one mean, and the loop repeats until the dish is done.
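You can skip the two pages of completing-the-square and still convince yourself the algebra holds: evaluate both forms of the mean numerically and check they agree (scalars for simplicity; a standalone check, not part of the lesson's code).

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = np.cumprod(alphas)

t = 500                                   # any interior timestep
x0, eps = rng.standard_normal(2)
x_t = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps

# (a) posterior mean of x_{t-1} given x_t and x_0 (the completing-the-square result)
mu_post = (np.sqrt(abar[t - 1]) * betas[t] * x0
           + np.sqrt(alphas[t]) * (1 - abar[t - 1]) * x_t) / (1 - abar[t])

# (b) the sampler's form, written in terms of ε (a perfect ε̂ here)
mu_eps = (x_t - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])

assert np.isclose(mu_post, mu_eps)        # same number, two spellings
```

Substituting x̂₀ = (x_t − √(1−ᾱ_t)·ε̂)/√ᾱ_t into form (a) and simplifying gives form (b), which is why the sampler only ever needs ε̂.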

DDPM sampling — noise re-condenses into an image
[interactive widget — x_T → x_0 over 50 reverse steps, previewed as 10 frames along the trajectory (t = 0 … 50). x_T is pure noise, x_0 a clean sample; each reverse step is x_{t−1} ← denoise(x_t). Shown state: step 50, var 0.397.]

The widget starts from a square of Gaussian static — as informative as TV snow — and runs T=1000 reverse steps. For the first ~700 steps nothing you can see happens. Around step 300 a blurry low-frequency blob emerges. By step 100 you can guess the digit. By step 0 the strokes have sharp edges and a plausible thickness. The reverse chain is a coarse-to-fine generator: big shapes resolve first, textures last. This multi-scale refinement is a large part of why diffusion models beat single-shot generators such as GANs and VAEs on mode coverage and sample diversity: each timestep is a fresh opportunity for the model to commit to a different global structure.

Sample trajectory (personified)
I am the path a generated image takes through noise levels. I start at x_T, pure static, and end at x_0, a clean digit. You can plot me as 1000 thumbnails. Early thumbnails are indistinguishable from each other — all noise. The middle thumbnails are where the image “decides” what digit to be; this is where different random seeds diverge. Late thumbnails just sharpen what the middle already committed to. If you want a sample to follow a specific trajectory, you intervene in the middle. That is the doorway to classifier-free guidance, img2img, inpainting, and every other controllable diffusion trick.

Three layers of code, one recipe at three scales. We start with a 2-D toy — diffusion on a Swiss roll, where the “stove” is a tiny MLP and you can plot the whole thing in matplotlib — then graduate to a real PyTorch implementation that trains on MNIST with a small UNet, and finally we show how the same dish takes about fifteen lines in Hugging Face's diffusers library with a pretrained backbone. Same cookbook, three kitchens.

layer 1 — numpy · ddpm_swissroll.py (2-D toy, MLP noise predictor)
python
import numpy as np
from sklearn.datasets import make_swiss_roll

rng = np.random.default_rng(0)

# ---- data: 2-D swiss roll, standardized ----
X, _ = make_swiss_roll(n_samples=10_000, noise=0.5)
X = X[:, [0, 2]] / 10.0                            # drop the y-axis; 2-D points

# ---- schedule ----
T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = np.cumprod(alphas)                          # ᾱ_t

# ---- tiny MLP noise predictor:  (x_t, t) -> ε̂ ----
def init(n_in, n_out):
    return rng.normal(0, np.sqrt(2 / n_in), (n_in, n_out))
W1, W2, W3 = init(2 + 16, 128), init(128, 128), init(128, 2)
b1, b2, b3 = np.zeros(128), np.zeros(128), np.zeros(2)

def time_embed(t):                                 # sinusoidal, 16-D
    freqs = 10 ** np.linspace(0, 2, 8)
    ang = t[:, None] / freqs
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def forward(x, t):
    h = np.concatenate([x, time_embed(t)], axis=-1)
    h = np.maximum(0, h @ W1 + b1)
    h = np.maximum(0, h @ W2 + b2)
    return h @ W3 + b3

# ---- training loop (sketch; full backprop omitted for brevity) ----
LR, BATCH = 1e-3, 256
for step in range(2001):
    idx = rng.integers(0, len(X), BATCH)
    x0 = X[idx]
    t  = rng.integers(0, T, BATCH)                 # cover every index the sampler will use, incl. t=0
    eps = rng.standard_normal(x0.shape)
    xt  = np.sqrt(abar[t])[:, None] * x0 + np.sqrt(1 - abar[t])[:, None] * eps
    eps_hat = forward(xt, t.astype(float))
    loss = ((eps - eps_hat) ** 2).mean()
    # ... manual backprop + SGD on W1..W3, b1..b3 ...
    if step % 500 == 0:
        print(f"step {step:4d}  loss={loss:.4f}")

# ---- generation: reverse walk ----
x = rng.standard_normal((2000, 2))
for t in range(T - 1, -1, -1):
    eps_hat = forward(x, np.full(len(x), t, dtype=float))
    mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps_hat) / np.sqrt(alphas[t])
    x = mean + (np.sqrt(betas[t]) * rng.standard_normal(x.shape) if t > 0 else 0)
# 'x' now resembles the swiss-roll distribution
stdout
step    0  loss=1.0312
step  500  loss=0.2178
step 1000  loss=0.1034
step 1500  loss=0.0721
step 2000  loss=0.0613
generated 2000 samples; visually matches the swiss-roll density.
layer 2 — pytorch · ddpm_mnist.py (tiny U-Net, full DDPM on MNIST)
python
import torch, torch.nn as nn, torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from copy import deepcopy

device = 'cuda' if torch.cuda.is_available() else 'cpu'
T = 1000
betas = torch.linspace(1e-4, 0.02, T, device=device)
alphas = 1 - betas
abar   = torch.cumprod(alphas, dim=0)                 # ᾱ_t

def q_sample(x0, t, eps):
    """forward: jump straight to x_t."""
    a = abar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

class TinyUNet(nn.Module):                            # the one you built last lesson
    # ... conv-down-conv-up with skip connections and sinusoidal time embedding ...
    def forward(self, x, t): ...

model     = TinyUNet().to(device)
ema_model = deepcopy(model).eval()                    # running average of weights
for p in ema_model.parameters(): p.requires_grad_(False)
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
loader = DataLoader(datasets.MNIST('.', train=True, download=True, transform=transform),
                    batch_size=64, shuffle=True)

def ema_update(ema, online, decay=0.999):
    for pe, po in zip(ema.parameters(), online.parameters()):
        pe.data.mul_(decay).add_(po.data, alpha=1 - decay)

for epoch in range(1, 41):
    for xb, _ in loader:
        xb = xb.to(device)                             # [B, 1, 28, 28], in [-1, 1]
        t  = torch.randint(0, T, (xb.size(0),), device=device)
        eps = torch.randn_like(xb)
        x_t = q_sample(xb, t, eps)
        eps_hat = model(x_t, t)
        loss = F.mse_loss(eps_hat, eps)
        opt.zero_grad(); loss.backward(); opt.step()
        ema_update(ema_model, model)

# ---- generation (use EMA weights) ----
@torch.no_grad()
def sample(n=64):
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in reversed(range(T)):
        tb = torch.full((n,), t, device=device, dtype=torch.long)
        eps_hat = ema_model(x, tb)
        a_t, b_t, ab_t = alphas[t], betas[t], abar[t]
        mean = (x - b_t / (1 - ab_t).sqrt() * eps_hat) / a_t.sqrt()
        x = mean + (b_t.sqrt() * torch.randn_like(x) if t > 0 else 0)
    return x.clamp(-1, 1)
stdout
epoch  1  step  937  loss=0.1481
epoch  5  step 4685  loss=0.0654
epoch 10  step 9370  loss=0.0498
epoch 20  step 18740 loss=0.0421
epoch 40  step 37480 loss=0.0388
saved samples/ema_epoch_40.png  (8x8 grid of generated digits)
numpy toy → pytorch mnist
- MLP(2+16 → 128 → 128 → 2) ←→ U-Net with conv down/up + skip conns
  (spatial structure demands convolutions; MLPs do not scale to images)
- manual sinusoidal time embedding ←→ same embedding, fed into every ResBlock
  (the U-Net uses t at every scale, not just the input)
- no EMA, greedy sampling ←→ EMA of weights + EMA-only sampling
  (single biggest sample-quality improvement per line of code)
- abar precomputed in numpy ←→ abar as a registered GPU tensor
  (the cumulative product is computed once, then indexed everywhere)
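Why the EMA row earns its billing: SGD iterates jitter around the optimum from minibatch noise, and the running average cancels most of it. A one-dimensional toy of the same ema_update rule (nothing here is the lesson's model; the noise level is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
decay, target = 0.999, 1.0
ema = 0.0
for step in range(5000):
    online = target + 0.1 * rng.standard_normal()   # noisy "trained weight"
    ema = decay * ema + (1 - decay) * online        # same rule as ema_update

# the average sits far closer to the clean optimum than a typical single iterate
assert abs(ema - target) < 0.05
```

Sampling from the EMA weights is the image-space version of the same effect: the averaged network sits near the center of the optimizer's noisy orbit.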

layer 3 — diffusers · ddpm_pretrained.py
python
from diffusers import DDPMPipeline
import torch

# A pretrained DDPM that someone else spent a week training on A100s.
pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
pipe = pipe.to("cuda")

# One line replaces the entire reverse-process loop above.
images = pipe(num_inference_steps=1000, batch_size=4).images
for i, img in enumerate(images):
    img.save(f"out/sample_{i}.png")
stdout
Downloaded google/ddpm-celebahq-256  (holy smokes, 339 MB)
Running 1000 reverse steps on cuda:0 ...
done.  Saved 4 samples to ./out/  (256x256, unmistakably faces)
pytorch scratch → diffusers
- hand-coded schedule, q_sample, sampler ←→ DDPMScheduler handles both
  (one object, unified API for DDPM / DDIM / PNDM)
- TinyUNet class with conditioning ←→ UNet2DModel / UNet2DConditionModel
  (production-grade blocks, attention, class conditioning, FP16)
- manual EMA update loop ←→ EMAModel wrapper, one call per step
  (also does FP32 shadow, decay warmup)
- hand-rolled training loop ←→ accelerate + diffusers/examples/unconditional_image_generation
  (multi-GPU, mixed precision, checkpointing, wandb — for free)

Gotchas

Predicting x₀ instead of ε when the schedule expects ε. All of these parameterizations are valid (x₀-prediction, ε-prediction, and v-prediction from Salimans & Ho's progressive-distillation paper all work) but the sampler has to match the training target. A UNet trained with ε-loss plugged into an x₀-sampler produces elegant noise. The number you subtract in the mean formula is scaled differently in each case; eyeball the √(1-ᾱ_t) term — that is the ε-prediction coefficient.

Forgetting EMA at sample time. Training with EMA is pointless if you sample from the online model. Keep two state dicts around, and only save the EMA one for inference.

Using a different T at inference than at training. The ᾱ schedule is indexed by integer t ∈ [0, T). If you trained with T=1000 and try to sample with T=100 by just dropping indices, you get nonsense — the noise levels don't line up. To sample faster you need a principled subsampler like DDIM, which picks a subset of timesteps and reweights the variance so the reverse process still ends at x_0.
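The indexing point is easy to see numerically. What DDIM-style samplers do is reuse a strided subset of the trained ᾱ values; the snippet below shows only that indexing idea, not a full DDIM sampler:

```python
import numpy as np

T = 1000
abar = np.cumprod(1 - np.linspace(1e-4, 0.02, T))   # the trained schedule

# right: keep ᾱ at a strided subset of the original timesteps
taus = np.linspace(0, T - 1, 100, dtype=int)
abar_sub = abar[taus]          # ends near ~4e-5, the trained terminal noise level

# wrong: a fresh T=100 linear schedule gives very different noise levels
abar_new = np.cumprod(1 - np.linspace(1e-4, 0.02, 100))   # ends near ~0.36

assert abar_sub[-1] < 1e-3 and abar_new[-1] > 0.3
```

With the fresh schedule, the model would be handed inputs far cleaner than any x_T it saw in training, which is why naive step-dropping produces garbage.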

Bad initialization on the convolutional weights. The output layer of the UNet predicts noise with mean ~0 and std ~1. If the last convolution is initialized with a big gain, early predictions overshoot, MSE blows up to 100+, and the optimizer wastes its first few thousand steps just pulling everything back to zero. Initialize the last conv to zero (weight = 0, bias = 0) — a standard trick in diffusion codebases, e.g. the zero_module helper in OpenAI's guided-diffusion — or use Kaiming init with a small gain like 0.1.

Normalizing inputs to [0, 1]. Diffusion expects data centered at 0, so normalize MNIST pixels to [-1, 1] (i.e. (x - 0.5) / 0.5). If you leave the data in [0, 1], the forward process ends at noise centered on 0.5 instead of 0, so the x_T you start generation from — a true N(0, I) draw — is out of distribution for the model, and sample quality suffers for no visible reason.

Train a DDPM on Fashion-MNIST, compute FID, vary T

Swap datasets.MNIST for datasets.FashionMNIST in the layer-2 code (same 28×28 grayscale, different label set: shirts, bags, sneakers) and retrain the same UNet for 40 epochs. Hold out 10% of the training set to compute FID — grab a pretrained Inception-v3, extract 2048-D pool features for 5000 generated samples and 5000 held-out real samples, fit Gaussians, compute the Fréchet distance.

Then, with the exact same trained weights, sample once with T=1000 and once with a DDIM sampler using T=100. Record FID for each. Questions: how much sample quality did you lose by sampling 10× faster? Was it worse for clothes than it would be for digits — why? (Hint: think about where fine-scale detail lives in a shirt vs. a “3”.)
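The Fréchet distance itself is only a few lines once you have the two feature Gaussians. A sketch (extracting the 2048-D Inception-v3 features is assumed to have happened already; the function name here is mine):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """FID between N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 * sqrtm(cov1 @ cov2))."""
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):          # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * covmean))

# sanity checks on toy Gaussians
assert frechet_distance(np.zeros(4), np.eye(4), np.zeros(4), np.eye(4)) < 1e-8
assert abs(frechet_distance(np.zeros(4), np.eye(4), np.ones(4), np.eye(4)) - 4.0) < 1e-8
```

Fit the inputs with `np.mean(feats, axis=0)` and `np.cov(feats, rowvar=False)` on your 5000 generated and 5000 held-out feature vectors.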

What to carry forward. A diffusion model is a complete dish — noise schedule, UNet, MSE loss, sampling loop — cooked in the one order the cookbook allows. Training is three lines. Sampling is five. Every ingredient from the earlier lessons finally earned its seat on the counter: the schedule sets the marinade, the UNet is the stove, the MSE thermometer calls the finish, the sampler handles the plating. Everything else you will meet — guidance, inpainting, latent diffusion, video diffusion, flow matching — is a variation on this assembly, with a different ingredient swapped in. The kitchen you just used is the same one, at the core, that cooks Midjourney and Stable Diffusion outputs. They just have bigger UNets, more data, and a text encoder bolted on.

Next up — Classifier-Free Guidance. Right now your DDPM cooks any digit — a 3, a 7, whatever the random noise rolls. The next lesson is how to tell the recipe “give me a 7” without training a separate classifier to taste-test every dish. The trick is to train the UNet jointly on conditioned and unconditioned examples, then at sample time extrapolate away from the unconditional prediction toward the conditional one. It is the single most important post-DDPM paper, and it is the ingredient that turns this cookbook into the one behind every text-to-image model you have heard of.
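As a one-function preview of that extrapolation (the function name and defaults are mine, not the paper's):

```python
def cfg_combine(eps_uncond, eps_cond, w=3.0):
    """Classifier-free guidance: push the noise prediction past the conditional
    one, away from the unconditional one. w=1 is plain conditional sampling;
    w=0 ignores the condition; w>1 trades diversity for prompt adherence."""
    return eps_uncond + w * (eps_cond - eps_uncond)

assert cfg_combine(0.0, 1.0, w=1.0) == 1.0   # w=1 recovers the conditional ε̂
assert cfg_combine(0.0, 1.0, w=0.0) == 0.0   # w=0 recovers the unconditional ε̂
```

At sample time this combined ε̂ simply replaces the plain one inside the reverse loop you already have.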
