Forward & Reverse Diffusion

The two Markov chains that define the generative process.

Hard · ~15 min read · lesson 2 of 6

Picture a film of a photograph being destroyed. Frame 0 is the clean image — a cat, say. Frame 1 adds a tiny sprinkle of Gaussian static. Frame 2 adds a little more. Frame 1000 is pure snow — no cat, no outline, just N(0, I). That reel is the forward process. It is a movie of a picture being crumpled into noise, one frame at a time, and nobody needed a neural network to shoot it.

Now hit rewind. Play the film backwards. Frame 1000 is static; frame 999 has a shadow of structure; by frame 0 there's a cat on screen. If you can build a projector that plays any film of this kind in reverse — one that watches a noisy frame and spits out the slightly-less-noisy frame that came before it — you've built a generator. Start from a fresh bucket of static, run the projector a thousand times, end on a cat that never existed. Stable Diffusion, DALL-E 3, Imagen, Midjourney — every single one is this same film, rewound.

This lesson covers both reels. The forward process is a fixed, un-learned recipe that destroys signal — the filming. The reverse process is a learned neural network that reconstructs it — the projector running backwards. The forward side is where all the math lives. The reverse side is where all the training lives. You'll see both, derive the one equation that makes training tractable, and write the loop in three flavours.

Diffusion (personified)
I don't generate images. I subtract noise. Give me Gaussian junk and a thousand tries and I'll hand you back something sharp. My trick is that I never learned the hard jump — noise to image in one shot. I learned the easy move, one small denoising step, and you get the hard jump for free by composing a thousand of me.

Let x_0 be a real data sample (a clean image — frame 0 of our movie). We define a forward Markov chain that slowly turns x_0 into noise, one frame at a time. Each step is a Gaussian with a slightly shrunk mean and a tiny injected variance:

the forward kernel — one step of noising
q(x_t | x_{t-1})  =  N( x_t ;  √(1 − β_t) · x_{t-1} ,   β_t · I )

β_t ∈ (0, 1)   — the variance schedule
T              — total number of steps (usually 1000)

Read that line carefully. The mean of x_t — frame t — is not just x_{t-1}. It's √(1 − β_t) · x_{t-1}, a slightly shrunk version. The signal fades. Meanwhile we inject fresh Gaussian noise with variance β_t. The two coefficients are tuned so that the total variance stays bounded as t → T and the distribution collapses onto N(0, I). No parameters are learned here. It's an algorithm, not a model — the camera that rolls the film, not the actor being filmed.
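One step of that kernel is a single line of code. A minimal sketch (the function name is mine, not from any library) that also checks the variance bookkeeping: if x_{t-1} has unit variance, then Var(x_t) = (1 − β)·1 + β = 1, which is exactly how the chain stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta, rng):
    """One application of q(x_t | x_{t-1}): shrink the signal, inject noise."""
    eps = rng.standard_normal(x_prev.shape)              # ε ~ N(0, I)
    return np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * eps

# Variance check: unit-variance in, unit-variance out.
x = rng.standard_normal(100_000)                         # stand-in batch, Var ≈ 1
x_next = forward_step(x, beta=0.02, rng=rng)
print(f"var before: {x.var():.3f}  var after: {x_next.var():.3f}")
```

The two coefficients are not independent knobs — √(1 − β) and √β are paired precisely so the shrink and the injection cancel in variance.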

And here's the beautiful trick. Doing this one step at a time is fine for intuition — you're watching the movie in order, frame by frame. For training, it's a disaster — you'd have to run the camera t times every time you wanted a sample at step t. Luckily, a chain of Gaussians is itself a Gaussian, and the composition has a closed form. In movie terms: you can skip straight to any frame without playing the reel from the start. Define:

closed-form sampling — jump straight to step t
α_t     =  1 − β_t
ᾱ_t     =  α_1 · α_2 · … · α_t          (cumulative product)

q(x_t | x_0)  =  N( x_t ;  √ᾱ_t · x_0 ,  (1 − ᾱ_t) · I )

Reparameterised:

x_t   =   √ᾱ_t · x_0   +   √(1 − ᾱ_t) · ε ,      ε ~ N(0, I)

That last line is the entire reason diffusion is trainable at scale. You don't iterate the Markov chain. You don't watch frames 0 through 499 just to reach frame 500. You sample a timestep t uniformly, sample a single ε, and mix x_0 with ε using the two coefficients from the schedule. One tensor op. Any frame you want, in closed form. This is the difference between a training loop that takes a week and one that takes a day.

forward process — ring dissolves into Gaussian noise
q(x_t | x_0) = N(√ᾱ_t · x_0, (1 − ᾱ_t) · I)
[interactive demo — scrub the timestep t to noise a ghost ring into static; readouts: β_t (linear schedule), ᾱ_t = Π(1 − β_s), signal scale √ᾱ_t, noise scale √(1 − ᾱ_t)]

Drag the timestep. You're scrubbing a frame slider along the movie. Watch the signal drain away and the variance grow. The top plot shows √ᾱ_t (signal weight) crashing to zero; the bottom shows 1 − ᾱ_t (noise variance) climbing to one. By t = T there is no x_0 left — the final frame is indistinguishable from a draw from N(0, I). That's the fixed recipe, and it's the same recipe for every image in the dataset.

Forward process (personified)
I am the destroyer, and I don't need to learn anything. Give me your image and a schedule and I will hand you back a sample from pure noise in closed form. I have no parameters, no gradients, no opinions. I exist so the reverse process has a training signal.

Now the interesting half — the projector. We want a distribution that reverses the chain: p_θ(x_{t-1} | x_t). Given a noisy frame at step t, produce the frame at step t−1. If β_t is small (we're only walking back one frame), a result of Feller from 1949 tells us this reverse kernel is also approximately Gaussian. So we parameterise it as one:

the reverse kernel — learned
p_θ(x_{t-1} | x_t)   =   N( x_{t-1} ;  μ_θ(x_t, t) ,   Σ_θ(x_t, t) )

μ_θ    — learned mean, output of a neural network (usually a UNet)
Σ_θ    — often fixed to β_t · I ; can also be learned

And here is why the reverse pass requires a neural network while the forward pass is free. The forward process has a recipe — shrink the signal, add calibrated Gaussian noise. You could do it in your sleep. The reverse process doesn't. Given a frame of static, there are many cleaner frames that could have produced it — a billion universes of photos that all noise down to the same smear. Running the projector backwards means learning which cleaner frame the data distribution actually prefers. That's a learned function of both the noisy image x_t and the timestep t. You cannot write the closed form. You train one.

In principle you could train the network to predict x_{t-1} directly. In practice, a reparameterisation gives you a much better loss landscape. Here's the trick. Recall x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε. Solve for x_0 in terms of x_t and ε, plug it into the Bayes-optimal reverse mean, and after a page of algebra (which Ho et al. cheerfully do for you in the 2020 paper), the optimal μ_θ becomes a clean function of an ε-prediction:

predicting the noise, not the clean image
μ_θ(x_t, t)   =   1/√α_t  ·  (  x_t   −   β_t/√(1 − ᾱ_t) · ε_θ(x_t, t)  )

ε_θ(x_t, t)   — the neural network's output:
                  "given x_t and t, which ε was added?"

So the network's job, concretely, is not to produce a cleaner image. Its job is to look at a noisy frame and predict which noise was added to it. Once you know ε, you know μ, you know how to step the projector backwards by one frame. Empirically this target is a much better regression problem than predicting the denoised image directly — the noise is mean-zero, unit-variance, with predictable scale at every t. The network doesn't have to learn magnitude, it just has to learn direction.
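In code, one reverse step is just that μ_θ equation plus a fresh draw of Gaussian noise. A minimal NumPy sketch — the function name is mine, and the zero ε-prediction in the usage line is a stand-in for illustration; a real sampler calls the trained network there:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(x_t, t, eps_pred, rng):
    """One reverse step x_t → x_{t-1}, given ε_θ's output (eps_pred).

    Implements μ_θ from the equation above; Σ is fixed to β_t · I.
    """
    mu = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mu                                        # final step: deterministic
    return mu + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 2))                        # a toy batch at some step t
x_prev = reverse_step(x_t, t=500, eps_pred=np.zeros((4, 2)), rng=rng)
```

Note the t == 0 branch: the very last step injects no noise, because there is no earlier frame left to be uncertain about.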

Given this parameterisation, the variational lower bound simplifies spectacularly. The authors drop every weighting term and the KL decompositions and arrive at one line:

the training loss that runs the industry
L_simple   =   E_{ t, x_0, ε } [  ‖ ε   −   ε_θ( √ᾱ_t · x_0 + √(1−ᾱ_t) · ε ,  t )  ‖²  ]

sample t ~ Uniform(1, T)
sample ε ~ N(0, I)
build   x_t  in closed form
regress ε_θ(x_t, t)  onto ε

That's it. That's the loss. It is mean squared error between two tensors of noise. No KL divergence at training time, no variational tricks, no annealing schedule on the loss. The whole training loop is a regression problem. If you have ever trained an image classifier, you already know how to train a diffusion model. The hard part is the network — the projector itself.

reverse process — Gaussian blob re-collapses into a ring
x_{t−1} = √ᾱ_{t−1} · x_0 + √(1 − ᾱ_{t−1}) · ε̂
[interactive demo — the reverse trajectory x_T → x_t → x_0 plays against the forward reference q(x_t | x_0); readouts: t, ᾱ_t, and the variance 1 − ᾱ_t]

Press play. This is the projector running backwards. You start at x_T — pure noise, the last frame of the movie — and step by step the learned denoising network peels noise off, frame by frame. Watch the variance shrink. At t = 0 what you have is a sample from the model's approximation of q(x_0). That is “generating an image.” There is no adversarial loss, no autoregressive decoding, no sequence of token picks — just one thousand small Gaussian steps, each parameterised by the same network. The forward pass built the reel. The reverse pass plays it backwards.

Reverse process (personified)
I am the projector. I take a pile of pure noise — the last frame of a movie you never saw — and I play it backwards, one small step at a time, until something that looks like the training distribution shows up on screen. My only job at each frame is to ask the denoiser which noise was injected and subtract a little of it. I do this a thousand times. You get an image.

Three implementations. The first is a pure-Python toy on a single scalar — no images, no tensors, just the math. The second is NumPy on a 2D spiral so you can actually see the denoiser un-destroy structured data. The third is the PyTorch training loop, minus the UNet itself.

layer 1 — pure python · diffusion_1d.py
python
import math, random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]   # linear schedule
alphas = [1 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)                                   # ᾱ_t

def forward_iterated(x0, t, seed=0):
    """Actually walk the Markov chain t times."""
    random.seed(seed)
    x = x0
    for k in range(t):
        eps = random.gauss(0.0, 1.0)                             # ε ~ N(0, I)
        x = math.sqrt(1 - betas[k]) * x + math.sqrt(betas[k]) * eps
    return x

def forward_closed(x0, t, seed=0):
    """Jump straight to step t with the closed-form equation."""
    random.seed(seed + 999)                                      # different ε, same dist
    eps = random.gauss(0.0, 1.0)
    ab = alpha_bars[t - 1]
    return math.sqrt(ab) * x0 + math.sqrt(1 - ab) * eps

x0 = 2.0
print(f"x_0 = {x0}")

# The two paths draw different ε, so individual samples differ —
# but they are draws from the same Gaussian. Compare statistics:
n = 5000
iterated = [forward_iterated(x0, 10, seed=s) for s in range(n)]
closed   = [forward_closed(x0, 10, seed=s) for s in range(n)]
for name, xs in [("iterated", iterated), ("closed  ", closed)]:
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    print(f"x_10 ({name}): mean {m:.2f}, std {sd:.2f}")
stdout
x_0 = 2.0
x_10 (iterated): mean 2.00, std 0.04
x_10 (closed  ): mean 2.00, std 0.04

Vectorise. Build a 2D spiral, noise a whole batch at once, train a tiny MLP to predict ε, and sample a denoised spiral back out. The closed-form step is now one line of broadcasting.

layer 2 — numpy · diffusion_spiral.py
python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)                                  # ᾱ_t, shape (T,)

def make_spiral(n=2000):
    t = np.sqrt(np.random.rand(n)) * 3 * np.pi
    return np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / 8.0

def q_sample(x0, t, eps):
    """Forward process in closed form. x0: (B, 2), t: (B,), eps: (B, 2)"""
    ab = alpha_bars[t][:, None]                                  # (B, 1) — broadcast over dims
    return np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

# ── training data generator (the whole loss, one batch at a time) ──
x0 = make_spiral()
batch_t  = np.random.randint(0, T, size=len(x0))                 # sample t ~ Uniform
batch_eps = np.random.randn(*x0.shape)                           # sample ε ~ N(0, I)
batch_xt  = q_sample(x0, batch_t, batch_eps)                     # build x_t

# A real training step would call eps_theta(batch_xt, batch_t) and regress onto batch_eps
# with mean squared error. That's it. That's the whole training signal.
eps_pred = batch_eps                                             # pretend ε_θ is perfect
loss = ((eps_pred - batch_eps) ** 2).mean()                      # L_simple — trivially zero here
print(f"placeholder MSE with perfect denoiser: {loss:.4f}")
pure python → numpy

for k in range(t): x = √(1−β_k)·x + √β_k·ε  ←→  q_sample(x0, t, eps)
    the Markov loop collapses into the reparameterised equation — one line, closed form

running *= a in a Python list  ←→  np.cumprod(alphas)
    ᾱ_t is a cumulative product — one vectorised call

one scalar eps per step  ←→  np.random.randn(*x0.shape)
    a full batch of ε for the whole training step

PyTorch is where you actually train this at scale. The UNet is out of scope for this lesson — we'll build one next — so here we stub it with an nn.Module placeholder and show the exact training loop every diffusion paper uses. When you've read one you've read them all.

layer 3 — pytorch · ddpm_train.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                            # linear schedule (DDPM)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)                        # ᾱ_t

class EpsilonModel(nn.Module):
    """Stand-in for a UNet. Takes x_t and t, outputs an estimate of ε."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, x_t, t):
        # Concatenate a normalised time embedding onto the input.
        t_emb = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x_t, t_emb], dim=-1))

model = EpsilonModel()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

# ── training loop — this is the whole algorithm ──
for step in range(201):
    x0 = sample_data_batch()                                     # your data loader
    t  = torch.randint(0, T, (x0.size(0),))                      # t ~ Uniform{0, …, T−1}
    eps = torch.randn_like(x0)                                   # sample ε ~ N(0, I)

    ab = alpha_bars[t].unsqueeze(-1)                             # (B, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                 # closed-form forward

    eps_pred = model(x_t, t)
    loss = F.mse_loss(eps_pred, eps)                             # L_simple — MSE on ε

    opt.zero_grad(); loss.backward(); opt.step()
    if step % 50 == 0:
        print(f"step {step:04d} | loss {loss.item():.4f}")
stdout
step 0000 | loss 0.9821
step 0050 | loss 0.4217
step 0100 | loss 0.2831
step 0150 | loss 0.1694
step 0200 | loss 0.1102
numpy → pytorch

np.random.randint(0, T, size=B)  ←→  torch.randint(0, T, (B,))
    same uniform draw of timesteps — GPU-native now

np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps  ←→  ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    the forward process — identical math, now tracked by autograd

loss = ((eps_pred - eps) ** 2).mean()  ←→  F.mse_loss(eps_pred, eps)
    L_simple — the entire training signal for diffusion
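The loop above trains but never generates. For completeness, here is a sketch of the stochastic sampler — the loop follows Algorithm 2 of Ho et al. (2020) with Σ fixed to β_t · I; `ddpm_sample` is my name for it, and `model` is assumed to be a trained ε-predictor (the EpsilonModel stub above fits shape-wise, though untrained it emits junk):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, T, betas, alphas, alpha_bars):
    """Start at pure noise x_T and denoise for T steps, one frame at a time."""
    x = torch.randn(shape)                                       # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)                             # ε_θ(x_t, t)
        mu = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        if t > 0:
            x = mu + betas[t].sqrt() * torch.randn_like(x)       # Σ = β_t · I
        else:
            x = mu                                               # final step: no noise
    return x
```

After training, `ddpm_sample(model, (64, 2), T, betas, alphas, alpha_bars)` hands back a (64, 2) batch drawn from the model's approximation of the data distribution — the projector run end to end.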

Gotchas

Off-by-one on β indexing: papers are inconsistent about whether t starts at 0 or 1 and whether ᾱ_t includes α_t or stops at α_{t-1}. Pick a convention, write it down, and check that ᾱ_T is tiny and ᾱ_0 is close to 1. Get this wrong and your samples are noise.
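That endpoint check is three asserts. A sketch — the thresholds here are my own rule of thumb, not canon:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                   # linear DDPM schedule
alpha_bars = np.cumprod(1.0 - betas)

assert np.all((betas > 0) & (betas < 1)), "every β_t must stay in (0, 1)"
assert alpha_bars[0] > 0.99,  "ᾱ at the first step should be ≈ 1 (barely any noise)"
assert alpha_bars[-1] < 1e-3, "ᾱ_T should be ≈ 0 (x_T is pure noise)"
print(f"schedule sane: ᾱ_1 = {alpha_bars[0]:.5f}, ᾱ_T = {alpha_bars[-1]:.2e}")
```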

Forgetting to pass t to the model: the denoiser has to know how noisy its input is — the same x_t needs different treatment at t = 10 vs t = 900. The projector can't rewind a frame without knowing which frame it is. Time is a conditioning signal, not an afterthought. A diffusion model without a time embedding is a diffusion model that will not train.

Out-of-range β values break the closed form: if you ever set β_t ≥ 1 (so α_t ≤ 0), the cumulative product ᾱ_t stops being a valid signal weight and the reparameterisation collapses. Keep every β_t in (0, 1); standard schedules also make β_t monotonically non-decreasing, so ᾱ_t decays smoothly toward zero.

Sampling fewer steps than you trained on: with the vanilla DDPM sampler you can't simply skip frames — it assumes the trained β schedule end-to-end. If you want 50-step sampling from a 1000-step model, you need DDIM (deterministic), a properly respaced schedule, or one of the newer distilled samplers. Silently truncating DDPM gives you blurry garbage.

Verify the closed-form equation on MNIST

Load a single MNIST digit as x_0 (shape (1, 28, 28)). Build the linear β-schedule for T = 1000. Compute α_bars as a cumulative product.

For a fixed t = 500 and a fixed seed: (1) iterate the forward Markov chain 500 times, drawing a fresh ε at each step, to get x_500_iter. (2) Use the closed-form equation x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε to jump straight to frame 500 and produce x_500_closed.

Plot both side-by-side, plus x_0 and the iso-variance Gaussian N(0, (1−ᾱ_500) · I). Statistically the iterated and closed-form versions should be indistinguishable — they're two draws from the same Gaussian. That's the equation that makes diffusion trainable.

Bonus: grid-plot x_t for t ∈ {0, 100, 200, …, 1000} and watch the digit dissolve. You now have the forward half of Stable Diffusion — the full film of an image being shredded, stored as a row of thumbnails.
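If you want scaffolding before touching real MNIST, here's the comparison on a random stand-in "digit" — `rng.random((28, 28))` instead of a torchvision tensor, and no plotting; swap in an actual MNIST image as x_0 and add the side-by-side plots when you do the exercise properly:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.random((28, 28))                            # stand-in for an MNIST digit

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

t = 500
# (1) walk the forward Markov chain 500 times, fresh ε each step
x = x0.copy()
for k in range(t):
    x = np.sqrt(1 - betas[k]) * x + np.sqrt(betas[k]) * rng.standard_normal(x.shape)

# (2) jump straight to frame 500 with the closed form
ab = alpha_bars[t - 1]
x_closed = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * rng.standard_normal(x0.shape)

# Different ε draws ⇒ different pixels, but the statistics must agree.
for name, xt in [("iterated", x), ("closed  ", x_closed)]:
    print(f"{name}: mean {xt.mean():+.3f}  std {xt.std():.3f}")
```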

What to carry forward. Diffusion is the film, rewound. A fixed Gaussian noising chain with a closed-form jump to any frame, glued to a learned Gaussian denoising chain trained by a single regression loss on predicted noise. The forward pass has no parameters and doesn't need to be understood at training time — it's one equation, the camera rolling. The reverse pass is where the network lives, and that network is just predicting ε by MSE — the projector running the reel backwards one frame at a time. Everything else — text-conditioning, classifier-free guidance, latent diffusion, distillation — bolts onto this skeleton.

Next up — U-Net Architecture. We've written every line of the training loop except the one that matters: eps_theta(x_t, t). That's the projector itself, and in diffusion it's almost always a UNet — an encoder-decoder with skip connections, time embeddings, attention blocks, and a very specific pattern of where noise gets peeled off at each resolution. Next lesson we build one from scratch and plug it into the loop you just read.
