Denoising Intuition

Learn to reverse a staircase of Gaussian noise.

Easy
~15 min read
lesson 1 of 6

Michelangelo, asked how he carved the David out of a block of Carrara marble, is supposed to have said: the statue is already in the stone; I just remove what isn't David. Whether he actually said it or some biographer wrote it in for him is beside the point — the image is perfect. A sculptor stares at a featureless block, and chip by chip, removes everything that isn't the figure. The statue doesn't get built. It gets uncovered.

Keep that image. Now swap the marble for a television tuned to a dead channel — a full screen of grey, hissing static, the kind of snow that used to fill CRTs at 3 a.m. between broadcasts. And swap the sculptor's chisel for a neural network. That's a diffusion model. It stares at a rectangle of pure static, decides what in the static isn't part of the final image, chisels a little of it away, and repeats. Do this a thousand times and a photograph emerges — a cat, a castle, your cousin Linda — out of what started as nothing but noise.

The generative modeling problem, stated plainly: somewhere out there is a distribution p(x) — the set of all photographs a camera could plausibly take, the set of all English sentences a person might write, the set of all protein folds that don't collapse on themselves. You have a finite pile of samples from it. You want a machine that, on demand, produces a new sample that looks like it came from the same distribution but isn't a copy of anything in your pile. A new face, a new paragraph, a new molecule.

This is a monstrously hard ask. The space of possible images at 256 × 256 × 3 resolution has 256¹⁹⁶⁶⁰⁸ points in it, and the tiny sliver that looks like a real photo is, in relative terms, essentially zero volume. You are asking a model to find the needle while the haystack is most of the observable universe.
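That count is worth a quick sanity check. A 256 × 256 RGB image is 196,608 eight-bit values, so the space holds 256^196,608 distinct images — far too large a number to print, but small enough to count digits for (a throwaway back-of-envelope, not part of the lesson's code):

```python
import math

# 256 × 256 RGB, one byte per channel: 196,608 eight-bit values per image.
n_values = 256 * 256 * 3
print(n_values)                          # 196608

# 256**n_values is far too large to print; count its decimal digits instead.
digits = int(n_values * math.log10(256)) + 1
print(f"{digits:,} digits")              # 473,480 digits
```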

Diffusion models solve this with the sculptor's move. Take a clean image. Add a little Gaussian noise. Add a little more. Keep adding for a thousand steps until the image is indistinguishable from the static on that dead channel. Now train a neural network to undo one step of that process — one chisel stroke, no more. That's it. At generation time you start with a fresh screen of static and run the chisel a thousand times. Out the other end falls a statue.

Diffusion (personified)
I do not compress. I do not compete. I take a path you can see — clean image to pure static, one chisel stroke at a time — and I teach a network to walk it backward. Every single stroke is easy. The whole marathon is what looks like magic.
     forward  (fixed, no learning)            reverse  (learned chisel)
     ─────────────────────────────>            ─────────────────────────────>
      x₀ ─► x₁ ─► x₂ ─► ··· ─► x_T              x_T ─► ··· ─► x₂ ─► x₁ ─► x₀
     clean    +ε    +ε         pure            static      chisel          clean
     image          noise      static          (marble)                    image
                                                                        (the statue)

                   (entropy ↑)                             (entropy ↓, learned)
the two processes — one buries the statue, one uncovers it

Here is the forward process, visually. One image, ten rungs, each rung a little noisier than the last — the sculptor's process in reverse, the statue being slowly buried under chips of marble until only the block remains. Slide left to right and you watch a cat dissolve into static. Slide right to left — the arrow labelled denoising — and you watch the path the chisel is trying to learn. Forward buries the statue. Reverse uncovers it.

[interactive figure: forward diffusion staircase — clean to pure noise in 10 steps. A slider walks t from 0 to 9, with βₜ rising from 0.04 to 0.55, applying xₜ = √(1−βₜ)·xₜ₋₁ + √(βₜ)·ε at each rung; live readouts show βₜ, ᾱₜ, and the signal-to-noise ratio. Left end: clean, signal-dominated. Right end: pure noise, variance ≈ 1.]

The key insight: each rung of this staircase differs from its neighbour by only a tiny amount of noise. Predicting xₜ₋₁ from xₜ is not a mystical leap — it's a near-identity map with a small correction. The far ends look nothing alike (clean cat vs hissing static), but any two adjacent rungs look nearly identical. You're not asking the chisel to carve the David out of a raw block in one swing. You're asking it to take a David that's 99.9% finished and flick off one last sliver of marble.
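The near-identity claim is easy to check numerically: one forward step with noise level β leaves adjacent rungs correlated at √(1−β), about 0.99 even at the noisiest end of the usual schedule (a quick sketch, not the lesson's own code):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02                        # the largest β in the standard linear schedule

# One forward step applied to 100k unit-variance scalars.
x_prev = rng.standard_normal(100_000)
x_next = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * rng.standard_normal(100_000)

corr = np.corrcoef(x_prev, x_next)[0, 1]
print(round(corr, 3))              # ≈ √(1 − β) ≈ 0.990
```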

That's why this works and VAEs struggled and GANs fought each other. The denoising objective splits an impossibly hard problem — sample from p(x) — into T tractable sub-problems, each one a short-range regression. You walk down an easy staircase instead of parachuting out of a helicopter.

Noise schedule (personified)
I'm the timekeeper. I decide how much marble the sculptor covers the statue in at each step — a dusting at first, then bigger handfuls toward the end. Linear, cosine, learned, it's all me. Get me wrong and your chisel has nothing to learn (strokes too small) or everything to learn at once (strokes too big). I don't get credit in the paper but I run the show.

An obvious question at this point: if the chisel is so good, why can't we just do this in one swing? Point the network at the block of static, say “give me the cat,” and let it rip. Why a thousand strokes instead of one?

Because the one-stroke version is exactly the problem diffusion was invented to avoid. Asking a network to map pure static directly to a photograph is asking Michelangelo to look at an uncarved block and chisel the David in a single motion, blindfolded, with the power of his mind. The mapping from 𝒩(0, I) to the distribution of real images is wildly non-linear, multi-modal, and contested — for any patch of static, there are a trillion equally plausible photographs it could become. A one-shot model has to pick one of those trillion on the spot, with no context, no intermediate goalpost, nothing. GANs tried this. Their training was an absolute horror show for exactly this reason.

Split the job into a thousand strokes and the math cooperates. Each stroke looks at marble that's already most of the way to being a statue and chips away a sliver. The ambiguity is local — “is this particular fleck signal or waste?” — not global. A small network with a plain MSE loss can learn that. Chain a thousand of those small learned moves together and the composition recovers the full distribution. The sculptor does the David one chisel stroke at a time because no sculptor, human or mechanical, can do it in one.

One equation, no derivation. This is the forward process — the rule that defines the staircase, the rate at which the sculptor buries the statue under chips of marble:

one step of Gaussian corruption — the whole forward process
q(xₜ | xₜ₋₁)  =  𝒩( xₜ ;  √(1 − βₜ) · xₜ₋₁ ,  βₜ · I )

          ┌──────────────────┐          ┌─────────────┐
          │  scale the old   │          │   add noise │
          │  image down a    │          │   variance  │
          │  little          │          │   βₜ        │
          └──────────────────┘          └─────────────┘

       where  βₜ  is small  (~10⁻⁴ at t=1, ~0.02 at t=T)

Read it as a sentence: to get the image at step t, take the image at step t−1, shrink it slightly, and sprinkle in a dash of Gaussian noise. That's the forward process in full. There is no neural network here. There is no training. It is a fixed, hand-specified recipe for burying a statue under marble dust one sprinkle at a time.

Because βₜ is small, one step is barely visible. Because you do it a thousand times, the end state x_T is indistinguishable from standard Gaussian noise — 𝒩(0, I). That's the crucial property: the static on the top rung of the staircase is a distribution you already know how to sample from. Draw random numbers, start chiselling.

Images are hard to reason about in 196,608 dimensions. Let's drop to two. Below is a 2D Gaussian blob — stand-in for any structured data distribution, or if you prefer, a flat-projection David. Watch noise bury it into a featureless cloud of static. Then watch a learned chisel, one stroke at a time, pull the statue back out.

[interactive figure: the denoising task — input, predicted noise, reconstruction. The noisy input xₜ = √ᾱ·x₀ + √(1−ᾱ)·ε feeds a U-Net, which outputs a predicted noise ε̂; the reconstruction is x̂₀ = (xₜ − √(1−ᾱ)·ε̂)/√ᾱ. A quality slider sweeps ε̂ from all zeros (worst) to the exact hidden ε (oracle), with live readouts of the loss ‖ε̂ − ε‖² and the reconstruction quality.]

Two things to notice. First, during the forward phase the blob doesn't vanish suddenly — it widens, flattens, blurs into the background. Each stroke of the sculptor's hammer covers up a sliver. Second, during the reverse phase the chisel isn't drawing the blob from thin air — it's nudging points, one stroke at a time, back toward where the statue lived. The blob re-emerges in the right place because the chisel has, during training, learned where the data tends to sit. It knows what's signal and what's marble waste.

That's the geometric picture. The data distribution is a thin manifold in a high-dimensional space — the statue, hiding inside the block. The forward process smears mass off the manifold into the whole space, burying the statue. The reverse process is a learned vector field pointing back toward it. Generation is a walk along that field starting from a random point in the static.
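For Gaussian data, "knows where the data tends to sit" has a closed form. If x₀ ~ 𝒩(μ₀, σ₀²) and xₜ = √ᾱ·x₀ + √(1−ᾱ)·ε, the ideal denoiser E[x₀ | xₜ] is linear in xₜ with slope √ᾱ·σ₀² / (ᾱ·σ₀² + 1 − ᾱ) — it blends the noisy point back toward the data mean. A Monte Carlo check (the numbers here are mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, a_bar = 2.0, 0.3, 0.4         # data mean/std, and ᾱ at some rung

x0 = rng.normal(mu0, sigma0, 200_000)
xt = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * rng.standard_normal(200_000)

# Closed-form slope of the ideal denoiser E[x₀ | xₜ] for Gaussian data...
slope = np.sqrt(a_bar) * sigma0**2 / (a_bar * sigma0**2 + 1 - a_bar)
# ...versus the slope a least-squares fit of x₀ on xₜ actually learns.
fit = np.cov(x0, xt)[0, 1] / np.var(xt)
print(round(slope, 4), round(fit, 4))      # the two agree closely
```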

The chisel (personified)
Give me a noisy tensor and a timestep, and I'll tell you what marble to flick away. I don't know what a cat is. I don't know what a sentence is. I know what the statue tends to look like and what the block tends to look like at every stage in between, and I close the gap. Repeat me a thousand times and you get a David.

You've seen it move. Now write it three times, each layer a bit closer to the real thing. First the noise schedule on a single scalar — the arithmetic laid bare. Then a 2D neural network toy in NumPy where we actually destroy and reconstruct a Gaussian. Finally, the PyTorch skeleton of the real algorithm, stripped to bones; the DDPM lesson later fills in the rest. Each layer uses training — the same loop you already know — with a comically simple loss function.

layer 1 — pure python · noise_schedule.py
python
import math

T = 1000
beta_start, beta_end = 1e-4, 2e-2
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

# Forward process on a single scalar — track the two deterministic pieces:
# the coefficient left on the original signal (√ᾱₜ) and the variance of the
# noise accumulated so far (1 − ᾱₜ).
signal = 1.0                            # multiplies whatever x₀ was
var_accum = 0.0                         # variance of noise accumulated so far
for t in range(T):
    signal *= math.sqrt(1 - betas[t])
    var_accum = (1 - betas[t]) * var_accum + betas[t]
    if t in (0, 250, 500, 750, 999):
        print(f"t={t:<4} β={betas[t]:.4f}   signal={signal:.4f}   "
              f"noise var so far={var_accum:.4f}")
stdout
t=0    β=0.0001   signal=0.9999   noise var so far=0.0001
t=250  β=0.0051   signal=0.7221   noise var so far=0.4786
t=500  β=0.0101   signal=0.2789   noise var so far=0.9222
t=750  β=0.0150   signal=0.0574   noise var so far=0.9967
t=999  β=0.0200   signal=0.0064   noise var so far=1.0000

The signal coefficient decays toward zero; the accumulated noise variance saturates at 1. By step 1000, anything you fed in is, for all intents and purposes, a draw from 𝒩(0, 1). Whatever it was at the start is gone — statue completely buried under marble. That's the forward process finishing its job.

Step up a dimension. In NumPy we can actually destroy a 2D distribution and train a tiny chisel to put it back. This is the real algorithm in miniature — same losses, same schedule, same sampling loop, just tractable enough to run in a notebook.

layer 2 — numpy · diffusion_2d.py
python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)           # α̅ₜ = ∏ (1 - βᵢ)

# True data: a tight 2D Gaussian blob.  Pretend this is an image dataset.
x0 = rng.normal(loc=[2.0, -1.0], scale=0.3, size=(1024, 2))

# Forward in closed form — no looping needed.
# xₜ = √α̅ₜ · x₀ + √(1 - α̅ₜ) · ε       (this identity falls out of the schedule)
def q_sample(x0, t):
    ab = alpha_bar[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps, eps

# A trivially-parametric "chisel": a learned mean for each t.
# Stand-in for a neural net — shows the shape of the problem.
mu = np.zeros((T, 2))
for step in range(5_000):
    t = rng.integers(0, T)
    xt, eps = q_sample(x0, t)
    pred = np.mean(xt, axis=0)           # our "model" predicts the batch mean
    mu[t] = 0.9 * mu[t] + 0.1 * pred     # EMA update — cartoon of gradient descent

# Reverse: sample pure static, chisel step by step using learned means.
x = rng.standard_normal((256, 2))
for t in reversed(range(T)):
    x = (x - np.sqrt(1 - alpha_bar[t]) * (x - mu[t])) / np.sqrt(alpha_bar[t])
print("recovered mean:", np.round(x.mean(axis=0), 2), "   target: [2. -1.]")

That's the whole pipeline compressed to 20 lines. A real DDPM swaps the mu[t] table for a U-Net that takes (xₜ, t) and returns the noise to subtract — the chisel stroke for that timestep — but the scaffolding (schedule, forward equation, reverse loop) is exactly this. When you read Ho et al. 2020, you're reading a more careful, conditional, parameterised version of the snippet above. For images the chisel uses convolutions because they respect the 2D locality of pixels; the same scaffolding holds.

layer 3 — pytorch skeleton · ddpm_sketch.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise ε added to x₀ to produce xₜ.  A U-Net in real life."""
    def __init__(self):
        super().__init__()
        # ...architecture lives here; filled out in the DDPM lesson...

    def forward(self, x_t, t):
        # Returns ε̂ with shape matching x_t.
        ...

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

# Training step — one of the simplest losses in deep learning.
def train_step(x0):
    t = torch.randint(0, T, (x0.size(0),))
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps     # forward, closed-form
    eps_hat = model(x_t, t)
    loss = F.mse_loss(eps_hat, eps)                  # predict the noise — that's it
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Sampling — what the user sees at inference time.
@torch.no_grad()
def sample(shape):
    x = torch.randn(shape)                           # start from pure static
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))
        # ... plug eps_hat into the reverse mean/variance formulas ...
    return x
pure python → numpy → pytorch

scalar: x = √(1−β)·x + √β·ε  ←→  batched: xₜ = √ᾱ·x₀ + √(1−ᾱ)·ε

one step per call → one call per timestep (ᾱ folds T steps into one)

learn a table mu[t]  ←→  learn a network eps_hat = model(x_t, t)

the toy table becomes a U-Net conditioned on the timestep

reverse loop on NumPy array  ←→  @torch.no_grad() sampling loop on GPU

same logic; autograd off because we aren't training during sampling

How many chisel strokes should the staircase have? More strokes mean each one is smaller, which means the chisel has an easier regression target at every single step. Quality goes up. But generation time scales linearly with T — 1000 strokes is 1000 forward passes of the U-Net, which for a big model is measured in seconds or tens of seconds per sample.

Original DDPM used T = 1000. Modern accelerated samplers — DDIM, DPM-Solver, and their descendants — reframe the reverse process as an ODE you can integrate with 50 or 20 or even 4 strokes, recovering most of the quality at a fraction of the cost. The training schedule stays at T=1000; only inference gets cheap. We'll spell that out in the samplers lesson.
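The samplers lesson does this properly, but the core move fits in a few lines: a DDIM-style step first forms x̂₀ from ε̂, then jumps directly to any earlier rung, so inference can stride through a 1000-step training schedule in 50 hops. A hedged NumPy sketch (η = 0, no clamping, no guidance — real samplers add all three):

```python
import numpy as np

# DDIM-style deterministic step: reconstruct x̂₀ from the predicted noise,
# then re-noise it straight to an earlier point in the schedule.
def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat

# Sanity check with an oracle ε̂: the step lands exactly on the earlier rung.
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
ab_t, ab_prev = 0.3, 0.7
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
stepped = ddim_step(x_t, eps, ab_t, ab_prev)
assert np.allclose(stepped, np.sqrt(ab_prev) * x0 + np.sqrt(1 - ab_prev) * eps)

# At inference you'd stride: visit, say, 50 of the 1000 trained timesteps.
taus = np.linspace(999, 0, 50).astype(int)
```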

Gotchas

“Predict the noise” vs “predict the clean image”: both parameterisations work — they're algebraically equivalent given the schedule. Most implementations predict ε (the marble dust) because the loss landscape is friendlier; some predict x₀ (the finished statue) directly; some predict v = √ᾱ·ε − √(1−ᾱ)·x₀ (the v-prediction from Progressive Distillation). Pick one and be consistent. Mixing them silently is the kind of bug that burns a week.
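That equivalence is concrete enough to write down: given the schedule value ᾱ at a timestep, any one of ε̂, x̂₀, v̂ determines the other two. A small sketch of the identities (function names are mine, not a library API):

```python
import numpy as np

def v_target(x0, eps, ab):
    """v-prediction target:  v = √ᾱ·ε − √(1−ᾱ)·x₀."""
    return np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0

def eps_from_v(x_t, v, ab):
    """Recover ε:  ε = √ᾱ·v + √(1−ᾱ)·xₜ."""
    return np.sqrt(ab) * v + np.sqrt(1 - ab) * x_t

def x0_from_v(x_t, v, ab):
    """Recover x₀:  x₀ = √ᾱ·xₜ − √(1−ᾱ)·v."""
    return np.sqrt(ab) * x_t - np.sqrt(1 - ab) * v

# Round trip: any one of ε̂, x̂₀, v̂ pins down the other two.
rng = np.random.default_rng(0)
x0, eps, ab = rng.standard_normal(8), rng.standard_normal(8), 0.37
x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
v = v_target(x0, eps, ab)
assert np.allclose(eps_from_v(x_t, v, ab), eps)
assert np.allclose(x0_from_v(x_t, v, ab), x0)
```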

Schedule bugs: off-by-one indexing on α̅t, using β where the code expects α = 1 − β, computing cumprod in the wrong direction. These produce training runs that look like they're working — loss goes down — and then sample pure static at inference. Always validate by decoding a known noisy sample end-to-end before touching the model.

T too small: if the forward process doesn't fully bury the statue by step T, x_T is not Gaussian, and sampling from 𝒩(0, I) at inference means the reverse process starts from the wrong block of marble. Quality drops. Confirm by pushing a real image all the way through the forward schedule and measuring its mean and variance; it should look like pure static.
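That last check is a few lines with the closed-form forward equation — push data with a deliberately non-Gaussian mean to t = T and measure moments (a sketch reusing the layer-2 schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 2e-2, T))

# "Data" with an aggressively non-Gaussian mean: if the schedule is long
# enough, none of it should survive at t = T.
x0 = rng.normal(loc=5.0, scale=0.5, size=100_000)
x_T = (np.sqrt(alpha_bar[-1]) * x0
       + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(100_000))

print(float(alpha_bar[-1]))                        # ≈ 4e-05: signal essentially gone
print(round(x_T.mean(), 2), round(x_T.std(), 2))   # mean ≈ 0, std ≈ 1
```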

Chisel a 2D spiral out of static

Generate 2048 points along a 2D Archimedean spiral — call this your data distribution x₀, the statue to uncover. Build a small MLP (3 hidden layers, 128 units, SiLU) that takes (xₜ, t) as input and outputs a noise prediction in ℝ². Train it for 10,000 steps with the DDPM loss from layer 3.

Now sample 512 points from 𝒩(0, I) — 512 tiny blocks of static — and run the reverse process three times: once with T = 10 chisel strokes, once with T = 50, once with T = 1000. Scatter-plot each against the original spiral.

With T=10 you'll see a smeary blob near the spiral — the sculptor cut too much in each swing. At T=50 the spiral shape starts to emerge. At T=1000 it should be crisp and distributed all along the curve. That's the bias-variance trade-off of the step count, visualised on something you can plot.
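A starting point for the dataset — the spiral parameters below are my choice; any Archimedean spiral works:

```python
import numpy as np

def make_spiral(n=2048, turns=2.0, noise=0.02, seed=0):
    """n points along an Archimedean spiral (r ∝ θ), lightly jittered."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.25, 1.0, n) * turns * 2 * np.pi   # skip the crowded centre
    r = theta / (turns * 2 * np.pi)                          # radius grows linearly to 1
    pts = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pts + noise * rng.standard_normal((n, 2))

x0 = make_spiral()
print(x0.shape)          # (2048, 2)
```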

What to carry forward. Diffusion splits generative modeling into a fixed corruption process and a learned inverse — the sculptor burying the statue under marble, and the chisel uncovering it again. Each stroke of the chisel is a small, local denoising job. Training is plain MSE against the noise you added — no adversary, no latent-space bargaining. Generation is a walk from 𝒩(0, I) down the staircase, with the learned chisel as your guide at each rung. Everything else — the U-Net, the attention conditioning, the classifier-free guidance, the latent-space compression that makes Stable Diffusion fast — is engineering on top of this one idea.

Next up — Forward & Reverse Diffusion. We've been loose with the math so far, reading equations like sentences. In forward-and-reverse-diffusion we derive them properly. The closed-form expression for q(xₜ | x₀) — why you don't actually loop T times during training. The posterior q(xₜ₋₁ | xₜ, x₀) that gives the reverse process its mean and variance. The ELBO that collapses, miraculously, into the MSE loss we just wrote. A single page of algebra that turns “chisel away the noise” from a metaphor into a theorem.

References