Latent Diffusion

Why Stable Diffusion works in VAE latent space.

Hard · ~15 min read · lesson 6 of 6

A 512×512 RGB image has 786,432 numbers in it. A diffusion model has to run its U-Net on every one of those numbers, at every one of 1,000 denoising steps, for every sample. Do the multiplication: that's the compute budget of a small country per batch of pretty pictures. On an A100 it's tens of seconds per image. On your laptop it's a coffee break. The DDPM paper (2020) was a miracle, but it was also a compute wall — diffusion stuck in pixel space is a pile of pixel arithmetic the field could not scale.

Here's the move that broke the wall. Before diffusion runs, take every training image to a thumbnail shop: a small network compresses each 512×512 photo down to a 64×64 tile with 4 channels (eight times smaller on a side, about 2% of the numbers) and stores it there. Run the entire noisy denoising loop on those thumbnails, never on the real photos. When you're done, hand the finished thumbnail back to the shop and it upscales it into a full image. That's latent diffusion in a nutshell. The U-Net doesn't know it's working on thumbnails; it just sees tensors.

The thumbnail shop is an autoencoder (specifically a VAE) trained once and frozen. Latent Diffusion Models (Rombach et al., 2022) took this embarrassingly simple idea and turned it into Stable Diffusion. Pixel diffusion wanted to paint every brushstroke on the full canvas. This lesson hires the thumbnail shop instead, diffuses on the compressed tile, and upscales once at the end. Nearly the same final image, at a small fraction of the bill.

By the end you should be able to sketch the Stable Diffusion architecture on a napkin — VAE encoder, latent U-Net, text encoder, VAE decoder — and not lie about any of it.

First the arithmetic that forced the whole field to change course. Here's what “diffuse directly on pixels” actually asks of a GPU — and what the thumbnail shop buys you in return:

the pixel-space cost — and what compression buys you

pixel space:   512 × 512 × 3         =     786,432  dims
latent space:   64 ×  64 × 4         =      16,384  dims

compression ratio:
    786,432 / 16,384  =  48×    fewer numbers to touch per step

diffusion runs ~1000 steps per sample, so the savings compound:
    pixel:   786,432 × 1000  ≈  7.9 × 10⁸   dim·steps
    latent:   16,384 × 1000  ≈  1.6 × 10⁷   dim·steps

same image, ~48× less compute. (U-Net flops scale super-linearly
in spatial size, so the real wall-clock savings are often larger.)

The widget below puts dollars on it. Slide the resolution and the downsample factor and watch the training cost estimate move. Stable Diffusion 1.5 was trained for roughly $600k on LAION. The pixel-space equivalent — same U-Net, same steps, no thumbnail shop — would have cost well north of $30M, which is why it was never built.

[interactive widget: diffuse in pixel space vs compressed latent space (Rombach et al. 2022, Stable Diffusion's core trick). Snapshot at image 512×512, downscale f = 8×:]

                        pixel space (512×512×3)    latent space (64×64×4)
pixel count             262,144                    4,096
memory (fp32)           3.15 MB                    65.5 KB
attention FLOPs/step    21.99 T                    5.37 G
conv FLOPs/step         241.59 G                   3.77 G
norm FLOPs/step         83.89 M                    1.31 M
total FLOPs/step        22.23 T                    9.14 G

64× spatial reduction: 64× fewer pixels · 2431.1× fewer FLOPs

Pay attention to what happens at 8× downsample (the Stable Diffusion default). The cost drops by almost two orders of magnitude, and (here's the miracle) perceptual quality barely moves. Push the compression further (16×, 32×) and the thumbnails get mangled: the upscaler can't recover fine detail once you've thrown it away. Pull back to 2× and you're paying for redundancy. The Rombach paper's ablation grid found the sweet spot at 4× to 8× per side; Stable Diffusion uses 8× per side (64× in area), and most latent diffusion models since have stayed close to that ratio.
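To make the trade-off concrete, here is a minimal sketch that tabulates what each downsample factor buys. It assumes the latent always keeps 4 channels (the Stable Diffusion choice; other autoencoders differ) and that self-attention cost grows with the square of the number of spatial positions. The helper name `latent_stats` is made up for this illustration.

```python
def latent_stats(side=512, f=8, latent_channels=4, pixel_channels=3):
    """Size bookkeeping for an f-times-per-side autoencoder (illustrative helper)."""
    h = side // f                                  # latent height = width
    pixel_dims = side * side * pixel_channels
    latent_dims = h * h * latent_channels
    return {
        "latent_shape": (latent_channels, h, h),
        "dims_ratio": pixel_dims / latent_dims,    # fewer numbers per step
        "positions_ratio": f ** 2,                 # fewer spatial positions
        "attention_ratio": f ** 4,                 # attention ~ positions squared
    }


if __name__ == "__main__":
    for f in (2, 4, 8, 16, 32):
        s = latent_stats(f=f)
        print(f"f={f:>2}  latent={s['latent_shape']}  "
              f"dims {s['dims_ratio']:.0f}x  attention {s['attention_ratio']:,}x")
```

At f = 8 the attention ratio is 4,096×, which is roughly the gap between the two attention figures in the widget snapshot (21.99 T vs 5.37 G). That quadratic term is why the total FLOPs saving is so much larger than the 48× saving in raw numbers.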

VAE (the thumbnail shop) (personified)

I am the compressor and the upscaler. I eat a 512×512 image and spit out a 4×64×64 latent tile — 48× smaller, but still holding everything you need to reconstruct the picture. I was trained once, for a long time, with perceptual and adversarial losses, and now I'm frozen. Diffusion does its thing in my latent space; I just encode photos going in and upscale thumbnails coming out. I'm unglamorous infrastructure, and the entire generative image economy runs on me.

Zoom out. Stable Diffusion is not a single model — it's four trained components stitched together. Here's the forward pass for text-to-image, end to end, with the thumbnail shop at both ends of the loop:

the full latent diffusion pipeline — text to pixels

    "a cat wearing a space helmet"
              │
              ▼
    ┌─────────────────────┐
    │  CLIP / T5 encoder  │     frozen text tower
    └─────────────────────┘
              │   c ∈ R^(77 × 768)      conditioning vectors
              ▼
    ┌─────────────────────┐     z_T ~ N(0, I)   ← random latent noise
    │   Latent U-Net      │                      shape: 4 × 64 × 64
    │   (cross-attends c) │     loop T=1000 → 50 steps with DDIM
    └─────────────────────┘
              │   z_0        clean latent, 4 × 64 × 64
              ▼
    ┌─────────────────────┐
    │   VAE decoder       │     frozen, pretrained
    └─────────────────────┘
              │
              ▼
         512 × 512 × 3   RGB image

training objective (only the U-Net is learned here):
    L  =  E_{z_0, t, ε}  ‖ ε  −  ε_θ( z_t, t, c ) ‖²

       with z_0 = E_vae(x),   z_t = √ᾱ_t · z_0 + √(1−ᾱ_t) · ε

Three of the four blocks are frozen. The VAE encoder and decoder (the thumbnail shop's compress and upscale counters) were pretrained separately and never get touched during diffusion training. The text encoder is also frozen (CLIP ViT-L for SD 1.5; SD 3 adds T5-XXL alongside CLIP encoders). Only the U-Net learns to denoise, and it learns inside the thumbnail shop's coordinate system. That factorization is the architectural insight: each module solves the problem it's good at, and none of them fight each other's loss.
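The noising formula in the training objective above is short enough to check by hand. The numpy sketch below builds ᾱ_t, noises a stand-in latent, and evaluates the ε-prediction loss. The linear beta schedule (1e-4 to 0.02) is an assumption borrowed from the original DDPM setup, not necessarily the schedule Stable Diffusion uses, and the random `z0` stands in for a real E_vae(x).

```python
import numpy as np


def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product alpha-bar_t for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def noise_latent(z0, t, alpha_bar, rng):
    """Return (z_t, eps) with z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps


def eps_loss(eps_pred, eps):
    """The epsilon-prediction objective: mean squared error on the noise."""
    return float(np.mean((eps_pred - eps) ** 2))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))        # stand-in for E_vae(x)
z_t, eps = noise_latent(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# A "model" that always predicts zero noise gives a loss near 1 (the variance of eps).
print(z_t.shape, eps_loss(np.zeros_like(eps), eps))
```

Because z_t is a known linear mix of z_0 and ε, a perfect noise prediction is equivalent to a perfect prediction of the clean latent, which is what the sampler exploits at inference time.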

Let's look at what the autoencoder actually does to an image. Pick a pair of source images, watch them get compressed down to their 4×64×64 thumbnails, and drag the slider to interpolate between those two latents and decode the result. This is the bit that makes the latent a useful substrate — linear mixes of thumbnails upscale into smooth, coherent blends of the source images, in a way that linear mixes of pixels never would.

[interactive widget: 2D VAE latent manifold, drag to decode. A point z ∈ ℝ² in latent space is decoded to x̂ = decoder(z); the nearest cluster (circle, square, triangle, rings) decides class identity, and moving between two clusters morphs one shape smoothly into the other.]

The interpolation is the point. In pixel space, averaging two images gives you a literal double-exposure — two faces ghosted on top of each other. In the thumbnail shop's latent, averaging two encodings and upscaling gives you something that looks like a plausible face halfway between them. The latent coordinates parameterize a smooth manifold of “image-like things,” and that smooth manifold is exactly what diffusion needs in order to trace a path from noise to a realistic sample.
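The double-exposure effect is easy to reproduce in a toy setting. In the sketch below an "image" is a 1D bump and its "latent" is the bump's position; `encode` and `decode` are made-up stand-ins for illustration, not the Stable Diffusion VAE. Averaging pixels leaves two half-height bumps, while averaging latents and decoding gives one bump in between.

```python
import numpy as np

GRID = np.linspace(0.0, 10.0, 101)            # pixel coordinates of the toy 1D image


def decode(z, width=0.5):
    """Toy decoder: render a bump centred at latent position z."""
    return np.exp(-0.5 * ((GRID - z) / width) ** 2)


def encode(x):
    """Toy encoder: recover the bump position as the centre of mass."""
    return float(np.sum(GRID * x) / np.sum(x))


x_a, x_b = decode(2.0), decode(8.0)

# Pixel-space average: both bumps survive at half height (double exposure).
pixel_blend = 0.5 * x_a + 0.5 * x_b

# Latent-space average: a single bump halfway between the two sources.
latent_blend = decode(0.5 * encode(x_a) + 0.5 * encode(x_b))

print("pixel blend at centre: ", pixel_blend[50])
print("latent blend at centre:", latent_blend[50])
```

A real VAE latent is far less tidy than a single position coordinate, so treat this as intuition for why a well-structured latent interpolates better than pixels, not as a claim about how the SD autoencoder organizes its space.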

A quick note on VAE training, because the “V” matters. A vanilla autoencoder would learn to compress without any constraint on the latent distribution — the thumbnails could cluster into weird, disconnected islands. The VAE adds a KL term that pushes the latent distribution toward an isotropic Gaussian, which gives you the smooth, interpolable space you see above. Stable Diffusion's thumbnail shop goes further: it's trained with a perceptual loss (LPIPS — distance in a pretrained VGG feature space) and an adversarial loss (a discriminator, like in a GAN). Those two losses are what make the upscaler produce crisp, detailed reconstructions instead of the blurry porridge a plain L2-trained VAE gives you.
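For a diagonal Gaussian encoder the KL term has a closed form, sketched below in numpy. This covers only the reconstruction and KL parts of the loss; the perceptual (LPIPS) and adversarial terms need pretrained networks and are left out. The `kl_weight` default is an illustrative assumption (latent diffusion autoencoders use a very small weight so the KL term regularizes gently), and the function names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * float(np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar))


def vae_loss(x, x_hat, mu, logvar, kl_weight=1e-6):
    """Pixel MSE plus a lightly weighted KL term (perceptual/GAN terms omitted)."""
    recon = float(np.mean((x - x_hat) ** 2))
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)


# A latent that already matches N(0, I) pays no KL penalty...
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))
# ...while shifting one unit-variance dimension by 1 costs 0.5 nats.
print(kl_to_standard_normal(np.array([1.0]), np.array([0.0])))
```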

Latent diffusion (personified)

I'm the cheap path. Pixel diffusion wanted to hire a million workers to paint every brushstroke on a 512×512 canvas; I send the image to the thumbnail shop, let a small crew sketch the composition on the tiny tile, and have the shop upscale it back at the end. Nearly the same final output, at a small fraction of the bill. That's why Stable Diffusion runs on your gaming laptop and DALL·E 2 runs on a datacenter.

Three layers, same story as every other lesson: a numpy toy that just counts flops, a pytorch sketch of the training loop with a real thumbnail shop in the pipeline, and the one-line diffusers call that's how you'd actually use this in practice.

layer 1 — numpy · pixel_vs_latent_cost.py

python

import numpy as np

# image spec
H, W, C = 512, 512, 3
pixel_dims  = H * W * C                     # 786,432
latent_dims = (H // 8) * (W // 8) * 4       #  16,384   (8× spatial, 4 latent channels)

T           = 1000            # diffusion steps per sample
flops_per_d = 50              # crude per-dim U-Net flop estimate
samples     = 1_000_000_000   # SD-scale training dataset

pixel_cost  = pixel_dims  * T * flops_per_d
latent_cost = latent_dims * T * flops_per_d

print(f"pixel-space:  {pixel_dims:>7,} dims × {T} steps × {flops_per_d} flops/dim  =  {pixel_cost:.2e} flops/sample")
print(f"latent-space: {latent_dims:>7,} dims × {T} steps × {flops_per_d} flops/dim  =  {latent_cost:.2e} flops/sample")
print(f"speedup: {pixel_cost / latent_cost:>39.1f}×")

# Toy price per 1e15 flops. This is a calibration constant, not a real A100 rate:
# it is chosen so the latent estimate lands near the ~$600k figure quoted earlier.
dollars_per_pflop = 781.25
print(f"for {samples/1e9:.0f}B training samples (SD-class run):")
print(f"  pixel training cost estimate:   ${pixel_cost  * samples / 1e15 * dollars_per_pflop:>12,.0f}")
print(f"  latent training cost estimate:  ${latent_cost * samples / 1e15 * dollars_per_pflop:>12,.0f}")

stdout

pixel-space:  786,432 dims × 1000 steps × 50 flops/dim  =  3.93e+10 flops/sample
latent-space:  16,384 dims × 1000 steps × 50 flops/dim  =  8.19e+08 flops/sample
speedup:                                    48.0×
for 1B training samples (SD-class run):
  pixel training cost estimate:   $  30,720,000
  latent training cost estimate:  $     640,000

That's the motivation. Now the thing itself: one training step of an LDM, using a real pretrained thumbnail shop to encode pixels, sampling a timestep, adding noise, predicting it back. In essence, this is what a single Stable Diffusion training step does.

layer 2 — pytorch · ldm_train_step.py

python

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTokenizer, CLIPTextModel

# ─── frozen substrate: VAE + text encoder ────────────────────────
vae        = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
tokenizer  = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_enc   = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

for p in vae.parameters():      p.requires_grad_(False)
for p in text_enc.parameters(): p.requires_grad_(False)

# ─── trainable: the U-Net that denoises in latent space ──────────
unet       = UNet2DConditionModel(sample_size=64, in_channels=4,
                                  out_channels=4, cross_attention_dim=768)
scheduler  = DDPMScheduler(num_train_timesteps=1000)

def train_step(images, captions, optimizer):
    # 1. encode pixels to latents.  shape 3×512×512  →  4×64×64
    #    (images: float tensor scaled to [-1, 1], which is what the VAE expects)
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample()
        latents = latents * 0.18215                # SD's magic scaling factor

    # 2. encode text  →  77 × 768  conditioning
    #    truncation=True keeps long captions within CLIP's 77-token window
    with torch.no_grad():
        tokens = tokenizer(captions, padding="max_length", truncation=True,
                           max_length=77, return_tensors="pt")
        cond   = text_enc(tokens.input_ids.to(latents.device)).last_hidden_state

    # 3. sample a timestep and a noise vector, noise the latent
    t       = torch.randint(0, scheduler.config.num_train_timesteps,
                            (latents.size(0),), device=latents.device)
    noise   = torch.randn_like(latents)
    noisy   = scheduler.add_noise(latents, noise, t)         # z_t

    # 4. predict the noise with the U-Net  (this is ε_θ(z_t, t, c))
    pred    = unet(noisy, t, encoder_hidden_states=cond).sample

    # 5. MSE between true noise and prediction  ← the diffusion loss
    loss    = F.mse_loss(pred, noise)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

pixel-space DDPM → latent-space LDM

noise = randn(3, 512, 512)  ←→  noise = randn(4, 64, 64)
    48× fewer numbers per step: the whole point

ε_θ(x_t, t) U-Net over pixels  ←→  ε_θ(z_t, t, c) U-Net over latents + cross-attn(text)
    same loss, different domain, plus conditioning on CLIP text

image = sample()  ←→  latent = sample(); image = vae.decode(latent)
    one VAE decode at the very end: O(1) overhead per sample
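The cross-attn(text) entry is the one genuinely new operation compared with pixel-space DDPM, so here is a minimal single-head sketch in numpy. Each latent position forms a query, and the 77 text-token vectors supply keys and values. The channel widths (320 for the latent features, 64 for the head) and the random projection matrices are illustrative assumptions, not values read from the Stable Diffusion checkpoint.

```python
import numpy as np


def cross_attention(latent_tokens, text_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: latent positions query the text tokens."""
    Q = latent_tokens @ Wq                          # (N, d)
    K = text_tokens @ Wk                            # (77, d)
    V = text_tokens @ Wv                            # (77, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (N, 77)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ V, weights


rng = np.random.default_rng(0)
N, c_latent, c_text, d = 64 * 64, 320, 768, 64      # toy sizes (assumed)
latent_tokens = rng.standard_normal((N, c_latent))  # flattened 64×64 feature map
text_tokens = rng.standard_normal((77, c_text))     # stand-in for c ∈ R^(77 × 768)
Wq = rng.standard_normal((c_latent, d)) / np.sqrt(c_latent)
Wk = rng.standard_normal((c_text, d)) / np.sqrt(c_text)
Wv = rng.standard_normal((c_text, d)) / np.sqrt(c_text)

out, weights = cross_attention(latent_tokens, text_tokens, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Note that the score matrix is N × 77, not N × N, so conditioning on text adds a cost that is linear in the number of latent positions.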

In practice you never write any of that by hand. Hugging Face's diffusers library wraps the whole stack — thumbnail shop, text encoder, U-Net, scheduler, guidance, safety checker — behind one pipeline object. Three lines, one image.

layer 3 — diffusers · stable_diffusion.py

python

from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# This checkpoint ships with a PNDM scheduler by default; swap in DDIM
# explicitly so the run matches the 50-step DDIM loop described in this lesson.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# One call: text → CLIP → 50-step DDIM latent denoise → VAE decode → pixels.
# Everything we built in layer 2 lives inside .__call__(...)
image = pipe(
    prompt="an astronaut riding a horse on Mars, photorealistic, 4k",
    num_inference_steps=50,
    guidance_scale=7.5,           # classifier-free guidance strength
).images[0]

image.save("astronaut_horse.png")

stdout (illustrative)

Loading pipeline components...
  - VAE              (frozen)
  - text_encoder     (frozen, CLIP-ViT-L)
  - unet             (trained, ~860M params)
  - scheduler        (DDIM, 50 steps)
100%|██████████| 50/50 [00:03<00:00, 16.21it/s]
saved to astronaut_horse.png  (512×512 RGB)

pytorch LDM → diffusers pipeline

vae, text_enc, unet, scheduler = … # assemble by hand  ←→  pipe = StableDiffusionPipeline.from_pretrained(...)
    one call pulls all four trained components from the hub

for t in timesteps: z = denoise(z, t, c)  ←→  pipe(prompt, num_inference_steps=50)
    the denoising loop is hidden inside the pipeline call

image = vae.decode(z) ; normalize ; to_uint8  ←→  .images[0]
    you get a PIL.Image back, with no scaling gymnastics
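For a feel of what the hidden denoising loop does, here is a numpy sketch of deterministic DDIM sampling (eta = 0) over a 50-step subset of the 1000 training timesteps. `eps_model` is a hypothetical callable standing in for the U-Net with the text conditioning already bound, and the linear beta schedule is an assumption. The demo uses an oracle that knows the target latent, purely to show that the update rule is self-consistent.

```python
import numpy as np


def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alpha-bar_t for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def ddim_sample(eps_model, shape, alpha_bar, num_steps=50, rng=None):
    """Deterministic DDIM sampling (eta = 0) over a subset of timesteps."""
    if rng is None:
        rng = np.random.default_rng(0)
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    z = rng.standard_normal(shape)                     # z_T ~ N(0, I)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # after the last step we land on the clean latent (alpha-bar = 1)
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = eps_model(z, t)                          # predicted noise
        z0_pred = (z - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        z = np.sqrt(ab_prev) * z0_pred + np.sqrt(1.0 - ab_prev) * eps
    return z


alpha_bar = make_alpha_bar()
target = np.random.default_rng(1).standard_normal((4, 8, 8))   # pretend clean latent


def oracle_eps(z, t):
    """Cheating noise predictor that already knows the target latent."""
    return (z - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])


sample = ddim_sample(oracle_eps, target.shape, alpha_bar)
print(np.abs(sample - target).max())
```

In the real pipeline the result of this loop is still a latent; it is divided by the scaling factor and passed through the VAE decoder once to become pixels.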

Gotchas

Upscaler reconstruction artifacts: SD's thumbnail shop has known failure modes, such as “fish-eye” artifacts on eyes, wobble on text, and grid patterns on large flat areas. These aren't the U-Net's fault; the upscaler itself cannot reconstruct those regions cleanly. For SD 1.x, swapping in the fine-tuned sd-vae-ft-mse decoder reduces many of them. The SDXL VAE is cleaner still, but it only works with a U-Net trained in its latent space (see the next gotcha).

Wrong thumbnail shop for the LDM: every diffusion model is trained in its own autoencoder's coordinate system. Swapping in SDXL's VAE under a SD 1.5 U-Net gives you colorful static, because the U-Net is denoising in the wrong coordinate frame. Keep each U-Net paired with the VAE it was trained on.

Not scaling the latent: SD multiplies encoded latents by 0.18215 before diffusion and divides by the same number after. Forget the scaling and your thumbnails are roughly 5.5× larger (1/0.18215) than the U-Net was trained to expect, and outputs come out washed out or saturated.

CFG in pixel space: classifier-free guidance (the guidance_scale=7.5 knob) is defined on the U-Net's predicted noise, which lives in the thumbnail. Some older code re-implements CFG by upscaling conditional and unconditional latents and blending the pixels. It doesn't work — you get ghosting and color bleed. Guide the latents, upscale once at the end.
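Two of the gotchas above come down to a line of arithmetic each, sketched here in numpy. The helper names are hypothetical; the constant 0.18215 and the guidance formula (blend the conditional and unconditional noise predictions, in latent space) are the ones the text describes.

```python
import numpy as np

SD_LATENT_SCALE = 0.18215      # SD 1.x latent scaling factor


def to_unet_space(vae_latent):
    """Scale a raw VAE encoding into the range the U-Net was trained on."""
    return vae_latent * SD_LATENT_SCALE


def to_vae_space(unet_latent):
    """Undo the scaling before handing a latent to the VAE decoder."""
    return unet_latent / SD_LATENT_SCALE


def guided_eps(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance, applied to predicted noise in latent space."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 64, 64)) / SD_LATENT_SCALE   # toy "unscaled" latent
print("raw std:", raw.std(), " scaled std:", to_unet_space(raw).std())

eps_u = rng.standard_normal((4, 64, 64))
eps_c = rng.standard_normal((4, 64, 64))
eps = guided_eps(eps_u, eps_c)      # feed this to the sampler, then decode once
```

With guidance_scale = 1 the formula returns the conditional prediction unchanged, and with 0 it returns the unconditional one; values above 1 extrapolate past the conditional prediction.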

Measure the thumbnail shop's reconstruction error

Load a VAE compatible with Stable Diffusion 1.5 (stabilityai/sd-vae-ft-mse, the fine-tuned drop-in for SD 1.x). Pick any 512×512 RGB image: a photograph, a screenshot, whatever. Send it through the compress side (encode to 4×64×64 thumbnail), then immediately through the upscale side (decode back to pixels) without running any diffusion at all. You now have a before and an after.

Compute three things: the mean-squared error between the original and the reconstruction; the fraction of pixels that changed by more than 5/255 (i.e. perceptibly); and a side-by-side difference image (amplify the diff by 10× so you can see where the upscaler struggled). If you're on a face, the eyes and teeth will light up. If you're on text, the letterforms will.

Bonus: repeat with the SDXL VAE (madebyollin/sdxl-vae-fp16-fix) and observe how much cleaner the reconstruction is. That delta is one visible part of the quality jump between SD 1.5 and SDXL: some of SDXL's gain came from the better thumbnail shop, not only from the bigger U-Net.
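The three measurements in the exercise above can be computed with a small helper like the one below. It deliberately leaves out the VAE round-trip itself and only handles the comparison. It assumes both images are uint8 arrays of shape (H, W, 3), reports MSE on the 0 to 255 scale, and counts a pixel as changed if any of its channels moved by more than the threshold; those conventions are choices made for this sketch.

```python
import numpy as np


def reconstruction_report(original, recon, threshold=5, gain=10):
    """Compare two uint8 (H, W, 3) images: MSE, changed fraction, amplified diff."""
    a = original.astype(np.float64)
    b = recon.astype(np.float64)
    diff = np.abs(a - b)
    mse = float(np.mean((a - b) ** 2))                        # on the 0..255 scale
    changed = float(np.mean(diff.max(axis=-1) > threshold))   # any channel moved
    diff_image = np.clip(diff * gain, 0, 255).astype(np.uint8)
    return mse, changed, diff_image


# Tiny synthetic check: one channel of one pixel is off by 10.
original = np.zeros((4, 4, 3), dtype=np.uint8)
recon = original.copy()
recon[0, 0, 0] = 10
mse, changed, diff_image = reconstruction_report(original, recon)
print(mse, changed, diff_image[0, 0])
```

Casting to float before subtracting matters: subtracting uint8 arrays directly wraps around instead of going negative, which silently inflates the error.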

What to carry forward. Latent diffusion is the trick that made generative images cheap: do the hard work (denoising) on a compressed thumbnail in a semantically dense latent, and let a pretrained autoencoder handle the compress/upscale round-trip to and from pixels. The thumbnail shop is infrastructure, not a model you retrain. The U-Net does the same ε-prediction as pixel-space DDPM, just on 48× fewer numbers and with cross-attention on text. Stable Diffusion isn't a single architecture; it's four trained components (VAE encoder, text encoder, U-Net, VAE decoder) wired together, and the separation of concerns is why the whole stack is tractable.

End of the Diffusion section. You've walked from the forward noising process, through the reverse denoising ELBO, through the U-Net and DDIM, into classifier-free guidance and text conditioning, and out the other side with the thumbnail-shop architecture that actually powers Stable Diffusion. That's the whole story of a major generative modality.

Next section — Reinforcement Learning, starting with Markov Decision Processes. We leave the world where a model is handed a static dataset of images and a fixed loss. RL is the regime where an agent takes actions in an environment, receives rewards, and has to figure out a policy from scratch. No labels, no thumbnails — just states, actions, and a signal that says “that was better” or “that was worse.” The first lesson builds the contract: states s, actions a, transition probabilities P(s' | s, a), rewards r, and the Markov assumption that tomorrow only depends on today. Bellman equations instead of log-likelihoods, policy gradients instead of supervised MLE. Bring a clean notebook.

References

[01]
High-Resolution Image Synthesis with Latent Diffusion Models
Rombach, Blattmann, Lorenz, Esser, Ommer · CVPR 2022 — the Stable Diffusion / LDM paper
[02]
Auto-Encoding Variational Bayes
Kingma, Welling · ICLR 2014 — the original VAE
[03]
Taming Transformers for High-Resolution Image Synthesis
Esser, Rombach, Ommer · CVPR 2021 — the VQGAN paper that seeded the LDM VAE
[04]
diffusers — the Hugging Face diffusion library
von Platen et al. · github.com/huggingface/diffusers