Classifier-Free Guidance
Steer generation without an auxiliary classifier.
Picture an old radio with one dial and two antennae. The first antenna picks up a signal that says “a corgi wearing sunglasses” — loud, specific, demanding. The second antenna picks up the station the model would play if you hadn't asked for anything at all — generic ambient hum, the sound of some image, any image. Between those two signals is a single dial. Turn it one way and the radio ignores your request and plays the hum. Turn it the other way and the radio shouts your prompt at you in over-saturated technicolor. Finding the sweet spot in the middle is what modern image generation is, mostly.
That dial has a name: the guidance scale. The trick that makes it work — a single network trained to listen on both antennae — has a name too: Classifier-Free Guidance (Ho & Salimans, 2022). Every text-to-image model you've ever used — Stable Diffusion, DALL-E, Imagen, Midjourney, Flux — is one big CFG machine under the hood. When a user slides the “prompt adherence” knob in a UI, they are turning one number in the equation we're about to derive.
A bit of context before we start tuning the dial. A diffusion model trained on millions of images will happily generate plausible pictures forever — unconditionally, meaning without you asking for anything in particular. The hard problem is the next one: you type “a corgi wearing sunglasses” and you want a corgi wearing sunglasses, not a nice landscape. Getting a denoising model to actually obey a caption turns out to be a surprisingly deep research question.
The first real answer, from 2021, was classifier guidance: train a separate image classifier on noisy images, then use its gradient at each sampling step to nudge the sample toward the target class. It works. But now you need two models, and the classifier has to be trained on noisy data, and nobody wants two models.
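For reference, the classifier-guidance update can be written in the ε-parameterization as follows; this is a sketch following Dhariwal & Nichol's formulation, with s as the guidance strength:

```latex
% Classifier guidance: a separate classifier p_phi(y | x_t), trained on
% noisy images, supplies a gradient that nudges each sampling step
% toward class y. s is the guidance strength.
\hat{\epsilon}(x_t) \;=\; \epsilon_\theta(x_t)
  \;-\; s \,\sqrt{1 - \bar{\alpha}_t}\;\nabla_{x_t} \log p_\phi(y \mid x_t)
```

The extra gradient term is why the classifier must be trained on noisy images: it is evaluated at x_t, not at a clean sample.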
CFG's answer is cleaner: don't bolt on a second model — teach the same model to speak on both channels. Randomly drop the caption during training so the network learns the prompted signal and the ambient hum in the same set of weights. At generation time, run the net twice per step, read both signals, and extrapolate along the direction between them. One dial, one knob, one hyperparameter. That's the whole algorithm.
Left on my own, I'll draw you something. Attach a prompt and I'll try to follow it, but my obedience is negotiable. Crank the guidance dial up and I'll drop the hedging — I'll give you exactly what you asked for, for better or worse.
Here's the move, concretely. At every sampling step, run the U-Net twice: once with the prompt (conditional channel), once without (unconditional — you pass a null token, typically an empty string or a special learned embedding). You now have two noise predictions, two radio signals:
ε_cond — what the model thinks the noise is, given the prompt. The conditional signal.
ε_uncond — what the model thinks the noise is with no prompt at all. The unconditional hum.
The difference ε_cond − ε_uncond is a direction in noise-prediction space. It points the way from “generic image” toward “this prompt”. The dial takes that direction and amplifies it:
ε̂_guided = ε_uncond + s · (ε_cond − ε_uncond)
←──── baseline ────→ ←── amplified direction ──→
where s = guidance scale (a.k.a. CFG scale, w, or guidance weight)
s = 0 → pure unconditional — ignores the prompt
s = 1 → normal conditional generation
s = 7.5 → Stable Diffusion default — strong adherence
s = 15–20 → aggressive — saturated colors, prompt dominates

Rewrite the same equation as ε̂ = (1 − s) · ε_uncond + s · ε_cond and the structure becomes obvious: it's a linear extrapolation. When s > 1 you're not interpolating between the two signals, you're shooting past the conditional one in the direction it was already heading. That's the whole trick — crank the dial and the radio doesn't just play the prompted station louder, it extrapolates into a realm where the prompted signal is even more itself than it was to begin with.
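A quick numerical check that the two forms are the same line (toy arrays, purely illustrative):

```python
import numpy as np

uncond = np.full(5, 0.10)                         # toy unconditional prediction
cond = np.array([0.02, 0.08, 0.20, 0.14, 0.04])   # toy conditional prediction

for s in [0.0, 1.0, 7.5, 15.0]:
    form_a = uncond + s * (cond - uncond)         # extrapolation form
    form_b = (1 - s) * uncond + s * cond          # linear-combination form
    assert np.allclose(form_a, form_b)            # identical at every dial setting
```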
Slide s from 0 to 20 and watch the dial work. At s = 0 the model generates whatever it wants — the prompt is invisible, pure unconditional hum. At s = 1 you get honest conditional generation; the prompt registers, but weakly. Crank past s = 5 and the prompt starts dominating — colors intensify, the subject centers itself, stylistic cues get loud. Around s = 15–20 you tip into the over-saturated regime: contrast blows out, fine detail collapses, and the image starts to look like a caricature of the prompt rather than a picture that matches it.
There is no “correct” position on the dial. It's a knob you tune per-prompt, per model, per use-case. s = 7.5 is the Stable Diffusion default not because it's theoretically optimal but because it looked good on a lot of prompts during development.
I'm the empty prompt — the ambient hum the model plays when no one's asked for anything. Subtract me from the conditional signal and whatever's left is the pure flavor of the prompt. Without me, the dial has nothing to amplify away from.
We skipped the most important part on purpose. How does one network produce both signals? It doesn't own two brains. There's no second unconditional model sitting in a locker. The whole thing rests on a single training trick, and it's the bit the “classifier-free” in the name is really about.
During training, with some small probability — usually 10% — you replace the real caption with a null embedding before feeding the batch to the network. Nine times out of ten the model sees “a red car on a coastal highway” and learns the conditional signal for that caption. The tenth time it sees a special empty token and learns the ambient hum — what noise, on average, looks like without any caption at all. Same weights. Same loss. One network quietly moonlighting as two.
That's why it's called classifier-free. No auxiliary classifier, no second model, no extra training pipeline. Just one line of code in the training loop that says sometimes the caption is missing, and suddenly you have two radio stations broadcasting from one transmitter.
The dial comes with a price tag. What you buy by cranking the guidance scale is prompt adherence — the image looks more like what you asked for. What you pay is diversity and naturalness. The intuition: amplifying the same direction step after step pushes the sample outside the data distribution the model was trained on. Colors saturate because the model is exaggerating the “this is a red apple” signal past what real red apples look like.
High s buys you photorealism that feels photographic — contrast, specular highlights, unmistakable subject-centering — and costs you every sample that doesn't fit that one aesthetic. Run the same prompt a hundred times at s = 2 and you get a hundred different corgis in a hundred different moods. Run it at s = 15 and you get the same corgi, same pose, same lens, a hundred times. Crank the dial, kill the diversity.
┌───────────────────────────────────────────────┐
│ adherence ↔ quality │
└───────────────────────────────────────────────┘
low s (0–3) : faithful to the data distribution
diverse samples, but may miss the prompt
e.g. "red car" → sometimes returns a blue one
mid s (5–9) : sweet spot for most text-to-image
prompt lands, artifacts stay rare
high s (12–20) : prompt dominates, samples look stylized
colors saturate, skin smooths, detail thins
diversity collapses — same prompt, same image
very high s (30+): model leaves the manifold of plausible images
artifacts, neon blobs, structural nonsense

Figure: three panels for the prompt “a cat”. The left panel shows ε_uncond — what the model predicts without the prompt, the ambient unconditional hum. The middle shows ε_cond — the cat-flavored conditional signal. The right panel shows the guided combination at a given dial setting. Turn up the scale and the guided prediction drifts away from the unconditional baseline and past the conditional one: at s = 1 the guided prediction equals ε_cond exactly; at s = 10 it lands well beyond.
This is what's actually happening at every single sampling step — all 50, or 20, or whatever you're using — throughout the denoising process. The U-Net runs twice per step, and you spend roughly 2× the compute for classifier-free guidance compared to plain conditional sampling. That's not a rounding error; it's why batched inference for diffusion models always pairs conditional and unconditional forward passes.
I'm the adherence dial. Turn me low and the model is tasteful but mumbly. Turn me high and the model shouts your prompt in saturated technicolor. Somewhere between five and ten I usually sit. Nobody knows the right value for me — you'll find it by trial and error.
The original paper frames CFG as implicit Bayes. If you train a single network to predict ε under both p(x|c) (conditional) and p(x) (unconditional), then sampling with guidance is approximately sampling from a sharpened distribution p(x|c)^s · p(x)^(1−s). When s > 1 this distribution is peakier than the original p(x|c) — high-likelihood modes get amplified, low-likelihood ones get crushed. That's the math behind the intuition: cranking the dial concentrates mass on samples that are especially prompt-like.
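In score notation, the sharpened distribution and the update rule are the same statement; a sketch of the algebra, using the fact that the ε-prediction is proportional to the negative score:

```latex
% Guided distribution: p_s(x | c) \propto p(x | c)^s \, p(x)^{1 - s}
% Its score is a linear combination of the two scores the network learned:
\nabla_x \log p_s(x \mid c)
  = s \,\nabla_x \log p(x \mid c) + (1 - s)\,\nabla_x \log p(x)
  = \nabla_x \log p(x) + s\bigl(\nabla_x \log p(x \mid c) - \nabla_x \log p(x)\bigr)
% Reading off the epsilon-parameterization recovers the update rule:
% eps_hat = eps_uncond + s (eps_cond - eps_uncond)
```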
You don't need to carry that derivation in your head, but it's useful to know: guidance trades diversity for peakiness. It is, literally and mathematically, a temperature-like sharpening of the conditional distribution.
Three layers, as always. First the update rule on toy signals — just to see the linear extrapolation doing its thing. Then the training trick (10% unconditional caption dropout) inside a PyTorch loop. Then the one-liner in diffusers that every production pipeline ultimately boils down to.
import numpy as np

# Pretend these are noise predictions from a denoiser at some step.
# Uncond = "nothing asked for" → flat-ish, low-information — the ambient hum.
# Cond = "given the prompt" → structured, peaked where the prompt wants signal.
uncond = np.full(10, 0.10)
cond = np.array([0.02, 0.04, 0.08, 0.14, 0.20, 0.20, 0.14, 0.08, 0.06, 0.04])

def cfg(uncond, cond, s):
    return uncond + s * (cond - uncond)  # the entire algorithm, one line

print("uncond =", uncond)
print("cond   =", cond)
for s in [0.0, 1.0, 7.5]:
    guided = cfg(uncond, cond, s)
    print(f"s={s:<4} →", np.round(guided, 2))

uncond = [0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10]
cond   = [0.02 0.04 0.08 0.14 0.20 0.20 0.14 0.08 0.06 0.04]
s=0.0  → [0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10]
s=1.0  → [0.02 0.04 0.08 0.14 0.20 0.20 0.14 0.08 0.06 0.04]
s=7.5  → [-0.50 -0.35 -0.05 0.40 0.85 0.85 0.40 -0.05 -0.20 -0.35]
That's the sampling side of the dial. The training side is where CFG earns its name — you never train a separate classifier, you just randomly drop the caption some of the time so the same network learns both the conditional signal and the unconditional hum.
import torch
import torch.nn as nn

UNCOND_DROP_PROB = 0.10  # 10% of training examples see the null condition

def train_step(unet, x0, cond_emb, null_emb, scheduler, optimizer):
    # 1. Sample timestep and noise — ordinary DDPM.
    t = torch.randint(0, scheduler.num_train_timesteps, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    xt = scheduler.add_noise(x0, noise, t)
    # 2. CFG's one and only training trick:
    #    with prob p, swap the real conditioning for the null embedding.
    #    The mask gets one broadcast dim per non-batch dim of cond_emb.
    mask = torch.rand(x0.size(0), device=x0.device) < UNCOND_DROP_PROB
    mask = mask.view(-1, *([1] * (cond_emb.dim() - 1)))
    c = torch.where(mask, null_emb, cond_emb)
    # 3. Predict noise and regress against the true noise — same MSE as always.
    pred = unet(xt, t, c)
    loss = nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
@torch.no_grad()
def sample_with_cfg(unet, scheduler, cond_emb, null_emb, guidance_scale=7.5,
                    shape=(1, 3, 64, 64), num_steps=50):
    x = torch.randn(shape, device=cond_emb.device)
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # One batched forward pass over [uncond, cond] — twice the memory, one call.
        x_in = torch.cat([x, x], dim=0)
        c_in = torch.cat([null_emb, cond_emb], dim=0)
        eps = unet(x_in, t, c_in)
        eps_u, eps_c = eps.chunk(2)
        # The CFG update rule — identical to layer 1. Crank s to crank adherence.
        eps_hat = eps_u + guidance_scale * (eps_c - eps_u)
        x = scheduler.step(eps_hat, t, x).prev_sample
    return x

uncond = np.full(10, 0.10) ←→ c = torch.where(mask, null_emb, cond_emb) — the unconditional branch isn't a separate model; it's the same model fed the null embedding
guided = uncond + s*(cond - uncond) ←→ eps_hat = eps_u + s*(eps_c - eps_u) — identical formula; the only difference is eps lives on a GPU now
(no training step — prediction only) ←→ mask = rand() < 0.10 → swap to null_emb — this one-line randomized dropout is what makes the two-signals-in-one-net trick work
The last layer is what 99% of production code looks like. diffusers hides all of the above — the two signals, the doubled forward pass, the extrapolation — behind a single argument.
from diffusers import StableDiffusionPipeline
import torch
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
).to("cuda")
prompt = "a corgi wearing sunglasses, 35mm film, sharp focus"
# The only CFG-relevant argument is guidance_scale — the dial itself.
# Everything you saw in layer 2 — null embedding, batched cond/uncond
# forward passes, the linear extrapolation — is happening inside this call.
for s in [1.0, 5.0, 7.5, 15.0]:
    image = pipe(
        prompt,
        guidance_scale=s,
        num_inference_steps=30,
    ).images[0]
    image.save(f"corgi_s{s}.png")

torch.cat([x, x]); unet(x_in, t, c_in).chunk(2) ←→ pipe(prompt, guidance_scale=s) — the double forward pass is still happening; it's just inside the pipeline
UNCOND_DROP_PROB = 0.10 at train time ←→ (already trained — baked into the checkpoint) — Stable Diffusion was trained with 10% uncond dropout; you inherit that
x = scheduler.step(eps_hat, t, x).prev_sample ←→ num_inference_steps=30 — the denoising loop is a kwarg now
Forgetting the 10% dropout at train time: if the model never saw the null embedding during training, its unconditional signal at sampling time is garbage — there's no ambient hum, just noise — and cranking the dial just amplifies that garbage. The dropout is not optional; it's how the unconditional channel exists.
Applying CFG to a model not trained for it: passing a null prompt to a plain conditional diffusion model doesn't give you a sensible ε_uncond. You'll get out-of-distribution noise and weird, broken images. Guidance is a train-time and inference-time contract together.
Guidance only on some steps: some pipelines apply the dial only during part of the denoising trajectory (e.g. early steps) to save compute. This can work but subtly changes the output; if you compare samples across papers, check whether guidance was applied uniformly or scheduled.
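A hedged sketch of what such a schedule might look like; the taper shape, cutoff, and function name are illustrative, not taken from any particular pipeline:

```python
def scheduled_scale(step, num_steps, s_max=7.5, warm_frac=0.5):
    """Illustrative guidance schedule: full guidance for the early steps,
    then a linear taper down to s = 1 (plain conditional) at the end."""
    frac = step / max(num_steps - 1, 1)
    if frac < warm_frac:
        return s_max
    # Linear taper from s_max down to 1.0 over the remaining steps.
    t = (frac - warm_frac) / (1 - warm_frac)
    return s_max + t * (1.0 - s_max)

# At each sampling step you'd use this in place of a fixed guidance_scale:
#   eps_hat = eps_u + scheduled_scale(i, num_steps) * (eps_c - eps_u)
for i in [0, 24, 25, 37, 49]:
    print(i, round(scheduled_scale(i, 50), 2))
```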
Negative prompts: in Stable Diffusion UIs the “negative prompt” is literally the unconditional embedding being replaced with a prompt for things you don't want. The dial then extrapolates away from those things. Same equation, different null choice.
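In the toy notation of layer 1, the swap looks like this (the numpy arrays and the `eps_neg` name are illustrative):

```python
import numpy as np

eps_cond = np.array([0.02, 0.14, 0.20, 0.08])  # prediction given the prompt
eps_neg = np.array([0.12, 0.06, 0.04, 0.15])   # prediction given the negative prompt
s = 7.5

# Same CFG equation, but the "unconditional" slot now holds the
# negative-prompt prediction, so extrapolation pushes away from it.
eps_hat = eps_neg + s * (eps_cond - eps_neg)
print(np.round(eps_hat, 2))
```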
Very high scales (s > 20): the linear extrapolation leaves the data manifold and artifacts appear. Some pipelines use dynamic thresholding (Imagen) to clip extreme values back in range. Unexplained saturation at high settings is usually this.
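A sketch of Imagen-style dynamic thresholding on a predicted clean image; 99.5 is the percentile the Imagen paper reports using, but the helper itself is illustrative:

```python
import numpy as np

def dynamic_threshold(x0, percentile=99.5):
    """Illustrative dynamic thresholding: pick a per-sample percentile of
    |x0| (values nominally in [-1, 1]), clip outliers to it, rescale back."""
    s = np.percentile(np.abs(x0), percentile)
    s = max(s, 1.0)                # never shrink an already in-range image
    return np.clip(x0, -s, s) / s  # clip extremes, renormalize to [-1, 1]

# High guidance pushed some pixels far outside [-1, 1]:
x0 = np.array([-0.4, 0.9, 2.8, -3.1, 0.2])
print(np.round(dynamic_threshold(x0), 3))
```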
Take a small class-conditional DDPM on CIFAR-10 (any of the reference implementations are fine — the whole model fits on a single GPU). Add the two CFG lines: (1) during training, drop the class label to a null token with probability 0.1; (2) during sampling, run the U-Net on both the class and the null token and combine with the guidance scale.
Now generate 16 samples each for class “cat” at s = 1, s = 5, and s = 10. Lay them out in a 3×16 grid.
What you should see as you crank the dial: at s = 1 some samples are clearly cats, others are ambiguous or drifting toward neighboring classes. At s = 5 almost every sample is unambiguously a cat, with more of the stereotypical “catness” the dataset contains. At s = 10 they're all very obviously cats, but diversity collapses and the colors start to look chalky or over-saturated. Document it in a short write-up.
Bonus: compute the FID score at each scale. You will almost always see FID degrade as s increases past 2–3 — even though human raters judge the high-scale samples as “more cat-like”. That gap is the fidelity-adherence tradeoff in numbers.
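A sketch of assembling that grid; random placeholder images stand in for your model's samples, so swap in real outputs from your sampler:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

scales = [1, 5, 10]
rng = np.random.default_rng(0)
# Placeholder "samples": in practice, each row comes from your CIFAR-10
# DDPM sampled for class "cat" at the corresponding guidance scale.
samples = {s: rng.random((16, 32, 32, 3)) for s in scales}

fig, axes = plt.subplots(len(scales), 16, figsize=(16, 3.2))
for row, s in enumerate(scales):
    for col in range(16):
        ax = axes[row, col]
        ax.imshow(samples[s][col])
        ax.set_xticks([])
        ax.set_yticks([])
    axes[row, 0].set_ylabel(f"s = {s}")  # label each row with its dial setting
fig.savefig("cfg_grid.png", dpi=150, bbox_inches="tight")
```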
What to carry forward. CFG is one equation — ε̂ = ε_uncond + s·(ε_cond − ε_uncond) — and one training trick: randomly drop the caption 10% of the time so the same network learns both the conditional signal and the unconditional hum. At sampling you pay 2× compute for two forward passes, and you get a single dial (s) that trades diversity for prompt adherence. It is the standard conditioning mechanism in every text-to-image, text-to-audio, and text-to-video diffusion model deployed today.
The beyond-image extensions all use the same equation with different conditioning: AudioLDM guides on text → spectrograms, VideoDM guides on text → video frames, and there's a growing body of work applying CFG-style guidance to autoregressive LLMs. When you see “guidance scale” in a paper about anything generative, it's almost certainly this same dial between two signals.
Next up — Latent Diffusion. So far we've been running diffusion directly in pixel space — which is fine for CIFAR at 32×32 but catastrophically expensive at 1024×1024. Latent diffusion runs the whole process in a compressed VAE latent space instead, making Stable Diffusion-scale models possible on consumer hardware. It's the single architectural trick that made text-to-image go from research demo to phone app. We'll build it next.
- [01] Ho & Salimans, "Classifier-Free Diffusion Guidance" · NeurIPS 2021 Workshop / arXiv:2207.12598 · 2022
- [02] Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis" · NeurIPS 2021 — the classifier guidance paper · 2021
- [03] Rombach, Blattmann, Lorenz, Esser & Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models" · CVPR 2022 — Stable Diffusion · 2022
- [04] Nichol, Dhariwal, Ramesh, et al., "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models" · ICML 2022 — first large-scale text-to-image CFG · 2022
- [05] Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" · NeurIPS 2022 — Imagen, dynamic thresholding for high CFG · 2022
- [06] Meng, Rombach, Gao, et al., "On Distillation of Guided Diffusion Models" · CVPR 2023 — folding CFG into a single forward pass · 2023