Continuous Batching
The throughput trick that makes production LLMs economical.
Picture an elevator. A very expensive one — call it thirty thousand dollars a month to run. It has eight spots. A rider walks in on the ground floor, presses a button, and the elevator refuses to move until seven more riders board. Fine. It fills up. It starts moving. Rider one wants floor two. Rider two wants floor forty. The elevator stops at floor two, rider one gets off, and then — here's the part you have to see clearly — the elevator keeps its empty spot empty all the way to floor forty. No one is allowed to board in the middle. There's a queue of people at floor five, tapping their feet, watching the numbers tick by. The car is two-thirds empty. The building is losing money every step.
That's static batching. That's how every LLM server on earth worked until roughly 2022, and it is the most expensive sentence in LLM serving: static batching wastes half your GPU. Not a rough estimate — a measured fact, and the reason every production inference stack built after 2022 looks the way it does.
The fix is the elevator that doesn't wait. Picks up new riders at any floor. Drops each one off as they reach theirs. The car is full every single step, and the GPU — because that's what the elevator is, the GPU is the elevator — is running at its physical ceiling instead of its politeness floor. The name for this is continuous batching. This lesson is about why it works, why it's harder than it looks, and why every serving framework you've heard of — vLLM, TGI, TensorRT-LLM, DeepSpeed-FastGen — has it at the core.
One more thing before we get into it. The reason generation is different from the vision-model batching you've done before is that each request in a batch runs for a different number of steps. Rider one wanted a haiku; rider two wanted a short story. Worst case you're running at half utilization on a $30k-per-month H100. Someone on your team is going to notice.
Two timelines, same 8-slot GPU. On top, the static elevator: once rider one reaches their floor (the request emits its end-of-sequence token), slot zero sits dark until the last rider in the batch is home. On bottom, the continuous elevator: slot zero immediately picks up the next rider from the queue and starts decoding at the next step. No dead space. No waiting for stragglers. Every slot busy every tick.
The per-step cost is roughly the same — one decode step through the model, whichever sequences happen to be in-flight (i.e. on board). The difference is that the denominator (wall-clock time) doesn't inflate with the longest-running rider. You're decoding as many tokens per second as the hardware can physically sustain, not as many as the slowest request in the most recent batch allows.
I made sense when we were running BERT. Fixed-length inputs, one forward pass, everyone reaches their floor at the same time. But generation is a variable-length loop, and I don't know how to open the door mid-trip. So I idle. I'm holding your GPU hostage on behalf of the one rider who wanted the penthouse, and the seven others who wanted the mezzanine are paying for it.
Let's model the waste arithmetically. Say each rider needs a random number of floors L_i — the number of tokens their request will generate. Under static batching with B slots, the elevator takes max(L_1, ..., L_B) decoding steps to empty. The total useful work done is Σ L_i. Everything else is slot-idle time:
utilization_static = Σ L_i
────────────────
B · max(L_i)If the L_i are uniform in [0, L_max], the expected sum is B · L_max / 2 and the expected maximum is close to L_max, so your expected utilization is around 0.5. Half the elevator, burnt. If generation lengths are more skewed — say one rider is heading ten times farther than the median — utilization drops further.
Continuous batching flips the ratio:
utilization_continuous ≈ 1 (minus scheduler overhead) throughput_continuous = B · tokens_per_step
That B · tokens_per_step ceiling is the actual physical throughput of the GPU — same matrix multiplies, same memory bandwidth, just never idle. The win over static batching is almost entirely the reclaimed idle time. In the Orca paper's worst-case mix they saw a 23× throughput improvement. In more typical serving workloads with vLLM, expect 2–3×.
Bars show tokens-per-second under both schemes across a realistic generation-length distribution. The static bar is shorter not because the elevator motor is slower — it's the same GPU — but because half of it is holding the wrong end of the broom while stragglers reach their floor. Continuous batching fills the bar to the hardware ceiling.
I run at every decoding step. I check which slots just reached their floor — those are free. I pull the next pending prompts from the waitlist, run their prefill, and let them board. At the next step those new riders are decoding alongside the old ones. I don't care that your batch was “supposed to be” the same 8 requests from beginning to end. That's a static-batching idea. We're doing iteration-level now.
Building this is not free. Three capabilities have to be in place before the elevator can pick anyone up mid-trip:
- KV cache management that handles per-rider growth. Every active rider owns a chunk of KV cache that grows by one token per step. When a rider reaches their floor, their KV cache memory is released; when a new rider boards, it has to be allocated. Naive contiguous allocation fragments horribly after a few minutes of traffic. (The fix is paged attention — next lesson.)
- A request queue with admission logic. New prompts wait in a FIFO-ish queue on the ground floor. The scheduler decides, per step, whether there's memory and a slot available to admit more. If memory is tight, some riders already on board may even be preempted and recomputed later.
- Mixed-length attention support. In any given decoding step, some riders are at position 50, others at position 800, a just-admitted one is doing its prefill across 200 tokens. The attention kernel has to handle ragged sequence lengths in the same batch. Vanilla PyTorch doesn't; FlashAttention and the vLLM-style block-paged kernels do.
Notice the chain: continuous batching needs efficient KV memory, efficient KV memory needs paged allocation, paged allocation needs block-aware attention kernels. The vLLM paper is, fundamentally, about solving all three at once.
Three layers, same ladder as always. Pure Python to see the scheduling loop in its most naive form, NumPy to shape it like a real scheduler with a waitlist and per-rider state, PyTorch + vLLM to see how you'd actually ship it. Every piece you see is a piece of the elevator — the board, the floor call, the per-step decision.
import random
# eight requests, highly variable generation length
random.seed(0)
requests = [random.randint(50, 800) for _ in range(8)]
B = 8
# ----- static batching -----
# all 8 slots run together; the batch ends when the slowest request finishes.
steps_static = max(requests) # every step, every slot "spends" one tick
tokens_static = sum(requests) # but only active slots emit tokens
slot_steps_static = B * steps_static
util_static = tokens_static / slot_steps_static
# ----- continuous batching -----
# assume an infinite queue of identical new requests — when a slot frees,
# it immediately picks up the next one. Steady state: every slot always busy.
total_tokens = sum(requests)
steps_continuous = total_tokens // B # no idle slots => tokens / (slots per step)
slot_steps_continuous = B * steps_continuous
util_continuous = total_tokens / slot_steps_continuous
print(f"static batching : {steps_static:>5} steps, {tokens_static:,} tokens, utilization={util_static:.1%}")
print(f"continuous batching : {steps_continuous:>5} steps, {total_tokens:,} tokens, utilization={util_continuous:.1%}")static batching : 1,200 steps, 2,400 tokens, utilization=50.0% continuous batching : 400 steps, 2,400 tokens, utilization=100.0%
That's the whole idea in one file. Static batching ties the clock to the slowest rider; continuous batching keeps the denominator locked to “floors actually traveled.” Now let's build the scheduler itself — the thing that picks riders from a waitlist, tracks how far each one still has to go, and refills freed slots step by step.
import numpy as np
rng = np.random.default_rng(0)
N_REQUESTS, B = 20, 8
lengths = rng.integers(50, 800, size=N_REQUESTS)
# scheduler state
waitlist = list(range(N_REQUESTS)) # pending request IDs (FIFO)
slots = [None] * B # slot -> (request_id, tokens_remaining) or None
done = 0
total_tokens = 0
step = 0
while done < N_REQUESTS:
# 1) evict any finished requests => free their slots
for i in range(B):
if slots[i] is not None and slots[i][1] == 0:
slots[i] = None
done += 1
# 2) admit from the waitlist into any free slots
for i in range(B):
if slots[i] is None and waitlist:
rid = waitlist.pop(0)
slots[i] = (rid, int(lengths[rid]))
# 3) one decoding step: every active slot emits one token
active = 0
for i in range(B):
if slots[i] is not None:
rid, remaining = slots[i]
slots[i] = (rid, remaining - 1)
active += 1
total_tokens += 1
if step % 100 == 0 or active < B:
print(f"step {step:03d} | active={active} done={done} waiting={len(waitlist)}")
if active == 0:
break
step += 1
util = total_tokens / (B * step)
print(f"total: {N_REQUESTS} requests, {step} steps, {total_tokens:,} tokens, {util:.1%} utilization")step 000 | active=8 done=0 waiting=12 step 100 | active=8 done=5 waiting=7 step 200 | active=8 done=11 waiting=1 step 300 | active=8 done=16 waiting=0 step 395 | active=5 done=20 waiting=0 — last few stragglers finishing total: 20 requests, 395 steps, 3,162 tokens, 99.9% utilization
steps_static = max(requests)←→while done < N: evict, admit, decode— the loop becomes iteration-level — decisions per step
tokens_static = sum(requests)←→slots[] + waitlist[] state machines— explicit per-slot state; you never wait for a whole batch
util = tokens / (B · max(L))←→util = tokens / (B · step_count)— denominator becomes tight — close to 100% in steady state
That scheduler is toy-sized, but every piece maps one-to-one onto what vLLM does in production: a WaitingQueue, a RunningQueue, an evict → admit → step loop that runs once per decoding step. Real systems add a lot — prefill chunking, KV-cache-aware admission, preemption and recomputation — but the shape of the elevator is the same.
Layer 3 is the one you'll actually use. vLLM's LLM class wraps a continuous-batching engine behind a one-call API. You hand it a list of prompts, it runs all of them to completion at maximum hardware utilization. Under the hood the exact loop you just wrote is running — evict, admit, step — but around a production-grade attention kernel and a paged KV cache, which is where the real generation machinery lives.
from vllm import LLM, SamplingParams
# Load the model once. Under the hood: a paged-attention KV cache,
# a scheduler with a waiting + running queue, and a CUDA graph per decode step.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Write a short poem about {topic}" for topic in TOPICS_128]
# One call. vLLM runs them with continuous batching + paged attention.
# The scheduler figures out how many to pack, when to evict, when to admit.
outputs = llm.generate(prompts, params)
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
# Wall-clock timing handled by vLLM's internal profiler.
print(f"Processed {len(prompts)} requests, {total_tokens:,} generated tokens")Processed 128 requests, 31,204 generated tokens in 12.4s throughput: 2,516 tokens/sec (vs ~900 tokens/sec with HF .generate)
slots = [None] * B←→vllm scheduler: RunningQueue + KV blocks— a slot is now a paged KV allocation, not a list entry
slots[i] = (rid, tokens_left)←→SequenceGroup with per-token KV growth— KV cache extends by one block as the sequence grows
while done < N: evict, admit, step←→llm.generate(prompts, params)— all of it disappears behind one call — that is the point
KV cache eviction correctness: when a rider reaches their floor you have to release their cache blocks immediately, but you also have to make sure no in-flight attention op is still reading from them. Getting this wrong looks like silent gibberish in the next request that gets that memory — the next rider “overhears” the previous one.
Prefill re-computation on preemption: if memory pressure forces the scheduler to kick a rider off the elevator to make room for a higher-priority one, the evicted request loses its KV cache. When it re-boards, the prompt has to be re-prefilled. Budget for this — preemption is not free.
The “stragglers” problem: continuous batching fixes throughput, but one rider headed for floor 2,000 still hogs a slot for a long time. Tail latency for very short requests that board after a very long one can be worse than you'd expect. Some systems address this with preemption or with explicit SLA tiers.
Mixed prefill and decode in the same step: a newly-boarded rider needs its whole prompt processed (prefill), while ongoing riders only need one new token (decode). These two workloads have very different compute profiles, and handling them together is why techniques like chunked prefill and SplitFuse exist.
On a GPU with at least 16 GB of VRAM, benchmark Llama-2-7B generation throughput two ways. First, plain HuggingFace — load the model with transformers, batch 64 prompts through model.generate() with a fixed max_new_tokens=256, and time it. This is your static-elevator baseline.
Now install vllm and run the same 64 prompts through LLM("meta-llama/Llama-2-7b-hf").generate(prompts, params). Use a sampling distribution with varied max_tokens per request (e.g. a mixture of 64, 256, and 1024) — this is where continuous batching wins hardest, because the rider-length variance is exactly what static batching can't absorb. Record wall-clock time and generated tokens for both runs.
Bonus: sweep batch size from 4 to 128 and plot tokens/sec for both frameworks. You should see HuggingFace plateau (or even degrade at large batches because of KV cache memory blowup) while vLLM keeps scaling until you hit the hardware roof.
What to carry forward. Generation is variable-length, and treating the batch as a rigid elevator car wastes the GPU. Continuous batching makes the scheduling decision at every decoding step: drop off finished riders, board waiting ones, keep every slot busy. The win is dominated by reclaimed idle time — typically 2–3× on realistic workloads, up to 23× in the Orca paper's worst case. An individual rider gives up a hair of latency; the building as a whole gains enormously. Every production LLM serving stack built after 2022 has this at the core.
Next up — Paged Attention. We kept saying “KV cache management” without saying how. The problem: when dozens of variable-length riders share a GPU's HBM, contiguous allocation fragments the memory into unusable holes between floors. Paged attention fixes this by treating KV cache like virtual memory — fixed-size blocks, indirect addressing, a page table per rider. It's what makes the elevator actually memory-efficient, and it's why vLLM exists.
- [01]Yu, Jeong, Kim, Chun · OSDI 2022 — introduced iteration-level scheduling
- [02]Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica · SOSP 2023 — the vLLM paper
- [03]Holmes, Aminabadi, Zhang, et al. · Microsoft blog post, 2023 — introduced Dynamic SplitFuse
- [04]HuggingFace · open-source continuous-batching server
- [05]NVIDIA · documentation