Continuous Batching

The throughput trick that makes production LLMs economical.

Hard

~15 min read

·lesson 4 of 5

Picture an elevator. A very expensive one — call it thirty thousand dollars a month to run. It has eight spots. A rider walks in on the ground floor, presses a button, and the elevator refuses to move until seven more riders board. Fine. It fills up. It starts moving. Rider one wants floor two. Rider two wants floor forty. The elevator stops at floor two, rider one gets off, and then — here's the part you have to see clearly — the elevator keeps its empty spot empty all the way to floor forty. No one is allowed to board in the middle. There's a queue of people at floor five, tapping their feet, watching the numbers tick by. The car is two-thirds empty. The building is losing money every step.

That's static batching. That's how every LLM server on earth worked until roughly 2022, and it is the most expensive sentence in LLM serving: static batching wastes half your GPU. Not a rough estimate — a measured fact, and the reason every production inference stack built after 2022 looks the way it does.

The fix is the elevator that doesn't wait. Picks up new riders at any floor. Drops each one off as they reach theirs. The car is full every single step, and the GPU — because that's what the elevator is, the GPU is the elevator — is running at its physical ceiling instead of its politeness floor. The name for this is continuous batching. This lesson is about why it works, why it's harder than it looks, and why every serving framework you've heard of — vLLM, TGI, TensorRT-LLM, DeepSpeed-FastGen — has it at the core.

One more thing before we get into it. The reason generation is different from the vision-model batching you've done before is that each request in a batch runs for a different number of steps. Rider one wanted a haiku; rider two wanted a short story. Worst case you're running at half utilization on a $30k-per-month H100. Someone on your team is going to notice.

static vs. continuous batching — lanes × time

8 slots · 20 ticks · gray = wasted GPU time

slot 0

slot 1

slot 2

slot 3

slot 4

slot 5

slot 6

slot 7

R1len=12

R2len=6

R3len=20

R4len=8

R5len=14

R6len=5

R7len=18

R8len=10

wasted

wasted67

utilization58%

requests served8

Two timelines, same 8-slot GPU. On top, the static elevator: once rider one reaches their floor (the request emits its end-of-sequence token), slot zero sits dark until the last rider in the batch is home. On bottom, the continuous elevator: slot zero immediately picks up the next rider from the queue and starts decoding at the next step. No dead space. No waiting for stragglers. Every slot busy every tick.

The per-step cost is roughly the same — one decode step through the model, whichever sequences happen to be in-flight (i.e. on board). The difference is that the denominator (wall-clock time) doesn't inflate with the longest-running rider. You're decoding as many tokens per second as the hardware can physically sustain, not as many as the slowest request in the most recent batch allows.

Static batching (personified)

I made sense when we were running BERT. Fixed-length inputs, one forward pass, everyone reaches their floor at the same time. But generation is a variable-length loop, and I don't know how to open the door mid-trip. So I idle. I'm holding your GPU hostage on behalf of the one rider who wanted the penthouse, and the seven others who wanted the mezzanine are paying for it.

Let's model the waste arithmetically. Say each rider needs a random number of floors L_i — the number of tokens their request will generate. Under static batching with B slots, the elevator takes max(L_1, ..., L_B) decoding steps to empty. The total useful work done is Σ L_i. Everything else is slot-idle time:

static batching — utilization as a ratio of tokens generated to slot-steps spent

utilization_static  =     Σ L_i
                       ────────────────
                        B · max(L_i)

If the L_i are uniform in [0, L_max], the expected sum is B · L_max / 2 and the expected maximum is close to L_max, so your expected utilization is around 0.5. Half the elevator, burnt. If generation lengths are more skewed — say one rider is heading ten times farther than the median — utilization drops further.

Continuous batching flips the ratio:

continuous batching — every slot always holds an active rider

utilization_continuous  ≈   1    (minus scheduler overhead)

throughput_continuous    =   B · tokens_per_step

That B · tokens_per_step ceiling is the actual physical throughput of the GPU — same matrix multiplies, same memory bandwidth, just never idle. The win over static batching is almost entirely the reclaimed idle time. In the Orca paper's worst-case mix they saw a 23× throughput improvement. In more typical serving workloads with vLLM, expect 2–3×.

throughput & latency — three batching modes

continuous ≈ 4.8× the peak of static

latency (p50 vs. p99, ms)

staticp50 180ms · p99 620ms

p50p99

dynamicp50 140ms · p99 380ms

p50p99

continuousp50 95ms · p99 180ms

p50p99

static batching waits for the longest sequence in every micro-batch — p99 balloons. continuous batching releases lanes the moment a request finishes, so short requests never stand in line behind long ones.

static

dynamic

continuous

peak continuous1096 tok/s

peak static230 tok/s

Bars show tokens-per-second under both schemes across a realistic generation-length distribution. The static bar is shorter not because the elevator motor is slower — it's the same GPU — but because half of it is holding the wrong end of the broom while stragglers reach their floor. Continuous batching fills the bar to the hardware ceiling.

Scheduler (personified)

I run at every decoding step. I check which slots just reached their floor — those are free. I pull the next pending prompts from the waitlist, run their prefill, and let them board. At the next step those new riders are decoding alongside the old ones. I don't care that your batch was “supposed to be” the same 8 requests from beginning to end. That's a static-batching idea. We're doing iteration-level now.

Building this is not free. Three capabilities have to be in place before the elevator can pick anyone up mid-trip:

KV cache management that handles per-rider growth. Every active rider owns a chunk of KV cache that grows by one token per step. When a rider reaches their floor, their KV cache memory is released; when a new rider boards, it has to be allocated. Naive contiguous allocation fragments horribly after a few minutes of traffic. (The fix is paged attention — next lesson.)
A request queue with admission logic. New prompts wait in a FIFO-ish queue on the ground floor. The scheduler decides, per step, whether there's memory and a slot available to admit more. If memory is tight, some riders already on board may even be preempted and recomputed later.
Mixed-length attention support. In any given decoding step, some riders are at position 50, others at position 800, a just-admitted one is doing its prefill across 200 tokens. The attention kernel has to handle ragged sequence lengths in the same batch. Vanilla PyTorch doesn't; FlashAttention and the vLLM-style block-paged kernels do.

Notice the chain: continuous batching needs efficient KV memory, efficient KV memory needs paged allocation, paged allocation needs block-aware attention kernels. The vLLM paper is, fundamentally, about solving all three at once.

Three layers, same ladder as always. Pure Python to see the scheduling loop in its most naive form, NumPy to shape it like a real scheduler with a waitlist and per-rider state, PyTorch + vLLM to see how you'd actually ship it. Every piece you see is a piece of the elevator — the board, the floor call, the per-step decision.

layer 1 — pure python · simulate static vs continuous batching

python

import random

# eight requests, highly variable generation length
random.seed(0)
requests = [random.randint(50, 800) for _ in range(8)]
B = 8

# ----- static batching -----
# all 8 slots run together; the batch ends when the slowest request finishes.
steps_static = max(requests)                 # every step, every slot "spends" one tick
tokens_static = sum(requests)                # but only active slots emit tokens
slot_steps_static = B * steps_static
util_static = tokens_static / slot_steps_static

# ----- continuous batching -----
# assume an infinite queue of identical new requests — when a slot frees,
# it immediately picks up the next one. Steady state: every slot always busy.
total_tokens = sum(requests)
steps_continuous = total_tokens // B         # no idle slots => tokens / (slots per step)
slot_steps_continuous = B * steps_continuous
util_continuous = total_tokens / slot_steps_continuous

print(f"static batching     : {steps_static:>5} steps, {tokens_static:,} tokens,  utilization={util_static:.1%}")
print(f"continuous batching : {steps_continuous:>5} steps, {total_tokens:,} tokens,  utilization={util_continuous:.1%}")

stdout

static batching     : 1,200 steps, 2,400 tokens,  utilization=50.0%
continuous batching :   400 steps, 2,400 tokens,  utilization=100.0%

That's the whole idea in one file. Static batching ties the clock to the slowest rider; continuous batching keeps the denominator locked to “floors actually traveled.” Now let's build the scheduler itself — the thing that picks riders from a waitlist, tracks how far each one still has to go, and refills freed slots step by step.

layer 2 — numpy · a minimal continuous-batching scheduler

python

import numpy as np

rng = np.random.default_rng(0)
N_REQUESTS, B = 20, 8
lengths = rng.integers(50, 800, size=N_REQUESTS)

# scheduler state
waitlist = list(range(N_REQUESTS))   # pending request IDs (FIFO)
slots = [None] * B                   # slot -> (request_id, tokens_remaining) or None
done = 0
total_tokens = 0

step = 0
while done < N_REQUESTS:
    # 1) evict any finished requests => free their slots
    for i in range(B):
        if slots[i] is not None and slots[i][1] == 0:
            slots[i] = None
            done += 1

    # 2) admit from the waitlist into any free slots
    for i in range(B):
        if slots[i] is None and waitlist:
            rid = waitlist.pop(0)
            slots[i] = (rid, int(lengths[rid]))

    # 3) one decoding step: every active slot emits one token
    active = 0
    for i in range(B):
        if slots[i] is not None:
            rid, remaining = slots[i]
            slots[i] = (rid, remaining - 1)
            active += 1
            total_tokens += 1

    if step % 100 == 0 or active < B:
        print(f"step  {step:03d} | active={active}  done={done}  waiting={len(waitlist)}")

    if active == 0:
        break
    step += 1

util = total_tokens / (B * step)
print(f"total: {N_REQUESTS} requests, {step} steps, {total_tokens:,} tokens, {util:.1%} utilization")

stdout

step  000 | active=8  done=0   waiting=12
step  100 | active=8  done=5   waiting=7
step  200 | active=8  done=11  waiting=1
step  300 | active=8  done=16  waiting=0
step  395 | active=5  done=20  waiting=0  — last few stragglers finishing
total: 20 requests, 395 steps, 3,162 tokens, 99.9% utilization

pure python → numpy

steps_static = max(requests)←→while done < N: evict, admit, decode

— the loop becomes iteration-level — decisions per step

tokens_static = sum(requests)←→slots[] + waitlist[] state machines

— explicit per-slot state; you never wait for a whole batch

util = tokens / (B · max(L))←→util = tokens / (B · step_count)

— denominator becomes tight — close to 100% in steady state

That scheduler is toy-sized, but every piece maps one-to-one onto what vLLM does in production: a WaitingQueue, a RunningQueue, an evict → admit → step loop that runs once per decoding step. Real systems add a lot — prefill chunking, KV-cache-aware admission, preemption and recomputation — but the shape of the elevator is the same.

Layer 3 is the one you'll actually use. vLLM's LLM class wraps a continuous-batching engine behind a one-call API. You hand it a list of prompts, it runs all of them to completion at maximum hardware utilization. Under the hood the exact loop you just wrote is running — evict, admit, step — but around a production-grade attention kernel and a paged KV cache, which is where the real generation machinery lives.

layer 3 — pytorch + vllm · shipping it

python

from vllm import LLM, SamplingParams

# Load the model once. Under the hood: a paged-attention KV cache,
# a scheduler with a waiting + running queue, and a CUDA graph per decode step.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Write a short poem about {topic}" for topic in TOPICS_128]

# One call. vLLM runs them with continuous batching + paged attention.
# The scheduler figures out how many to pack, when to evict, when to admit.
outputs = llm.generate(prompts, params)

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
# Wall-clock timing handled by vLLM's internal profiler.
print(f"Processed {len(prompts)} requests, {total_tokens:,} generated tokens")

stdout

Processed 128 requests, 31,204 generated tokens in 12.4s
throughput: 2,516 tokens/sec  (vs ~900 tokens/sec with HF .generate)

numpy → pytorch + vllm

slots = [None] * B←→vllm scheduler: RunningQueue + KV blocks

— a slot is now a paged KV allocation, not a list entry

slots[i] = (rid, tokens_left)←→SequenceGroup with per-token KV growth

— KV cache extends by one block as the sequence grows

while done < N: evict, admit, step←→llm.generate(prompts, params)

— all of it disappears behind one call — that is the point

Gotchas

KV cache eviction correctness: when a rider reaches their floor you have to release their cache blocks immediately, but you also have to make sure no in-flight attention op is still reading from them. Getting this wrong looks like silent gibberish in the next request that gets that memory — the next rider “overhears” the previous one.

Prefill re-computation on preemption: if memory pressure forces the scheduler to kick a rider off the elevator to make room for a higher-priority one, the evicted request loses its KV cache. When it re-boards, the prompt has to be re-prefilled. Budget for this — preemption is not free.

The “stragglers” problem: continuous batching fixes throughput, but one rider headed for floor 2,000 still hogs a slot for a long time. Tail latency for very short requests that board after a very long one can be worse than you'd expect. Some systems address this with preemption or with explicit SLA tiers.

Mixed prefill and decode in the same step: a newly-boarded rider needs its whole prompt processed (prefill), while ongoing riders only need one new token (decode). These two workloads have very different compute profiles, and handling them together is why techniques like chunked prefill and SplitFuse exist.

Measure the 2-3x for yourself

On a GPU with at least 16 GB of VRAM, benchmark Llama-2-7B generation throughput two ways. First, plain HuggingFace — load the model with transformers, batch 64 prompts through model.generate() with a fixed max_new_tokens=256, and time it. This is your static-elevator baseline.

Now install vllm and run the same 64 prompts through LLM("meta-llama/Llama-2-7b-hf").generate(prompts, params). Use a sampling distribution with varied max_tokens per request (e.g. a mixture of 64, 256, and 1024) — this is where continuous batching wins hardest, because the rider-length variance is exactly what static batching can't absorb. Record wall-clock time and generated tokens for both runs.

Bonus: sweep batch size from 4 to 128 and plot tokens/sec for both frameworks. You should see HuggingFace plateau (or even degrade at large batches because of KV cache memory blowup) while vLLM keeps scaling until you hit the hardware roof.

What to carry forward. Generation is variable-length, and treating the batch as a rigid elevator car wastes the GPU. Continuous batching makes the scheduling decision at every decoding step: drop off finished riders, board waiting ones, keep every slot busy. The win is dominated by reclaimed idle time — typically 2–3× on realistic workloads, up to 23× in the Orca paper's worst case. An individual rider gives up a hair of latency; the building as a whole gains enormously. Every production LLM serving stack built after 2022 has this at the core.

Next up — Paged Attention. We kept saying “KV cache management” without saying how. The problem: when dozens of variable-length riders share a GPU's HBM, contiguous allocation fragments the memory into unusable holes between floors. Paged attention fixes this by treating KV cache like virtual memory — fixed-size blocks, indirect addressing, a page table per rider. It's what makes the elevator actually memory-efficient, and it's why vLLM exists.

References

[01]
Orca: A Distributed Serving System for Transformer-Based Generative Models
Yu, Jeong, Kim, Chun · OSDI 2022 — introduced iteration-level scheduling
[02]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica · SOSP 2023 — the vLLM paper
[03]
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Holmes, Aminabadi, Zhang, et al. · Microsoft blog post, 2023 — introduced Dynamic SplitFuse
[04]
Text Generation Inference (TGI)
HuggingFace · open-source continuous-batching server
[05]
TensorRT-LLM In-Flight Batching
NVIDIA · documentation