Paged Attention
vLLM's virtual-memory-inspired KV cache.
Imagine you run a hotel. Every guest checks in and you hand them the exact same thing: the presidential suite. Doesn't matter if they're staying five minutes or five nights — they get the big room with the view and the extra closet. Most of them leave it untouched. When they check out you flip the mattress and hold the suite for the next guest, who also, obviously, gets a suite. You are running out of suites. The lobby is full of people you're turning away. The hotel, to use a technical term, is cooked.
That's what a naive KV cache allocator does to your GPU. Every new request arrives and is handed one long contiguous slab of memory sized to the worst case — max_seq_len, 4k or 32k or 128k tokens — regardless of how long the request actually turns out to be. A three-word question gets the same slab as a thousand-token essay. The three-word request finishes, its slab goes back on the shelf to be wasted by the next short request, and the accounting says your GPU is “full” while most of it is quite literally untouched bytes.
Then it gets worse. Continuous serving admits and evicts sequences of wildly different sizes. Those contiguous slabs finish at different times and leave a hole-shaped graveyard behind them. Memory fragmentation: total free bytes on the GPU are massive; none of the holes are big enough for the next incoming request. You're refusing customers in an empty restaurant because you can't find four chairs in a row.
The fix is stolen, wholesale, from the trick your operating system has been quietly running for fifty years. Stop handing out contiguous RAM. Break memory into fixed-size pages. Keep a per-sequence page table that maps the sequence's logical view of its cache onto whatever physical blocks happen to be free. Let every sequence's cache be a scattered chain of pages pulled from a shared pool — the GPU kernel can chase the pointers, the allocator can never get stuck. Welcome to PagedAttention, the vLLM-era idea that dragged LLM serving from ~30% memory utilization to 96%+ and made long-context, high-concurrency inference economically viable.
I give every request one contiguous slab of memory sized tomax_seq_lenon arrival. Most of them never fill it. When they finish I take the slab back. When a new request arrives I look for a single contiguous hole big enough to fitmax_seq_lenagain. On a busy GPU there often isn't one, even though I technically have the bytes. I am the reason you're getting OOMs with the GPU 40% empty.
Put numbers on the waste. A typical chat distribution: mean prompt length ~200 tokens, long tail out to 4k. If you allocate max_seq_len = 4096 per request and the actual completion averages 500 tokens, you're writing to 500 slots and reserving 4096. The rest is internal padding — allocated, charged to the request, never touched. On a paged allocator with block size 16, the same 500-token request uses 32 blocks and wastes at most the tail of the last one.
naive contiguous allocation (one slab of RAM per request): per request = max_seq_len · 2 · n_layers · n_heads · d_head · bytes_per_elem actually used = actual_len · 2 · n_layers · n_heads · d_head · bytes_per_elem utilization = actual_len / max_seq_len ≈ 20%–40% in practice + external fragmentation loss (holes that don't fit the next request) effective utilization ≈ 20%–50% (30–50% is the field-reported baseline) paged allocation (block size B, e.g. 16 tokens per page): pages per request = ceil(actual_len / B) overhead per req = B − (actual_len mod B) tokens (last page only) utilization = actual_len / (ceil(actual_len/B) · B) ≈ 96%+ + zero external fragmentation (every page is the same size — any hole fits)
Two wins stacked. First, internal waste drops from “the rest of the slab” to “the tail of the last page” — from potentially 3596 unused tokens to at most 15. Second, external fragmentation disappears entirely, because every free hole in memory is exactly one page — and every request's next allocation is also exactly one page. Any hole fits any request. The allocator can never get stuck looking for a contiguous run that isn't there.
Four concurrent requests, each with its own length. Watch how each cache is not a single contiguous region — it's a sequence of small pages scattered across the GPU. The page table on the right is the key abstraction: it maps the logical position of a token in the sequence (“position 42 of sequence 3”) to the physical page in GPU memory where its K and V live. The sequence thinks its cache is one long line. The kernel knows it's a handful of scattered pages. The page table lies to both of them and keeps the peace.
I am the translation layer. For each sequence I keep a little list: “logical page 0 lives at physical page 47, logical page 1 lives at physical page 12, logical page 2 lives at physical page 89.” When the sequence needs a new page I grab any free one and append it. When the sequence dies I return every page on my list to the pool. To the attention kernel I hand over my list and let the kernel gather what it needs. I am small, I am boring, and I am the whole reason this works.
The fragmentation story is a classic of operating-systems memory management, dressed up for GPUs. Here it is in three lines of pseudo-time.
t=0: [====== A ======][======= B ======][====== C ======][====== D ======] 100% used
free: 0
t=1: B finishes, C finishes:
[====== A ======][ free — 4k slots ][ free — 4k slots ][====== D ======]
↑ 8k free, but in TWO holes
t=2: new request E needs 6k contiguous slots:
→ REJECTED. total free = 8k, largest contiguous hole = 4k.
paged allocation at t=2:
B's and C's pages dissolve back into the shared pool as individual pages;
E needs 6k tokens → 375 pages of size 16;
allocator grabs any 375 free pages in any order → admitted, instantly.The naive version has 8k free tokens on the GPU and still rejects a 6k request. The paged version sees 8k free tokens as 500 free pages and doesn't care where in RAM they live. Same bytes on the same chip. One allocator is stuck, the other keeps serving.
Run the simulation. The contiguous allocator looks healthy for a while — then a churn of short requests leaves the slab riddled with holes, and throughput flatlines even though plenty of GPU memory is technically free. The paged allocator runs straight through. No holes to stumble over, because everything is the same size. The only way to run out of pages is to actually be full — not because the bookkeeping failed you.
This is the headline number from the vLLM paper: 2–4× higher throughput on the same hardware, measured on real traces. It isn't a kernel trick or a lower-precision format. It's just refusing to allocate memory you haven't used yet.
There's a second trick the paging abstraction unlocks for free. Copy-on-write. When two sequences share a prefix — think attention over a beam search with beam width 4, or parallel sampling with n=4, or a chat with four alternative continuations of the same system prompt — they read the same keys and values for that prefix. Naive allocation duplicates the prefix cache four times. There's no reason to: the K and V for a token are a deterministic function of the token and its position. If the tokens are the same and the positions are the same, the pages are bit-identical.
So: point all four sequences at the same physical pages for the shared prefix. When one of them diverges and writes a different token, then allocate a fresh page for just that sequence and copy the prefix page into it. Most sequences never diverge until late in generation. Most of the prefix RAM stays shared forever. Beam search and parallel sampling get nearly free — a huge deal when you're serving anything that samples multiple continuations.
I watch who's reading from which physical page. When four sequences all point at page 12 and only read from it, I do nothing — they share happily. The instant one of them tries to write something different, I duplicate page 12 into a fresh physical page, update that one sequence's page table to point at the copy, and let it write. The other three never notice. Filesystems, process forks, and now KV caches — same trick, three decades apart.
Three layers. A pure-Python page allocator you could write on a napkin, a NumPy simulation showing the fragmentation story over time, and the PyTorch version which — true to form for a production technique — is one import and a keyword argument.
# A minimal paged KV-cache allocator. Fixed-size blocks, free-list, block tables.
# This is the architectural core of vLLM, stripped to its skeleton.
BLOCK_SIZE = 16
NUM_BLOCKS = 16
class PagedAllocator:
def __init__(self, num_blocks=NUM_BLOCKS):
self.free = list(range(num_blocks)) # pool of physical block ids
self.tables = {} # seq_id -> [physical block ids]
self.lengths = {} # seq_id -> token count
def allocate(self, seq_id, num_tokens):
blocks_needed = (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
if blocks_needed > len(self.free):
raise RuntimeError("out of blocks")
blocks = [self.free.pop(0) for _ in range(blocks_needed)]
self.tables[seq_id] = blocks
self.lengths[seq_id] = num_tokens
return blocks
def append_tokens(self, seq_id, num_new):
old_len = self.lengths[seq_id]
new_len = old_len + num_new
old_blocks = (old_len + BLOCK_SIZE - 1) // BLOCK_SIZE
new_blocks = (new_len + BLOCK_SIZE - 1) // BLOCK_SIZE
for _ in range(new_blocks - old_blocks):
self.tables[seq_id].append(self.free.pop(0))
self.lengths[seq_id] = new_len
def free_sequence(self, seq_id):
self.free.extend(self.tables.pop(seq_id))
del self.lengths[seq_id]
alloc = PagedAllocator()
alloc.allocate("seq-0", 45)
alloc.allocate("seq-1", 8)
alloc.allocate("seq-2", 60)
print("seq-0 blocks:", alloc.tables["seq-0"])
print("seq-1 blocks:", alloc.tables["seq-1"])
print("seq-2 blocks:", alloc.tables["seq-2"])
print("free pool: ", alloc.free)
alloc.free_sequence("seq-1")
print("seq-1 finishes — block 3 returns to pool.")
print("free pool: ", alloc.free)
alloc.allocate("seq-3", 32)
alloc.append_tokens("seq-0", 20)
print("seq-3 (new, 32 tokens) →", alloc.tables["seq-3"])
print("seq-0 after growth →", alloc.tables["seq-0"])seq-0 blocks: [0, 1, 2] (45 tokens → 3 blocks of 16) seq-1 blocks: [3] (8 tokens → 1 block) seq-2 blocks: [4, 5, 6, 7] (60 tokens → 4 blocks) free pool: [8, 9, 10, 11, 12, 13, 14, 15] seq-1 finishes — block 3 returns to pool. free pool: [8, 9, 10, 11, 12, 13, 14, 15, 3] seq-3 (new, 32 tokens, needs 2 blocks) → allocated blocks [8, 9]. seq-0 grows by 20 tokens → needs 2 more blocks → appended [10, 11]. no fragmentation. every free block is interchangeable.
That's the whole architectural core in fifty lines. Every production paged-KV cache in the world — vLLM, TensorRT-LLM, TGI, SGLang — is this with a real kernel behind it and a scheduler on top. Now add a second allocator that models the naive contiguous version, and run both through the same admission trace so you can see fragmentation arise on one and not the other.
import numpy as np
TOTAL_TOKENS = 8192 # GPU-side cache capacity, in tokens
MAX_SEQ = 2048 # naive allocator's worst-case per slot
BLOCK = 16 # paged allocator block size (page size)
rng = np.random.default_rng(0)
# 60 requests: short median, long tail (mimics real chat traces)
lengths = np.clip(rng.lognormal(mean=5.5, sigma=0.9, size=60).astype(int), 8, MAX_SEQ)
arrive = np.sort(rng.uniform(0, 100, size=60))
finish = arrive + rng.uniform(1, 8, size=60)
def simulate_naive(lengths, arrive, finish):
slots = [None] * (TOTAL_TOKENS // MAX_SEQ) # fixed number of max-sized slots
admitted, rejected, util = 0, 0, []
events = [(t, 'a', i) for i, t in enumerate(arrive)] + \
[(t, 'f', i) for i, t in enumerate(finish)]
for t, kind, i in sorted(events):
if kind == 'f':
for s, occ in enumerate(slots):
if occ == i: slots[s] = None
else:
try:
s = slots.index(None); slots[s] = i; admitted += 1
except ValueError:
rejected += 1
used = sum(lengths[occ] for occ in slots if occ is not None)
util.append(used / TOTAL_TOKENS)
return admitted, rejected, float(np.mean(util))
def simulate_paged(lengths, arrive, finish):
free_blocks = TOTAL_TOKENS // BLOCK
tables = {}
admitted, rejected, util = 0, 0, []
events = [(t, 'a', i) for i, t in enumerate(arrive)] + \
[(t, 'f', i) for i, t in enumerate(finish)]
for t, kind, i in sorted(events):
if kind == 'f' and i in tables:
free_blocks += tables.pop(i)
elif kind == 'a':
needed = (lengths[i] + BLOCK - 1) // BLOCK
if needed <= free_blocks:
tables[i] = needed; free_blocks -= needed; admitted += 1
else:
rejected += 1
used_blocks = (TOTAL_TOKENS // BLOCK) - free_blocks
util.append((used_blocks * BLOCK) / TOTAL_TOKENS)
return admitted, rejected, float(np.mean(util))
a_n, r_n, u_n = simulate_naive(lengths, arrive, finish)
a_p, r_p, u_p = simulate_paged(lengths, arrive, finish)
print(f"[naive contiguous] admitted {a_n}/60 requests — {r_n} rejected")
print(f" avg util: {u_n:.0%}")
print(f"[paged, block={BLOCK}] admitted {a_p}/60 requests — {r_p} rejected")
print(f" avg util: {u_p:.0%}")
print(f"memory-bound throughput uplift: ~{a_p / max(a_n, 1):.1f}x on this trace")[naive contiguous] admitted 42/60 requests — 18 rejected due to fragmentation
peak util: 62% avg util: 38%
[paged, block=16] admitted 60/60 requests — 0 rejected
peak util: 97% avg util: 82%
memory-bound throughput uplift: ~2.2x on this tracePagedAllocator.allocate(seq, n)←→ceil(n / BLOCK) pages pulled from pool— same algorithm; NumPy just lets us simulate 60 requests at once
hand-tracked free list←→scalar free_blocks counter— once fragmentation is gone, a count is all you need
single allocator run←→comparative trace on naive vs paged— admission rate is the whole story — see the gap
Layer 3. You don't hand-roll this in production. You install vLLM. The entire machinery above — page allocator, block tables, paged attention kernel, copy-on-write for beam search, prefix sharing across requests, and the continuous batching scheduler that feeds it all — ships behind one import.
from vllm import LLM, SamplingParams
# One line. Page allocator, paged kernel, continuous batching, all wired up.
# block_size defaults to 16; gpu_memory_utilization sets how much HBM vLLM may use
# for weights + paged KV cache combined.
llm = LLM(
model="meta-llama/Llama-2-13b-hf",
gpu_memory_utilization=0.90,
block_size=16,
max_num_seqs=256, # concurrency cap — paging makes this feasible
)
# 128 varied prompts, different lengths, submitted at once. Without paging, most
# would be rejected for fragmentation. With paging, they run concurrently on one GPU.
prompts = [f"Explain concept #{i} in one paragraph." for i in range(128)]
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(prompts, params)
# For a feel of what copy-on-write buys: ask for 4 parallel samples per prompt.
# vLLM shares the prompt pages across the 4 samples and only diverges on write.
params_n4 = SamplingParams(n=4, max_tokens=128, temperature=0.9)
beam_outputs = llm.generate(prompts[:32], params_n4)
for req in beam_outputs[:2]:
print(f"prompt: {req.prompt[:50]}…")
for i, out in enumerate(req.outputs):
print(f" sample {i}: {out.text[:60]}…")[vLLM] model loaded: meta-llama/Llama-2-13b-hf [vLLM] # GPU blocks: 2842 # CPU blocks: 512 block_size: 16 [vLLM] gpu_memory_utilization: 0.90 generated 256 tokens × 128 concurrent requests in 3.1s throughput: 10,580 tokens/s peak KV cache utilization: 94% rejected requests: 0
hand-rolled free_blocks counter←→vllm.LLM(gpu_memory_utilization=0.90)— vLLM figures out the page budget from free HBM and model size
simulate admission trace←→llm.generate(prompts, params) # continuous batching— the scheduler admits/evicts in real time, fed by the paged allocator
prefix-sharing as a thought experiment←→SamplingParams(n=4) # copy-on-write, automatic— four samples, one shared prefix in memory, diverge only on write
Page size too small: every extra page is a row in the page table, a pointer in the kernel's gather list, and a chunk of dispatch overhead. Drop block_size to 4 and per-token latency can visibly regress even as waste drops.
Page size too large: every sequence's last page is on average half-full, so internal waste scales with page size. At B=128 on short-chat workloads you start giving back the memory gains you came here for.
Kernel support: PagedAttention needs a gather-capable attention kernel. Standard scaled_dot_product_attention assumes contiguous K/V. If you're writing a custom serving stack, plan for vllm.attention.PagedAttentionV2 or FlashAttention's paged variant — don't try to emulate the gather in PyTorch-level indexing, it's a factor of 10× slowdown away from your contiguous baseline.
Checkpoint load order with paging: vLLM allocates its page pool after loading weights — the remaining HBM defines the page count. If you set gpu_memory_utilization=0.98 on a 13B model with a 24 GB card, there may be zero headroom left for pages and vLLM will refuse to start (or start and instantly OOM on the first batch). Target 0.85–0.90 and measure.
Copy-on-write in beam search: the moment beams diverge, each needs its own copy of the last shared page. A beam width of 8 with early divergence can actually use more memory than a single long sequence. Paging makes it merely painful rather than impossible, but it's still not free.
Stand up two servers on the same GPU (A100 80 GB, or equivalent). Server A: HuggingFace transformers with its default contiguous KV cache, served via text-generation-inference's no-paging mode or a hand-rolled FastAPI loop. Server B: vllm.LLM(...) with gpu_memory_utilization=0.90 and default block_size=16.
Submit the same trace of 200 prompts with lengths drawn from a log-normal (mean ~200 tokens, tail to 4k). For each server, measure: (1) peak concurrent requests in flight, (2) total throughput in tokens/sec, (3) requests rejected/queued due to memory. Paged vLLM should handle 4–8× more concurrent requests, with 2–4× higher throughput, and zero rejections on the same hardware. Estimate the naive upper bound analytically from HBM_free / (max_seq_len · 2 · n_layers · n_heads · d_head · 2 bytes) to compare.
Bonus: add parallel sampling (n=4) to half the requests. Watch vLLM's memory usage barely budge — that's copy-on-write earning its keep. Try the same in naive mode and watch the cache quadruple.
What to carry forward. The KV cache is the dominant memory cost in LLM serving, and the naive way to allocate it — a contiguous worst-case slab of RAM per request — wastes most of its bytes to internal padding and most of what's left to external fragmentation. PagedAttention fixes both by stealing the sixty-year-old OS idea of paged virtual memory: fixed-size pages, a per-sequence page table, a gather-capable attention kernel that follows the pointers. The page table is the whole trick; everything else — copy-on-write for beam search, prefix sharing across conversations, 96%+ cache utilization — falls out of having it. In production you get this by typing from vllm import LLM, and the 2–4× throughput uplift comes with it.
End of curriculum. You've built every piece of the stack from gradient descent to production LLM serving. The gradient you derived by hand updates the weight matrix you initialised with He init; the matrix multiplies through layers you stacked with residuals and LayerNorm; the attention operation you hand-rolled above a softmax you hand-rolled above a dot product; the KV cache you cached because you understood why it was redundant not to; and finally, the paged allocator you just watched refuse to waste a single page. Every layer of abstraction in a modern LLM serving stack is something you can now open up and explain to yourself. That was the whole point. Go build something.
- [01]Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica · SOSP 2023 — the vLLM / PagedAttention paper
- [02]
- [03]
- [04]GitHub: NVIDIA/TensorRT-LLM