GPT Data Loader

Streaming tokens into the model efficiently.

Medium
~15 min read
lesson 4 of 10

Picture a cafeteria. The kitchen, somewhere in the back, has made an absurd amount of food — more than anyone will eat in a lifetime. The diner at the table, famished and impatient, can eat exactly one plate at a time. Between the two is a line: servers moving trays from the kitchen out to the table, at roughly the speed the diner chews. If the line keeps up, the table never stops eating. If the line lags, the diner stares at an empty tray while the GPU — sorry, while the diner — quietly bills you for nothing.

That line is a data loader, and this whole lesson is about not starving the table. The kitchen has the food: GPT-3 ate roughly 300 billion tokens, Llama 3 ate 15 trillion, and even the “small” fine-tune you're about to run on a single GPU walks in the door with tens of billions. A plain Python list at that scale would need half a terabyte of RAM and ten minutes to pickle. The kitchen is too big to bring to the table in one trip. So you don't. You build a line.

Every serious LLM codebase — nanoGPT, GPT-NeoX, Megatron, the actual OpenAI training infrastructure — converges on the same line. Tokenize the corpus once, offline. Write the token ids out as raw binary shards — flat trays of food the kitchen can hand over without ceremony. At train time the loader memory-maps those shards and ladles random slices onto the table. No database, no JSON, no pickle, no custom serializer. Just numpy.memmap and some arithmetic, running fast enough that the diner never stops chewing.

Get this wrong and your $30k/hour cluster becomes a $30k/hour file-system benchmark — a very expensive cafeteria with an empty table. Get it right and you barely think about it again, which is how you know the line is doing its job.

  ┌─────────────────┐     tokenize     ┌─────────────────────┐
  │  raw corpus     │   ───────────▶   │  shard_0000.bin     │
  │  (500 GB text)  │   tiktoken /     │  shard_0001.bin     │   100 M tokens
  │  books, code,   │   sentencepiece  │  shard_0002.bin     │   each, uint16
  │  CC, arxiv…     │   BPE encoder    │        ⋮            │   ~200 MB file
  └─────────────────┘                  │  shard_9999.bin     │
                                       └──────────┬──────────┘
                                                  │ np.memmap(...)
                                                  ▼
                                       ┌─────────────────────┐
                                       │  OS virtual memory  │   no load cost
                                       │  page cache / mmap  │   OS handles I/O
                                       └──────────┬──────────┘
                                                  │ idx = randint()
                                                  │ x = data[idx:idx+T]
                                                  │ y = data[idx+1:idx+T+1]
                                                  ▼
                                       ┌─────────────────────┐
                                       │  DataLoader batch   │   (B, T) int64
                                       │  B chunks stacked   │   ───▶ GPU
                                       └─────────────────────┘
the LLM data pipeline — offline once, then online forever

Everything interesting happens once, up front, in the kitchen: the raw corpus gets run through a BPE tokenizer (tokens are just integer ids) and the results get dumped into binary shards. After that, training is just reading random trays out of a big int array. That's it. That's the whole thing.
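That offline step fits in a few lines — a sketch with invented token ids standing in for real tokenizer output (the shard filename is made up):

```python
import os
import numpy as np

# pretend these ids came out of a BPE tokenizer (GPT-2 ids fit in uint16)
ids = [842, 103, 1127, 577, 11, 29, 4982, 13, 2001, 50256]

# offline, once: raw binary — no header, no framing, just tokens
np.array(ids, dtype=np.uint16).tofile("shard_0000.bin")
print(os.path.getsize("shard_0000.bin"))            # 2 bytes/token → 20

# at train time: memmap it back — instant, nothing is read until indexed
data = np.memmap("shard_0000.bin", dtype=np.uint16, mode="r")
print(len(data), int(data[0]), int(data[-1]))       # 10 842 50256
```

The write and the read agree on exactly one thing — the dtype — and that's the entire file format.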

sharded data loader — round-robin with num_workers
[interactive demo: 8 shards × 128 MB; a seeded shuffle fixes the epoch's shard read order (seed 1: #2 → #1 → #6 → #7 → #4 → #3 → #0 → #5) and an epoch pointer walks through it, reporting throughput]

A 100 GB corpus pre-tokenized into 100 M-token shards. Each shard is about 200 MB — comfortably under the 2 GB file-size ceiling of older filesystems and 32-bit tools, small enough to live on cheap object storage, small enough to download in a minute. The cursor iterating through them is the training loop: it's not streaming in the conventional sense; the shards sit on disk and the OS pages in only the bytes the model actually touches. The line never carries more food than the next plate needs.

Why shard at all instead of one giant vat? Three reasons. One, portability — every filesystem can handle a 200 MB tray, not all can handle 100 GB. Two, parallelism — eight servers can each open their own shard without elbowing each other for the same file handle. Three, cheap shuffle — you pick a random shard, then a random offset inside it, and you've drawn a uniform sample from the entire kitchen without ever carrying the whole kitchen out to the table.
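The cheap shuffle is just two random draws — a sketch assuming equal-sized shards, so uniform-over-shards composed with uniform-over-offsets is uniform over the whole corpus (shard names and counts here are invented):

```python
import random

def sample_start(shards, tokens_per_shard, block_size, rng):
    """Pick (shard, offset) — uniform over the corpus iff shards are equal-sized."""
    shard  = rng.randrange(len(shards))                        # which tray
    offset = rng.randrange(tokens_per_shard - block_size - 1)  # where on it
    return shards[shard], offset

rng = random.Random(0)
shards = [f"data_{i:04d}.bin" for i in range(500)]   # hypothetical shard list
print(sample_start(shards, tokens_per_shard=100_000_000, block_size=1024, rng=rng))
```

With unequal shards you'd weight the first draw by shard length; with equal shards the two-level draw is exactly a uniform sample over 50 billion tokens, computed in two integer operations.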

Memory-mapped file (personified)
I am not loaded. I am not streamed. I am a promise. When your code writes data[4_837_291], the OS walks the page table, notices the 4 KB page containing that byte isn't resident, fetches it from disk, and hands you the integer — all in a few microseconds. You think you have a giant tray in memory. You have a file descriptor and a pointer. That is the entire trick.

Now the actual plating. GPT is a next-token predictor. Every position in a sequence is a training example: given tokens up through position i, predict token i+1. So the input and target for a block of length T are two overlapping slices of the same tray, offset by one:

input/target pairing — just a shift
given a token stream   tokens[0], tokens[1], tokens[2], ..., tokens[N-1]

sample a random index  i ∈ [0, N - T - 1]

input  x  =  tokens[i     : i + T    ]       ← length T
target y  =  tokens[i + 1 : i + T + 1]       ← length T, shifted by 1

loss = cross_entropy( model(x),  y )         ← one CE per position

That's the entire training signal. No labels, no annotations, no human in the loop. The self-supervised premise of language modeling is that every token in the corpus is its own label — the “correct answer” for position i is whatever literally came next. Free supervision on the entire internet. The kitchen writes its own answer key.
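Concretely, on a toy stream (ids invented for illustration):

```python
tokens = [10, 20, 30, 40, 50, 60, 70]     # toy token stream
T, i = 4, 1                               # block size 4, sampled start index 1

x = tokens[i     : i + T    ]             # [20, 30, 40, 50]
y = tokens[i + 1 : i + T + 1]             # [30, 40, 50, 60]

# position j of x is trained to emit position j of y — the next token
for j in range(T):
    assert y[j] == tokens[i + 1 + j]
print(x, y)
```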

sliding context window — next-token targets
[interactive demo: a 200-token document, window = 64, 5 training samples at stride 32; each sample pairs context = doc[i : i+64] with next-token target = doc[i+64] — e.g. the first sample predicts doc[64] = "ness" from doc[0 : 64], then the window shifts by 32 for the next sample]

Slide the window. The top row is x; the bottom row is y. y is just x shifted one position to the right — every element of x is looking at the element of y immediately next door and asking “did I predict you?” With a context window of 1024 tokens, one plate gives the model 1024 independent next-token-prediction problems — the per-position losses average over all of them. That density is why transformers train efficiently: you're extracting T loss signals per forward pass, not one. Every plate is a buffet.

Shard (personified)
I am a flat array of uint16s. Two bytes per token, 100 million tokens, 200 MB on disk. I have no structure beyond “token, token, token” — no sentence boundaries, no document boundaries the model can see, just a stream. The separator token <|endoftext|> lives inline with everything else; the model has to learn what it means. I am boring and I am fast, and those two properties are why the table never stalls.
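The storage arithmetic checks out in one line:

```python
tokens_per_shard = 100_000_000
bytes_per_token  = 2                                    # uint16
print(tokens_per_shard * bytes_per_token / 1e6, "MB")   # 200.0 MB
```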

Three implementations, each shorter and faster than the last. Pure Python with line-by-line file reads — one overloaded server barely keeping the line moving. NumPy with np.memmap — what nanoGPT actually does. PyTorch tensors wrapped in a Dataset so the standard DataLoader can run eight servers at once and drop plates directly onto GPU memory — what you'd plug into a real training loop.

layer 1 — pure python, the slow way · loader_scratch.py
python
import random
import time

# Imagine tokens.txt — one integer token id per line, 100 M lines.
# This is the "naive Python list" approach: don't do this.

def load_all_tokens(path):
    with open(path) as f:
        return [int(line) for line in f]           # 1.5 GB of Python ints

def get_batch(tokens, block_size, batch_size):
    xs, ys = [], []
    for _ in range(batch_size):
        i = random.randint(0, len(tokens) - block_size - 1)
        xs.append(tokens[i     : i + block_size])
        ys.append(tokens[i + 1 : i + block_size + 1])
    return xs, ys

tokens = load_all_tokens("tokens.txt")              # ~90 s just to load
t0 = time.time()                                    # time the batch alone
x, y = get_batch(tokens, block_size=1024, batch_size=32)
print(f"batch 0: x[:8]={x[0][:8]}")
print(f"batch 0: y[:8]={y[0][:8]}")
print(f"(took {time.time()-t0:.1f}s per batch — GPU will starve)")
stdout
batch 0: x[:8]=[842, 103, 1127, 577, 11, 29, 4982, 13]
batch 0: y[:8]=[103, 1127, 577, 11, 29, 4982, 13, 2001]
(took 3.4s per batch — GPU will starve)

Everything about that is wrong. Loading 100 M ints into a Python list takes minutes and burns 1.5 GB of RAM on the interpreter's boxed-int overhead. Slicing a Python list copies. The file format is text, so every read re-parses digits. One server, dragging the whole kitchen to the table on every trip. But the shape of the loop — sample index, slice input, slice target — is already right, and it's the thing we're about to make 1000× faster.

layer 2 — numpy memmap, the real thing · loader_numpy.py
python
import numpy as np

# shards were written once, offline, as raw binary:
#   ids = tokenizer.encode(text)                    # list[int]
#   np.array(ids, dtype=np.uint16).tofile("data_0003.bin")

def open_shard(path):
    return np.memmap(path, dtype=np.uint16, mode="r")   # instant, no load

def get_batch(data, block_size, batch_size, rng):
    # one vectorised call draws batch_size random starts
    ix = rng.integers(0, len(data) - block_size - 1, size=batch_size)
    x  = np.stack([data[i     : i + block_size    ] for i in ix])
    y  = np.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x.astype(np.int64), y.astype(np.int64)       # int64 for embeddings

rng  = np.random.default_rng(0)
data = open_shard("data_0003.bin")
print(f"shard: data_0003.bin  length={len(data)}  dtype={data.dtype}")

x, y = get_batch(data, block_size=1024, batch_size=32, rng=rng)
print(f"x.shape={x.shape}  y.shape={y.shape}")
print(f"y == x shifted by 1:  {np.array_equal(y[:, :-1], x[:, 1:])}")
stdout
shard: data_0003.bin  length=100000000  dtype=uint16
x.shape=(32, 1024)  y.shape=(32, 1024)
y == x shifted by 1:  True
(0.3 ms per batch — disk barely touched)
pure python → numpy memmap
tokens = [int(l) for l in f]←→data = np.memmap(path, dtype=uint16)

no load — OS pages in bytes on first access

xs.append(tokens[i:i+T])←→np.stack([data[i:i+T] for i in ix])

slice is a view into memory, not a copy

for _ in range(B): randint(...)←→rng.integers(0, N-T-1, size=B)

one call draws B starts — trivially vectorised

Wrap that same memmap in a torch.utils.data.Dataset and the standard PyTorch DataLoader will parallelise it across workers, pin memory, and prefetch plates onto the GPU before the table has finished the current one. The Dataset itself is ten lines. All the heavy lifting already happened offline, in the tokenizer.

layer 3 — pytorch Dataset · loader_pytorch.py
python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ShardedTokenDataset(Dataset):
    """Random-access next-token dataset over a single memmapped shard."""
    def __init__(self, path, block_size, length=10_000):
        self.data       = np.memmap(path, dtype=np.uint16, mode="r")
        self.block_size = block_size
        self.length     = length          # "virtual" epoch size

    def __len__(self):
        return self.length

    def __getitem__(self, _):
        T = self.block_size
        # draw via torch's RNG: the DataLoader re-seeds it per worker, whereas
        # forked workers would all inherit identical numpy random state
        i  = torch.randint(len(self.data) - T - 1, (1,)).item()
        x  = torch.from_numpy(self.data[i     : i + T    ].astype(np.int64))
        y  = torch.from_numpy(self.data[i + 1 : i + T + 1].astype(np.int64))
        return x, y

ds = ShardedTokenDataset("data_0003.bin", block_size=1024, length=10_000)
dl = DataLoader(ds, batch_size=16, num_workers=4, pin_memory=True)

for step, (x, y) in enumerate(dl):
    print(f"step {step}  x.shape={x.shape}  y.shape={y.shape}")
    if step == 1: break
stdout
step 0  x.shape=torch.Size([16, 1024])  y.shape=torch.Size([16, 1024])
step 1  x.shape=torch.Size([16, 1024])  y.shape=torch.Size([16, 1024])
GPU util: 94%  (was 3% with naive loader)
numpy → pytorch Dataset
data = np.memmap(path, …)←→self.data = np.memmap(path, …) (in __init__)

each worker opens its own memmap — mmap is process-safe

get_batch(data, T, B, rng)←→__getitem__ returns one (x, y)

DataLoader handles batching + multi-worker + pin_memory

x.astype(np.int64)←→torch.from_numpy(...).long()

embedding lookup wants int64; uint16 is just storage

Gotchas

uint16 vs uint32: GPT-2's vocabulary is 50,257 tokens — fits in a uint16 (max 65,535). Llama's is 128,000 — needs uint32. Save the wrong dtype and your shards are either twice the size they need to be, or silently truncating token ids mod 65,536. There is no error. The kitchen just starts plating garbage.
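The wraparound is easy to watch — these ids are illustrative (50,256 is GPT-2's <|endoftext|>; the larger two stand in for Llama-range ids):

```python
import numpy as np

ids = np.array([50256, 128000, 70000])    # one GPT-2 id, two Llama-sized ids
bad = ids.astype(np.uint16)               # astype wraps mod 65536 — no warning
print(bad.tolist())                       # [50256, 62464, 4464]
```

62,464 and 4,464 are perfectly valid-looking token ids, which is exactly why nothing downstream complains.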

Endianness across machines: np.memmap reads the host byte order by default. Tokenize on x86, train on a weird ARM cluster, the bytes swap and you get nonsense ids. Write shards with an explicit dtype ('<u2' for little-endian uint16) so there's no ambiguity.
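A sketch of the fix — one character of dtype string on both sides (filename illustrative):

```python
import numpy as np

ids = [842, 103, 50256]
np.array(ids, dtype="<u2").tofile("shard_le.bin")    # '<u2' = little-endian uint16

# the reader states the byte order too — identical result on any host
data = np.memmap("shard_le.bin", dtype="<u2", mode="r")
print(data.tolist())                                 # [842, 103, 50256]
```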

Skipping num_workers: covered above. If GPU utilization is below 80%, the line is the bottleneck, full stop. The table is starving and you're paying per second. Profile with nvidia-smi dmon before you profile the model.

mmap address space on 32-bit systems: a 32-bit process can only address ~4 GB of virtual memory total, so np.memmap on a bigger shard fails. In 2026 this basically means “don't train LLMs on a Raspberry Pi 3,” but if you're on a 32-bit ARM edge device and wondering why mmap raises ENOMEM, that's why.

Ship a real data loader end-to-end

Download Tiny Shakespeare (about 1 MB of text, 1 M characters). Encode it with tiktoken.get_encoding("gpt2") — you'll get roughly 300k BPE tokens. Save them to shakespeare.bin as np.uint16. That's your kitchen.

Build a PyTorch Dataset that memmaps the file and returns random 256-token (x, y) plates. Wrap it in a DataLoader with batch_size=32, num_workers=2 — that's a line with two servers. Pull one batch and assert (y[:, :-1] == x[:, 1:]).all() — every target must be the input shifted by one, on every row. If that assert passes, the line is calibrated.

Bonus: decode x[0] back to text with enc.decode(x[0].tolist()) and read it. You should see a random chunk of Shakespeare. Congratulations — the table has something to eat.

What to carry forward. LLM data loading is a solved problem and the solution is boring on purpose. Tokenize once, offline. Write shards as raw binary. Memmap them at train time so the OS handles paging. Sample random indices — no epochs, no reshuffle pass; shuffle is a new random offset. Pair input with target by slicing the same tray twice, offset by one. Let the DataLoader run a crew of workers so the line never lags behind the table. That's it. That's the whole cafeteria.

Next up — GPT Dataset. The loader you just built reads from a single shard. Real training rotates through thousands of shards, sometimes weighting them by quality (code gets 3×, CommonCrawl gets 1×, books get 5×) and sometimes scheduling them by phase of training. Next lesson we turn the one-shard Dataset into a curriculum — a weighted, ordered, multi-shard kitchen that matches what nanoGPT and GPT-NeoX actually use in production. The line gets smarter about which tray to grab next.
