A working GPT, built lesson by lesson
BPE from scratch — merges, vocabs, edge cases.
Train a BPE vocab on real text.
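The heart of these BPE lessons is a small loop: count adjacent token pairs, merge the most frequent pair into a new token, repeat. A minimal byte-level sketch (illustrative, not the course's exact code):

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent token-id pairs in a sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules from the raw bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new token id
    for n in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + n                    # the byte vocab occupies 0..255
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids
```

On the classic toy string `"aaabdaaabac"`, two merges first fuse the `aa` pair, then the resulting `aa`+`a` pair, exactly as in the textbook BPE walkthrough.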
Whitespace, Unicode, emoji, and the quirks of tiktoken.
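Many of those quirks come from the pre-tokenizer: before any merges run, text is split by a regex that decides where leading spaces, contractions, and symbol runs begin. A simplified sketch of the GPT-2-style pattern, using ASCII classes with the stdlib `re` module (the real pattern used by tiktoken's `gpt2` encoding needs the third-party `regex` module for Unicode categories like `\p{L}`):

```python
import re

# Simplified GPT-2-style pre-tokenizer. ASCII classes stand in for the
# real pattern's Unicode categories; emoji fall into the symbol branch.
PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[A-Za-z]+"              # a word, optionally with its leading space
    r"| ?[0-9]+"                 # a run of digits
    r"| ?[^\sA-Za-z0-9]+"        # punctuation / symbol runs
    r"|\s+(?!\S)"                # trailing whitespace
    r"|\s+"                      # any other whitespace
)

def pretokenize(text):
    """Split text into the chunks BPE merges are applied within."""
    return PAT.findall(text)
```

Note how the leading space sticks to the following word, which is why `" world"` and `"world"` tokenize differently.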
Streaming tokens into the model efficiently.
Context windows, next-token targets, packing.
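The data-loading lessons boil down to one invariant: the target sequence is the input shifted left by one, so the model at position t predicts the token at position t+1. A minimal batch sampler over one long token stream (an illustrative sketch; names and signature are mine, not the course's):

```python
import random

def get_batch(tokens, block_size, batch_size, seed=None):
    """Sample (x, y) training pairs from one long token stream.

    y is x shifted left by one position: the model at position t
    is trained to predict the token at position t + 1.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randrange(len(tokens) - block_size)
        xs.append(tokens[i : i + block_size])
        ys.append(tokens[i + 1 : i + 1 + block_size])
    return xs, ys
```

Packing many short documents into fixed-length blocks like this keeps every position in every batch contributing a training signal.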
Assemble the full GPT architecture.
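The core operation inside every block of that architecture is causal self-attention. A single-head NumPy sketch (a real GPT block adds multiple heads, an output projection, an MLP, residuals, and LayerNorm):

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (T, d) sequence."""
    T, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)       # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the past
    return weights @ v
```

The mask is what makes this a language model: position 0 can only attend to itself, so its output is exactly its own value vector.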
AdamW, warmup, cosine decay — the real recipe.
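The schedule part of that recipe fits in a few lines: ramp the learning rate linearly during warmup, then decay it along a half cosine to a floor. A sketch with illustrative hyperparameter values (the course's actual settings may differ):

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup           # linear warmup
    if step >= total:
        return min_lr                                 # hold the floor
    progress = (step - warmup) / (total - warmup)     # 0 -> 1 over the decay
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

In a training loop you would set this value on the optimizer's param groups each step, rather than relying on a fixed learning rate.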
Sampling: temperature, top-k, nucleus.
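All three sampling knobs compose naturally: temperature rescales the logits, top-k keeps only the k most likely tokens, and nucleus (top-p) keeps the smallest set of tokens whose probability mass reaches p. A pure-Python sketch of one sampling step:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """Sample a token id from raw logits with temperature, top-k,
    and nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]      # stable softmax
    total = sum(probs)
    probs = [p / total for p in probs]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]                      # keep k most likely
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:                            # smallest set covering p mass
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    kept_total = sum(probs[i] for i in order)
    rng = random.Random(seed)
    r = rng.random() * kept_total                  # draw within kept mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]
```

With `top_k=1` this degenerates to greedy decoding; with high temperature and no filtering it approaches uniform sampling.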
KV caching: the single trick behind fast inference.
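That trick is the KV cache: instead of recomputing keys and values for the whole prefix at every generation step, append one new row per step and attend over the cache. A NumPy sketch showing the cached path agrees with a full recompute:

```python
import numpy as np

def attend(q, K, V):
    """One query row attending over all cached keys/values."""
    scores = (q @ K.T) / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Toy single-head setup (dimensions are illustrative).
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
xs = rng.normal(size=(6, d))                 # one embedding per step

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_out = []
for x in xs:
    K_cache = np.vstack([K_cache, x @ w_k])  # append this step's key...
    V_cache = np.vstack([V_cache, x @ w_v])  # ...and value, once
    cached_out.append(attend(x @ w_q, K_cache, V_cache))
```

Each step now costs O(T) attention work instead of O(T^2) recomputation, which is why autoregressive decoding is feasible at all.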
Llama-style attention (grouped-query attention): memory savings without accuracy loss.
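Reading "Llama-style attention" as grouped-query attention (GQA): several query heads share each key/value head, so the KV cache shrinks by the grouping factor. A shape-level NumPy sketch (sizes are illustrative, in the spirit of Llama-family configs):

```python
import numpy as np

def repeat_kv(kv, n_rep):
    """Expand (n_kv_heads, T, head_dim) keys/values so each kv head
    serves n_rep consecutive query heads, as in grouped-query attention."""
    return np.repeat(kv, n_rep, axis=0)

# 32 query heads sharing 8 kv heads: a 4:1 grouping.
n_heads, n_kv_heads, T, head_dim = 32, 8, 16, 64
k_cache = np.zeros((n_kv_heads, T, head_dim))          # what is stored
k_expanded = repeat_kv(k_cache, n_heads // n_kv_heads)  # what attention sees
```

The cache stores only `n_kv_heads` heads, so at this grouping it is a quarter the size of a full multi-head cache, while the attention math still runs over all 32 query heads.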