Code GPT
Assemble the full GPT architecture.
Every other lesson in this section carved one disc in a stack of vinyl. The tokenizer pressed strings into integer grooves. Token and positional embeddings etched identity and position onto the disc. Self-attention let tokens whisper across bars. The feed-forward block let each token hum alone. LayerNorm kept the mix level; residuals kept the amplifier alive. None of those records plays a song on its own. They are a stack of pressings sitting on a shelf.
This lesson is the jukebox. Not a new trick — an assembly. We take the pressings you already have, load them into the cabinet in the right order, wire the needle to the output, and watch a silent stack of vinyl turn into a machine that plays a new song every time you drop a coin in. By the end of this page the cabinet you've built is weight-compatible with the exact vinyl OpenAI pressed in 2019. You will drop the needle on GPT-2.
By the end you'll have (a) a top-down picture of every tensor a prompt travels through, (b) a NumPy forward pass through a two-layer toy GPT, (c) a complete PyTorch GPT class in the style of Karpathy's nanoGPT, and (d) loader code that pulls GPT-2's pretrained weights off HuggingFace and spins them on the same turntable. Same twelve-block cabinet, just bigger numbers on the record label.
Before any code, stare at the whole jukebox. A GPT has exactly four parts: an embedding layer, a stack of identical transformer blocks, a final layer norm, and a linear head that projects back to the vocabulary. That's the entire cabinet. Everything people call “a GPT” — GPT-2, GPT-3, GPT-4, Llama, Mistral — differs only in widths, depths, and a few small surgeries to the mechanism.
token ids (B, T)
        │
        ▼
┌───────────────┐       ┌──────────────────┐
│ token_emb W_te│──────▶│    + pos_emb     │   (B, T, d_model)
└───────────────┘       └──────────────────┘
        │
        ▼
┌────────────────────────────────────────────┐
│        TRANSFORMER BLOCK × N_LAYER         │
│  ┌──────────────────────────────────────┐  │
│  │ x += MultiHeadAttention(LN1(x))      │  │  (B, T, d_model)
│  │ x += FeedForward (LN2(x))            │  │
│  └──────────────────────────────────────┘  │
└────────────────────────────────────────────┘
        │
        ▼
    ┌───────┐
    │ LN_f  │   (B, T, d_model)
    └───────┘
        │
        ▼
  ┌─────────────┐
  │   lm_head   │ ← tied to W_te   (B, T, vocab)
  └─────────────┘
        │
        ▼
     logits
That's the whole jukebox. The embedding layer looks up a vector per token and adds a vector per position — the first groove on the record. The transformer block — attention, feed-forward, each wrapped in LayerNorm and a residual — runs N times, one pass of the needle per layer. The final LayerNorm cleans up the signal. The language-model head projects d_model back to vocab_size logits. Run softmax along the vocab axis and you have a probability over the next token. Train by minimizing cross-entropy between those logits and the real next token. Generate by sampling one and feeding it back in. A song, one note at a time.
idx : (B, T) token indices
tok_emb = W_te[idx]        → (B, T, d)
pos_emb = W_pe[0:T]        → (T, d)  broadcast over batch
x = tok_emb + pos_emb      → (B, T, d)
for l in 1..N:
    x = x + Attn(LN1(x))   → (B, T, d)
    x = x + MLP (LN2(x))   → (B, T, d)
x = LN_f(x)                → (B, T, d)
logits = x @ W_te.T        → (B, T, V)
loss = CE(logits, targets) → scalar

Every shape matters. B is batch size, T is sequence length, d is d_model, V is vocab. The batch and time axes are passive — attention mixes information across T, everything else is pointwise. The d axis is where representation lives. The final matmul x @ W_te.T is the interesting trick: the output projection uses the same matrix as the input embedding, just transposed. One record played through both the input groove and the output needle. That's weight tying, and we'll get to it.
Play with the widget. Click any block to see the exact shape flowing through it, and watch the total parameter count scale as you crank n_layer and d_model. A few reference configs to lodge in memory — think of them as the A-side singles at the top of the chart:
- GPT-2 small: n_layer=12, n_head=12, d_model=768, block_size=1024, vocab=50257 → ~124M params.
- GPT-2 medium: n_layer=24, d_model=1024 → 355M.
- GPT-2 XL: n_layer=48, d_model=1600 → 1.5B.
- GPT-3: n_layer=96, d_model=12288 → 175B. Same cabinet, 1400× the groove density.
There is no magic in the jump from 124M to 175B. Same four parts, wider discs, deeper stack, more hours of tape fed to the pressing plant.
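To make "same four parts, bigger numbers" concrete, here is a back-of-the-envelope parameter count. It's a sketch: biases and LayerNorm gains are ignored, the tied embedding is counted once, and GPT-3's vocab and context differ slightly from GPT-2's.

```python
def approx_params(n_layer, d_model, vocab=50257, block_size=1024):
    # per block: 4*d^2 for attention (Q, K, V, O) + 8*d^2 for the FFN pair
    blocks = 12 * n_layer * d_model ** 2
    emb = (vocab + block_size) * d_model  # token + position tables; head is tied
    return blocks + emb

for name, n_layer, d_model in [("GPT-2 small", 12, 768),
                               ("GPT-2 medium", 24, 1024),
                               ("GPT-2 XL", 48, 1600),
                               ("GPT-3 (approx)", 96, 12288)]:
    print(f"{name}: ~{approx_params(n_layer, d_model) / 1e6:,.0f}M")
```

GPT-2 small lands within rounding of the real 124M, medium near 355M, XL near 1.5B; GPT-3 comes out around 174B despite its different vocabulary and 2048-token context.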
The token embedding tells the model what each word is. I tell it where each word sits on the record. Without me, “the dog bit the man” and “the man bit the dog” are the same unordered bag of vectors — the DJ can't tell which groove comes first. I'm a learned vector per position, added to each token before the transformer drops the needle. Attention is permutation-equivariant; I'm the only reason word order exists.
Now the one piece of model surgery that will surprise you if you've only ever read the architecture diagram. The input embedding matrix W_te has shape (V, d)— one row per vocabulary token. The output projection lm_head has shape (d, V) — one column per vocabulary token. Those are the same numbers, transposed. So: use the same parameters. Literally bind them together; update one and the other moves with it. One record, played both when the needle reads in and when it writes out. This is weight tying.
in: embed(idx) = W_te[idx] (V, d) lookup
out: logits = x @ W_te.T (d, V) projection
→ lm_head.weight is token_emb.weight (Python: one shared tensor)
parameter savings: V · d matrix counted once instead of twice
for GPT-2 small (V=50257, d=768) → ~39M parameters saved

The intuition: the embedding row for token cat is the vector “this is the word cat”. The lm_head column for token cat is the vector “predict cat when you see this”. Those should be the same thing — the notion of cat doesn't change between reading the song and writing it. Empirically, tying weights improves quality for a given parameter budget, and it saves a chunk of memory. Press & Wolf 2017 showed this is a nearly free win; every modern LM does it.
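To see that tying really is one shared tensor — not a copy — here is a quick standalone PyTorch check at toy sizes (illustrative only, separate from the model code below):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 32)
head = nn.Linear(32, 100, bias=False)
head.weight = emb.weight  # bind: both modules now hold the same Parameter

assert head.weight.data_ptr() == emb.weight.data_ptr()  # same underlying storage
with torch.no_grad():
    emb.weight[5] += 1.0    # nudge one embedding row...
assert torch.equal(head.weight[5], emb.weight[5])  # ...the head row moves too
print("tied: one tensor, two views")
```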
Look at where the parameters actually live. Most guides draw the transformer with attention as the star — attention gets the press, the complexity, the papers named after it. But in GPT-2 small, two-thirds of each transformer block's parameters live in the feed-forward network. The FFN's projections have shapes (d, 4d) and (4d, d); that's 8 · d² parameters per block, versus 4 · d² for attention's Q, K, V, O. The embedding, despite being one layer, is a huge slab because V · d is big when the vocab is 50257. The final lm_head would have been another slab — if you didn't tie weights. That's the ~40M you saved by pressing a double-sided record instead of two.
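You can check the split with arithmetic on GPT-2 small's shapes (ignoring biases and LayerNorm gains):

```python
d, V, n_layer = 768, 50257, 12
attn = 4 * d * d * n_layer   # Q, K, V, O projections across all 12 blocks
ffn  = 8 * d * d * n_layer   # (d, 4d) up-projection + (4d, d) down-projection
emb  = V * d                 # token table; the lm_head reuses it via tying

print(f"FFN share of block params: {ffn / (attn + ffn):.0%}")  # 67%
print(f"FFN share of total:        {ffn / (attn + ffn + emb):.0%}")
```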
I'm not really a layer. I'm the embedding record, flipped to its B-side on the way out. You spent V·d parameters teaching me what each token looks like. When it comes time to predict the next token, why would you press a second, independent V·d disc to do the inverse? Same vocabulary, same semantic space. Just transpose me and matmul.
Time to build the whole jukebox. Three passes: a stripped-down NumPy forward-only GPT to prove the shapes match the diagram, a real PyTorch class in ~120 lines that can train and generate, and the loader that pulls GPT-2's pretrained weights from HuggingFace and drops them into your class. Each pass is shorter than the last, and each one plays the same song.
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, qkv_w, qkv_b, proj_w, proj_b, n_head):
    B, T, d = x.shape
    qkv = x @ qkv_w + qkv_b                                   # (B, T, 3d)
    q, k, v = np.split(qkv, 3, axis=-1)
    # split heads: (B, T, d) → (B, n_head, T, d_head)
    def split(z): return z.reshape(B, T, n_head, d // n_head).transpose(0, 2, 1, 3)
    q, k, v = split(q), split(k), split(v)
    att = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d // n_head)  # (B, h, T, T)
    mask = np.triu(np.full((T, T), -np.inf), k=1)             # causal mask
    att = softmax(att + mask, axis=-1)
    out = (att @ v).transpose(0, 2, 1, 3).reshape(B, T, d)
    return out @ proj_w + proj_b

def ffn(x, w1, b1, w2, b2):
    h = np.maximum(0, x @ w1 + b1)  # ReLU (real GPT-2 uses GELU)
    return h @ w2 + b2

def gpt_forward(idx, params, cfg):
    B, T = idx.shape
    d, V, n_layer, n_head = cfg['d'], cfg['V'], cfg['n_layer'], cfg['n_head']
    x = params['wte'][idx] + params['wpe'][:T]  # (B, T, d)
    for l in range(n_layer):
        b = params['blocks'][l]
        x = x + attention(layer_norm(x, b['ln1_g'], b['ln1_b']),
                          b['qkv_w'], b['qkv_b'], b['proj_w'], b['proj_b'], n_head)
        x = x + ffn(layer_norm(x, b['ln2_g'], b['ln2_b']),
                    b['mlp1_w'], b['mlp1_b'], b['mlp2_w'], b['mlp2_b'])
    x = layer_norm(x, params['lnf_g'], params['lnf_b'])
    logits = x @ params['wte'].T  # TIED WEIGHTS: reuse wte
    return logits                 # (B, T, V)

# Tiny random init just to prove shapes work
rng = np.random.default_rng(0)
cfg = dict(d=32, V=100, n_layer=2, n_head=4, T_max=16)

def rnd(*s): return rng.normal(0, 0.02, size=s)
def zero(*s): return np.zeros(s)

params = dict(
    wte=rnd(cfg['V'], cfg['d']),
    wpe=rnd(cfg['T_max'], cfg['d']),
    lnf_g=np.ones(cfg['d']), lnf_b=zero(cfg['d']),
    blocks=[dict(
        ln1_g=np.ones(cfg['d']), ln1_b=zero(cfg['d']),
        qkv_w=rnd(cfg['d'], 3 * cfg['d']), qkv_b=zero(3 * cfg['d']),
        proj_w=rnd(cfg['d'], cfg['d']), proj_b=zero(cfg['d']),
        ln2_g=np.ones(cfg['d']), ln2_b=zero(cfg['d']),
        mlp1_w=rnd(cfg['d'], 4 * cfg['d']), mlp1_b=zero(4 * cfg['d']),
        mlp2_w=rnd(4 * cfg['d'], cfg['d']), mlp2_b=zero(cfg['d']),
    ) for _ in range(cfg['n_layer'])]
)

idx = rng.integers(0, cfg['V'], size=(2, 8))  # batch=2, T=8
logits = gpt_forward(idx, params, cfg)
print("logits shape:", logits.shape)          # -> (2, 8, 100)
print("sum of abs logits:", np.abs(logits).sum().round(2))

Every shape in that file came from the diagram. idx in, logits out, N transformer blocks in between. Swap np.maximum(0, ...) for a real GELU, add dropout, add a loss, hook up gradients, and you have PyTorch's GPT. The turntable spins, but the record is still blank vinyl — random init, no song yet. Which is exactly what comes next.
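The ReLU-to-GELU swap is one function. GPT-2 uses the tanh approximation of GELU; a NumPy drop-in for the ffn above:

```python
import numpy as np

def gelu(x):
    # GPT-2's tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn_gelu(x, w1, b1, w2, b2):
    return gelu(x @ w1 + b1) @ w2 + b2  # replaces np.maximum(0, x @ w1 + b1)

print(gelu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])).round(3))
```

Unlike ReLU, GELU is smooth and slightly negative for small negative inputs, then hugs the identity for large positive ones.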
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    d_model: int = 768
    dropout: float = 0.0

class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg.d_model % cfg.n_head == 0
        self.c_attn = nn.Linear(cfg.d_model, 3 * cfg.d_model, bias=True)  # Q, K, V packed
        self.c_proj = nn.Linear(cfg.d_model, cfg.d_model, bias=True)
        self.attn_drop = nn.Dropout(cfg.dropout)
        self.resid_drop = nn.Dropout(cfg.dropout)
        self.n_head, self.d_model = cfg.n_head, cfg.d_model
        self.register_buffer(
            "mask", torch.tril(torch.ones(cfg.block_size, cfg.block_size))
                         .view(1, 1, cfg.block_size, cfg.block_size)
        )

    def forward(self, x):
        B, T, d = x.shape
        q, k, v = self.c_attn(x).split(d, dim=2)
        # (B, T, d) → (B, n_head, T, d_head)
        q = q.view(B, T, self.n_head, d // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, d // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, d // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, d)
        return self.resid_drop(self.c_proj(y))

class MLP(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.c_fc = nn.Linear(cfg.d_model, 4 * cfg.d_model)
        self.c_proj = nn.Linear(4 * cfg.d_model, cfg.d_model)
        self.drop = nn.Dropout(cfg.dropout)

    def forward(self, x):
        return self.drop(self.c_proj(F.gelu(self.c_fc(x))))

class Block(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.d_model)
        self.mlp = MLP(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-norm residual
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.d_model)
        self.drop = nn.Dropout(cfg.dropout)
        self.blocks = nn.ModuleList(Block(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.d_model)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        # WEIGHT TYING
        self.lm_head.weight = self.tok_emb.weight
        # GPT-2 style init — scaled for residual-path stability
        self.apply(self._init_weights)
        for p_name, p in self.named_parameters():
            if p_name.endswith("c_proj.weight"):
                nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * cfg.n_layer))

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Embedding):
            nn.init.normal_(m.weight, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.cfg.block_size
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))  # (B, T, d)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, V)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.cfg.block_size:]  # crop to block_size
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature   # last token only
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float("inf")
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_id), dim=1)
        return idx

# Smoke test
cfg = GPTConfig(block_size=64, vocab_size=100, n_layer=2, n_head=4, d_model=32)
model = GPT(cfg)
idx = torch.randint(0, cfg.vocab_size, (2, 8))
logits, loss = model(idx, targets=idx)
print(f"logits: {list(logits.shape)}  loss: {loss.item():.3f}")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")

Read it. Count the lines — ~120, including the generate loop and weight init. This is the entire jukebox cabinet. Not a simplification, not a toy: the same class, with bigger config numbers, is what OpenAI pressed into the GPT-2 record. Karpathy released essentially this code as nanoGPT in 2022; it's been the reference implementation ever since.
One detail worth flagging in the generate method: the three knobs on the front panel of the jukebox are temperature, top_k, and the implicit “sample from the softmax.” Temperature is how drunk the DJ is — crank it up and the needle picks stranger records; set it near zero and the DJ puts the same hit song on repeat. top_k is what records the DJ is even allowed to consider — the rest of the jukebox is locked. top_k=1 is greedy decoding, the most popular song on repeat. We'll unpack all of that in the sampling lesson; for now, note that the song changes every time you press play because multinomial is rolling dice.
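A quick feel for the temperature knob, on made-up logits rather than real model output:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # illustrative next-token scores
for temp in (0.2, 1.0, 2.0):
    probs = F.softmax(logits / temp, dim=-1)
    print(f"T={temp}: {[round(p, 3) for p in probs.tolist()]}")
# low T: nearly all mass on the top logit — the hit song on repeat
# high T: the distribution flattens toward uniform — stranger records
```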
# Load OpenAI's pretrained GPT-2 weights into our GPT class.
# HuggingFace names the tensors slightly differently; this map resolves it.
from transformers import GPT2LMHeadModel

def load_gpt2_pretrained(size="gpt2"):
    configs = {
        "gpt2":        dict(n_layer=12, n_head=12, d_model=768),
        "gpt2-medium": dict(n_layer=24, n_head=16, d_model=1024),
        "gpt2-large":  dict(n_layer=36, n_head=20, d_model=1280),
        "gpt2-xl":     dict(n_layer=48, n_head=25, d_model=1600),
    }[size]
    cfg = GPTConfig(**configs, block_size=1024, vocab_size=50257, dropout=0.0)
    model = GPT(cfg)
    hf = GPT2LMHeadModel.from_pretrained(size)
    # hf.transformer's keys ("wte.weight", "h.0...") match our map; the full
    # LMHeadModel state_dict prefixes everything with "transformer."
    hf_sd, sd = hf.transformer.state_dict(), model.state_dict()
    # HF uses Conv1D (transposed linear) in attn/mlp — need to transpose those.
    transpose = ["attn.c_attn.weight", "attn.c_proj.weight",
                 "mlp.c_fc.weight", "mlp.c_proj.weight"]
    name_map = {
        "wte.weight": "tok_emb.weight",
        "wpe.weight": "pos_emb.weight",
        "ln_f.weight": "ln_f.weight",
        "ln_f.bias": "ln_f.bias",
    }
    for i in range(cfg.n_layer):
        for hf_k, our_k in [
            ("ln_1.weight",        "ln1.weight"),
            ("ln_1.bias",          "ln1.bias"),
            ("attn.c_attn.weight", "attn.c_attn.weight"),
            ("attn.c_attn.bias",   "attn.c_attn.bias"),
            ("attn.c_proj.weight", "attn.c_proj.weight"),
            ("attn.c_proj.bias",   "attn.c_proj.bias"),
            ("ln_2.weight",        "ln2.weight"),
            ("ln_2.bias",          "ln2.bias"),
            ("mlp.c_fc.weight",    "mlp.c_fc.weight"),
            ("mlp.c_fc.bias",      "mlp.c_fc.bias"),
            ("mlp.c_proj.weight",  "mlp.c_proj.weight"),
            ("mlp.c_proj.bias",    "mlp.c_proj.bias"),
        ]:
            name_map[f"h.{i}.{hf_k}"] = f"blocks.{i}.{our_k}"
    loaded = 0
    with torch.no_grad():
        for hf_name, our_name in name_map.items():
            t = hf_sd[hf_name]
            if any(hf_name.endswith(s) for s in transpose):
                t = t.t()
            sd[our_name].copy_(t)
            loaded += 1
    print(f"params: {sum(p.numel() for p in model.parameters()):,}")
    print(f"loaded {loaded} / {len(name_map)} pretrained tensors")
    return model

# Load it, generate with it
model = load_gpt2_pretrained("gpt2")
model.eval()
from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2")
prompt = "In a shocking finding, scientist discovered a herd of unicorns"
idx = torch.tensor(tok.encode(prompt)).unsqueeze(0)
out = model.generate(idx, max_new_tokens=40, temperature=0.8, top_k=40)
print(tok.decode(out[0].tolist()))

params: 124,439,808
loaded 148 / 148 pretrained tensors
prompt: "In a shocking finding, scientist discovered a herd of unicorns"
→ "living in a remote, previously unexplored valley, in the Andes Mountains.
Even more surprising to the researchers was the fact that the unicorns spoke..."

Let that land. The same 120-line class you just read is structurally compatible with OpenAI's trained vinyl — once you account for HuggingFace's Conv1D storing its linears transposed. Copy the tensors across, set the model to eval(), feed it a prompt, and the needle drops on coherent English. Same computation, same diagram, 124 million learned numbers. A silent cabinet an hour ago; a jukebox playing a new song now.
- tiny_gpt.py (NumPy, forward only) ←→ class GPT(nn.Module): ~120 lines, train + generate — autograd replaces manual shape bookkeeping; GELU replaces ReLU; dropout added
- random 0.02 init for every param ←→ special init: c_proj std *= 1/sqrt(2·N) — keeps the residual path from exploding as depth grows (GPT-2 paper, §2.3)
- lm_head = new (d, V) matrix ←→ self.lm_head.weight = self.tok_emb.weight — one line of weight tying, ~40M parameters saved
- smoke test: logits.shape == (B, T, V) ←→ load_gpt2_pretrained("gpt2") — same cabinet, OpenAI's pressed vinyl, and the needle drops in a few lines
Tying the wrong tensors. nn.Embedding.weight has shape (V, d). nn.Linear(d, V).weight also has shape (V, d) — PyTorch stores it as (out_features, in_features). So self.lm_head.weight = self.tok_emb.weight works because both are (V, d). If you accidentally write self.lm_head.weight = self.tok_emb.weight.T you'll get a shape mismatch — or worse, a silent bug if the dimensions happen to align.
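A standalone sanity check of those shapes, at real GPT-2 sizes:

```python
import torch.nn as nn

V, d = 50257, 768
emb  = nn.Embedding(V, d)
head = nn.Linear(d, V, bias=False)
# nn.Linear stores weight as (out_features, in_features) — so both are (V, d)
print(tuple(emb.weight.shape), tuple(head.weight.shape))
head.weight = emb.weight  # shapes already match: bind directly, no .T
```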
Forgetting the final LayerNorm. Every modern GPT has an ln_f between the last transformer block and lm_head. Skip it and training either diverges or plateaus at a bad loss. It's one line, it's always there, it's easy to miss when you think “attention + FFN × N and I'm done.”
The residual-path init. GPT-2's paper initializes c_proj.weight — the last linear in attention and in the MLP — with std 0.02 / sqrt(2·N) instead of just 0.02. The division keeps variance from compounding as the residual stream passes through N blocks. Skip this and gradients explode on deep models. Copy it from nanoGPT; don't reinvent it.
Dropout during generate. If you forget model.eval() before sampling, dropout is still active and you're corrupting activations randomly on every step. The needle skips. Generations look confused. Always model.eval() for inference, model.train() to come back.
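The dropout pitfall in a few lines: in train mode a Dropout layer zeroes activations at random and rescales the survivors; in eval mode it is the identity.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # random entries zeroed, survivors scaled up to 2.0

drop.eval()
assert torch.equal(drop(x), x)  # identity: safe for generation
print("eval mode: dropout off")
```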
Grab input.txt from Karpathy's char-rnn repo (about 1.1MB of Shakespeare concatenated). Use a character-level tokenizer (vocab_size ≈ 65) for speed. Configure GPTConfig(block_size=128, vocab_size=65, n_layer=4, n_head=4, d_model=128) — that's about 800K parameters; it runs on a laptop CPU in an evening or a T4 GPU in a few minutes.
Train for 5000 steps with AdamW, lr=3e-4, batch_size=32. Sample from the model every 500 steps. You'll hear four stages of the record coming into focus:
- Step 0: random characters ("q3x!p!v") — static on the vinyl.
- Step 500: mostly real characters, random sequences.
- Step 2000: recognizable words, bad grammar ("thee not thy lord the king").
- Step 5000: almost-plausible Shakespeare — verse structure, speaker attributions, some syntactic coherence. Meaning still garbage, but the form is right. The DJ is sober-ish.
Bonus: try n_layer=6, d_model=192 and compare. The form locks in faster, and the model becomes worth reading out loud for comic effect.
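A sketch of the data pipeline and training step for this exercise. The stand-in corpus, the tiny stand-in model, and the get_batch helper are all illustrative: for the real run, read input.txt and instantiate GPT(GPTConfig(...)) with the config above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for input.txt so this sketch runs on its own
text = "From fairest creatures we desire increase,\n" * 100
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}  # char-level tokenizer
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

def get_batch(block_size=128, batch_size=32):
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # next-char targets
    return x, y

# Stand-in model; swap in GPT(GPTConfig(block_size=128, vocab_size=65, ...)) for real
model = nn.Sequential(nn.Embedding(len(chars), 64), nn.Linear(64, len(chars)))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(20):  # 5000 steps in the real run, sampling every 500
    x, y = get_batch()
    logits = model(x)   # (B, T, V)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"loss after 20 steps: {loss.item():.3f}")
```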
What to carry forward. A GPT is four things in sequence: token + position embeddings, N identical transformer blocks (each = pre-norm attention + pre-norm FFN, both residual), a final LayerNorm, and a tied-weight linear head to the vocabulary. ~120 lines of PyTorch. Every open LM you read about — Llama, Mistral, Qwen, Gemma — is a handful of small edits to this cabinet: swap LayerNorm for RMSNorm, learned absolute positions for RoPE, GELU for SwiGLU, dense attention for grouped-query. The jukebox you just built is the jukebox they all ship.
Next up — Grouped Query Attention. The vinyl you just pressed has one quiet problem: every attention head keeps its own full-sized K and V record, and at inference time the KV cache grows linearly with the number of heads. Llama's fix, grouped-query attention, makes several query heads share one key/value record — carpooling instead of each head driving solo. Same cabinet, tighter grooves, a cheaper song at the same fidelity. That's the next lesson.
- [01] Radford, Wu, Child, Luan, Amodei, Sutskever, "Language Models are Unsupervised Multitask Learners" · OpenAI, 2019 — the GPT-2 paper, including architecture and init details
- [02] Brown et al., "Language Models are Few-Shot Learners" · NeurIPS 2020 — GPT-3: same architecture, 117× more parameters
- [03] Andrej Karpathy · github.com/karpathy/nanoGPT — the reference implementation this lesson follows
- [04] Press, Wolf, "Using the Output Embedding to Improve Language Models" · EACL 2017 — the weight-tying paper
- [05] Zhang, Lipton, Li, Smola, "Dive into Deep Learning" · d2l.ai
- [06] Kaplan et al., "Scaling Laws for Neural Language Models" · OpenAI, 2020 — parameters, data, compute: power laws all the way down