Transformer Block

Attention + MLP + norms + residuals — one layer.

Hard · ~15 min read · lesson 3 of 3

Here is the lie every transformer diagram tells you: a transformer is a tall tower. A monolithic 96-story skyscraper of matrix multiplies, purpose-built floor by floor, which is why it cost a hundred million dollars to train. The diagram shows you the tower. It does not show you the brick.

A transformer is not a tower. It is one Lego brick, stamped out of a mold, snapped on top of itself N times. GPT-2 small is the same brick stacked 12 times. GPT-3 is the same brick stacked 96 times. LLaMA, Mistral, Claude, every decoder-only model you've heard of — same brick, different N. The brick has a fixed shape. It takes tensors of shape (B, T, d_model) in, it returns tensors of the exact same shape out. That's not a coincidence. That is the entire design.

This lesson is about the brick. Once you can hold the brick in your head — two sub-layers, two residual adds, two layer-norms, one MLP — you can stop memorizing architectures and start reading them. A 175B-parameter model is not a new invention; it is the same brick we are about to build, stacked until the GPUs cry.

Transformer block (personified)
I am one Lego brick. Tokens come in, look at each other through my attention, think privately through my feed-forward, and leave a little smarter — same shape they arrived in. Snap me onto another me and you have a 2-layer model. Snap ninety-six of me together and you have GPT-3. I am not the tower. I am the brick the tower is made of.

Everything you've met so far was a part, not a whole. Token embeddings turned strings into vectors. Positional encodings gave those vectors an address. Multi-head attention let each token look at every other token. An MLP gave each token a moment alone to think. Layer norm kept the statistics from drifting. Residual connections kept the gradients alive. Great parts. Scattered across six different lessons. Now they all snap into the same brick.

The transformer block is the one repeating unit that wraps all of that into a single composable layer. Same shape in, same shape out, different learned weights per copy. GPT-2 small stacks 12 bricks. GPT-3 stacks 96. Scale is literally just how many bricks you line up.
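To make the snapping concrete, here is a minimal sketch — with a stand-in brick, not a real one — of why "same shape in, same shape out" is all stacking requires:

```python
import numpy as np

def brick(x):
    # stand-in for a real transformer block: any map that
    # preserves the (B, T, d_model) shape can be stacked
    return x + 0.01 * np.tanh(x)

x = np.random.default_rng(0).standard_normal((2, 16, 64))  # (B, T, d_model)
for _ in range(12):              # GPT-2 small depth: 12 identical bricks
    x = brick(x)
print(x.shape)                   # (2, 16, 64) — never changed
```

The loop body never inspects which layer it is on; that indifference is exactly what the real block buys with its fixed signature.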

Here is the brick in ASCII — the pre-norm variant every modern codebase uses. Data flows bottom to top. The two nodes are residual additions; the two LN boxes are layer-norms applied before each sub-layer. Look at the top and bottom arrows — the shapes are identical, which is the whole reason the brick snaps onto itself.

          ▲  out  (B, T, d_model)
          │
          ⊕ ◄────────────┐    residual #2
          │              │
        ┌─┴──┐           │
        │ MLP│           │    (d → 4d → d)
        └─┬──┘           │
          │              │
        ┌─┴──┐           │
        │ LN │           │    layer-norm before FFN
        └─┬──┘           │
          ├──────────────┘
          │
          ⊕ ◄────────────┐    residual #1
          │              │
        ┌─┴──┐           │
        │Attn│           │    multi-head self-attention
        └─┬──┘           │
          │              │
        ┌─┴──┐           │
        │ LN │           │    layer-norm before Attn
        └─┬──┘           │
          ├──────────────┘
          │
          ▲  in   (B, T, d_model)
one transformer block — pre-norm (GPT-2 style)

The brick is two lines of math. Given an input x of shape (B, T, d_model):

the transformer block — pre-norm form
x   ←   x   +   Attn( LN(x) )          # residual #1

x   ←   x   +   FFN ( LN(x) )          # residual #2

Two sub-layers, two residual adds, two layer-norms. Shape in equals shape out — which is exactly what lets you stack the brick N times without rewiring anything. If you can read those two lines, you can read the forward pass of every GPT-family model ever shipped. The rest is weight counts and marketing.

Look at the brick again and something pops out: it has two halves. Top half is feed-forward-and-residual. Bottom half is attention-and-residual. They are not doing the same job, and that's the single most useful thing to know about the whole architecture.

[interactive: transformer block FLOPs explorer — one block at d = 512, T = 512. The MLP dominates the block's compute, attention takes most of the rest, and the layer-norms and residual adds are negligible. The attention projections scale as T·d², the attention matmuls as T²·d. Sliders for d_model and seq_len.]

Walk through the diagram component by component. The attention sub-layer mixes information across tokens — this is the only place in the whole brick where one token's representation is touched by its neighbours. The feed-forward sub-layer then runs on each token independently, with the same MLP weights applied at every position. Layer-norm keeps activations from drifting across the stack. The residuals — the x + ... part — are what make the whole tower trainable in the first place.
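The relative costs can be sanity-checked with a back-of-envelope count. This sketch counts a multiply-add as 2 FLOPs and ignores softmax, layer-norm, and the residual adds; counting conventions differ between sources, so treat the absolute numbers as rough:

```python
# rough per-block FLOP count (multiply-add = 2 FLOPs; softmax/LN ignored)
d, T = 512, 512
qkv  = 2 * T * (d * 3 * d)        # QKV projection
mix  = 2 * (T * T * d) * 2        # Q·Kᵀ and attn·V matmuls
proj = 2 * T * (d * d)            # output projection
ffn  = 2 * T * (d * 4 * d) * 2    # up- and down-projection (d → 4d → d)
attn = qkv + mix + proj
print(f"attention: {attn/1e9:.2f} GFLOPs")
print(f"ffn:       {ffn/1e9:.2f} GFLOPs ({ffn/(attn+ffn):.0%} of the block)")
```

As T grows past d, the T²·d attention matmuls take over; at T = d they are still the minority and the FFN dominates.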

Residual connection (personified)
I am the + x. I look like a trivial wire, but without me this building falls over. My gradient is a clean 1 no matter how tall you stack the network — every brick upstream can reach every brick downstream through me. I'm why you can train a 96-layer transformer and it actually learns.

Here is the quiet miracle of the design: the brick stacks. Not metaphorically — in the literal PyTorch sense that you can write nn.Sequential(*[Block() for _ in range(96)]) and it just works. The reason is two promises the brick keeps, and a historical fact about what happens when you break them.

Promise one: shape in equals shape out. No funky broadcasts, no dimensional surgery between layers. Every brick sees exactly the same tensor shape its sibling one floor down saw. That's the mechanical half.

Promise two: gradients survive the climb. The residual adds are not a stylistic choice — they are a load-bearing wall. Every time you write x + Attn(LN(x)), you are handing the next layer an unobstructed wire back to the input. When backprop runs, gradients flow through that wire with a derivative of exactly 1, bypassing whatever nonlinear mess the sub-layer added. Stack 96 bricks without that wire and you re-run the old RNN horror show of vanishing gradients — the signal gets diluted through each layer until the bottom of the stack effectively stops learning.

Promise two is also why pre-norm beat post-norm. Put the layer-norm inside the residual path (post-norm, the original paper) and the identity wire is no longer clean — gradients have to fight through an LN on every floor. Move the LN outside the residual path (pre-norm, the modern default) and the wire is pristine again. Same brick, same Lego studs, one screw moved — and the difference is the ability to train past about 20 layers without a warmup schedule from hell.
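In code, the one moved screw looks like this — a schematic sketch with a stand-in sublayer, not a full block:

```python
import numpy as np

def ln(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def sublayer(x):
    return np.tanh(x)            # stand-in for Attn or FFN

def post_norm(x):                # original paper: LN sits ON the residual path
    return ln(x + sublayer(x))   # gradients must pass through LN every layer

def pre_norm(x):                 # modern default: identity path untouched
    return x + sublayer(ln(x))   # the +x wire bypasses LN entirely

x = np.random.default_rng(0).standard_normal((2, 4, 8))
assert post_norm(x).shape == pre_norm(x).shape == x.shape
```

Both variants preserve shape; only pre-norm preserves the clean identity wire.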

Zoom into the FFN half of the brick. It's the boring, beautiful workhorse — a two-layer MLP with a nonlinearity between them. It's also, for almost every transformer ever trained, where most of the parameters live.

feed-forward network — per-token MLP
FFN(x)  =   GELU( x · W₁ + b₁ ) · W₂ + b₂

shapes:
  W₁ :  (d_model, 4·d_model)     # expand
  W₂ :  (4·d_model, d_model)     # contract

The FFN is applied per-token — no mixing across positions.

The ratio is not magic, but it is load-bearing. The original paper set the inner dimension to 4·d_model, and every major transformer since has kept that shape (give or take — GLU-style variants use ~8·d_model/3 to keep the parameter count the same after splitting). The inner expansion gives the MLP room to route and combine features before projecting back down.
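The ~8·d_model/3 figure falls out of simple parameter matching. A sketch (the d = 4096 width is LLaMA-7B scale, used here for illustration; real models round the hidden size up to a hardware-friendly multiple):

```python
d = 4096                       # illustrative width (LLaMA-7B scale)

standard = 2 * d * (4 * d)     # W1 (d→4d) + W2 (4d→d): 8·d² params
# SwiGLU has THREE matrices — gate (d→h), up (d→h), down (h→d): 3·d·h.
# Matching 3·d·h = 8·d² gives h = 8d/3.
h = 8 * d // 3
swiglu = 3 * d * h

print(h)                       # 10922 — LLaMA-7B rounds this up to 11008
print(swiglu / standard)       # ≈ 1.0 by construction
```

Same parameter budget, different internal wiring — which is why headline parameter counts stay comparable across FFN variants.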

Crucially, the FFN is applied independently to every position. Same weights, same computation, different token. The MLP does not know that other tokens exist. That sounds like a limitation until you remember that attention already did the cross-token mixing on the previous sub-layer — the division of labor is the whole point. Each half of the brick has one job and does only that job.
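You can verify the no-mixing claim directly: shuffle the tokens, and the FFN's outputs shuffle with them, otherwise untouched. A quick numpy check (ReLU stands in for GELU to keep it short):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8
W1 = rng.standard_normal((d, 4 * d))
W2 = rng.standard_normal((4 * d, d))

def ffn(x):
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU for simplicity

x = rng.standard_normal((T, d))
perm = rng.permutation(T)

# permuting positions before or after the FFN gives the same result:
# each token is processed alone, so order cannot matter
assert np.allclose(ffn(x)[perm], ffn(x[perm]))
print("FFN never mixed tokens")
```

An attention sub-layer would fail this test immediately — permuting its inputs changes what every token sees.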

[interactive: scale the block — stack them deep. Preset: GPT-2 small, 12 × block at d = 768, T = 1024. Per-block params ≈ 12·d² ≈ 7.08 M; blocks 84.97 M + embeddings 38.60 M ≈ 123.57 M total; fp16 weights ≈ 247 MB; ≈ 160 M FLOPs per token (≈ 2·params, a useful rule of thumb). Sliders for block count, d_model, and seq_len.]
Scale N up and watch what doesn't change: the input shape, the output shape, the per-block structure. Watch what does: the parameter count and the FLOP budget, both scaling linearly in N. That's what “stacking bricks” actually means in practice. GPT-2 small: 12 bricks, d_model = 768, 124M parameters. GPT-2 medium: 24 bricks, d_model = 1024, 355M. Large and XL keep adding bricks and widening d_model. Same Lego set. More pieces.

Per-block parameters are roughly 12 · d_model² — four matrices of shape d_model × d_model for attention (W_Q, W_K, W_V, W_O) and eight d_model²-worth in the FFN (the expansion means W₁ and W₂ are each 4 · d_model² in size). At d_model = 768 that's about 7M parameters per brick. Multiply by 12 bricks and the stack alone is ~85M — essentially the entirety of GPT-2 small.
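The arithmetic, in four lines (weight matrices only — biases and the two LayerNorms add a few thousand more, which is why exact framework counts come out slightly higher):

```python
d = 768
attn = 4 * d * d            # W_Q, W_K, W_V, W_O — four d×d matrices
ffn  = 2 * 4 * d * d        # W1 (d→4d) and W2 (4d→d): 8·d²
per_block = attn + ffn      # 12·d² weights per brick

print(f"per block: {per_block:,}")        # 7,077,888
print(f"12 blocks: {12 * per_block:,}")   # 84,934,656 ≈ 85M
```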

Feed-forward MLP (personified)
I am the per-token computer. Attention does the gossip; I do the thinking. Each token gets my full undivided attention, one at a time, with the same weights every time. I hold two thirds of the parameters in this brick. If the model “knows facts,” they live in my rows.

Three layers of code, same pattern every lesson in this series uses. First the forward pass of a single brick in numpy with every step visible — no autograd magic, just matrix multiplies. Then the one-line PyTorch built-in. Then a hand-rolled module that mirrors every production codebase. Each layer is a shortcut for the one below it, never magic.

layer 1 — numpy · transformer_block_numpy.py
python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu  = x.mean(axis=-1, keepdims=True)
    var = x.var (axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (the GPT-2 variant); exact GELU uses erf
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_qkv, W_o):
    B, T, d = x.shape
    qkv = x @ W_qkv                          # (B, T, 3d)
    q, k, v = np.split(qkv, 3, axis=-1)      # each (B, T, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v @ W_o         # (B, T, d)

def ffn(x, W1, W2):
    return gelu(x @ W1) @ W2                 # (B,T,d) -> (B,T,4d) -> (B,T,d)

def transformer_block(x, params):
    # pre-norm: LN happens before each sub-layer; residual is added AFTER
    h = x + attention(layer_norm(x), params['W_qkv'], params['W_o'])
    h = h + ffn      (layer_norm(h), params['W1'   ], params['W2' ])
    return h

# random weights just to show the shapes flow
rng  = np.random.default_rng(0)
d, T, B = 64, 5, 2
x = rng.standard_normal((B, T, d))
params = {
    'W_qkv': rng.standard_normal((d, 3*d)) * 0.02,
    'W_o'  : rng.standard_normal((d,   d)) * 0.02,
    'W1'   : rng.standard_normal((d, 4*d)) * 0.02,
    'W2'   : rng.standard_normal((4*d, d)) * 0.02,
}
y = transformer_block(x, params)
print("input shape: ", x.shape)
print("output shape:", y.shape)
print(f"max |Δ|: {np.abs(y - x).max():.4f} (residual kept identity path alive)")
stdout
input shape:  (2, 5, 64)
output shape: (2, 5, 64)
max |Δ|: 0.3741 (residual kept identity path alive)

Production PyTorch gives you this as a single layer. Pass in the shape arguments and you get a drop-in brick. The only catch is that nn.TransformerEncoderLayer defaults to post-norm unless you pass norm_first=True — which you want if you are training from scratch in 2025.

layer 2 — pytorch built-in · transformer_block_torch.py
python
import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(
    d_model        = 768,
    nhead          = 12,
    dim_feedforward= 4 * 768,      # the 4× rule
    activation     = 'gelu',
    norm_first     = True,         # pre-norm (modern default)
    batch_first    = True,
)

x = torch.randn(2, 5, 768)
y = block(x)
print("output shape:", y.shape)
print(f"parameters:   {sum(p.numel() for p in block.parameters()):,}")
stdout
output shape: torch.Size([2, 5, 768])
parameters:   7,087,872

Rolling your own brick is a fifteen-line module. This is the version you want to read, copy, and keep in your head — the code that will appear, nearly verbatim, in every tutorial, blog post, and real implementation from nanoGPT onward.

layer 3 — pytorch hand-rolled · transformer_block.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.ln1  = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2  = nn.LayerNorm(d_model)
        self.mlp  = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, attn_mask=None):
        # pre-norm, residual added *after* each sub-layer
        h        = self.ln1(x)
        h, _     = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x        = x + h                                   # residual #1
        x        = x + self.mlp(self.ln2(x))               # residual #2
        return x

blk = TransformerBlock(d_model=768, n_heads=12)
x   = torch.randn(2, 5, 768)
y   = blk(x)
print("output shape:", y.shape)
print(f"parameters:   {sum(p.numel() for p in blk.parameters()):,}")
print("residual preserved shape:", y.shape == x.shape)
stdout
output shape: torch.Size([2, 5, 768])
parameters:   7,087,872
residual preserved shape: True
numpy → hand-rolled pytorch

h = x + attention(layer_norm(x), ...)  ←→  x = x + self.attn(self.ln1(x), ...)
same two-liner; pytorch adds autograd + GPU + dropout

gelu(h @ W1) @ W2  ←→  nn.Sequential(Linear, GELU, Linear)
the 4× inner dim is the mlp_ratio argument

hand-written softmax + mask  ←→  nn.MultiheadAttention(attn_mask=...)
built-in handles the Q/K/V split, heads, and masking

Gotchas

LN placement: if you copy a brick from an old tutorial and find it will not train past a few layers, check whether it's post-norm. Flip to pre-norm and watch the loss curve go from spiky to smooth.

The 4× is not universal: LLaMA-style SwiGLU FFNs use ~8·d_model/3 for the hidden dim. The reason is that SwiGLU splits the inner projection into two halves — keeping total parameters equal to the standard 4× block requires shrinking. If you are comparing parameter counts between architectures, check which FFN variant they are using.

FFN does NOT mix tokens: a common misread — the FFN is applied per-position with the same weights. It is a 1×1 conv over the sequence, not an attention-like mixer. Every piece of cross-token communication in a transformer happens in the attention sub-layer, full stop.

Dropout placement: some references put dropout inside the attention scores; some put it on the sub-layer output; some do both. For reproducing a paper, check the exact placement. For a new model, put it on the sub-layer output and move on with your life.

Stack four blocks and watch it learn

Take the TransformerBlock class from layer 3. Wrap it in a tiny language model: an embedding layer (vocab_size=256, byte-level), positional encoding, 4 stacked blocks (d_model=128, n_heads=4), a final layer-norm, and a linear head back to vocab size. Use a causal mask.

Train it to predict the next character of a 10KB chunk of text (the opening of any book from Project Gutenberg will do). Batch size 32, sequence length 128, AdamW at lr=3e-4. A single RTX card or even a laptop CPU is enough.

Log the loss every 50 steps. Within 2000 steps the loss should drop from ~5.5 (random over 256 bytes) to ~2.0 (model has learned the alphabet's shape). Sample from it at the end — it will produce English-looking gibberish with correct spacing, plausible word lengths, and the occasional real word. That is four transformer blocks doing exactly what a hundred of them do at scale.

Bonus: change N from 4 to 1 and train again. Note how much harder the model has to work to model the same sequence — depth matters, and you will feel it in the loss curve.

What to carry forward. A transformer is not a tower; it's one brick, snapped together N times. The brick itself is attention + FFN wrapped in two residuals and two layer-norms. Pre-norm is the modern default because it keeps the residual wire clean. The FFN half holds most of the parameters and runs per-token; the attention half is the only place tokens communicate. Shape in equals shape out, which is why the stack is a one-line for loop. Every modern LLM is this brick, stacked.

Cliffhanger — the brick wasn't built just for text. Look at the brick's input shape: (B, T, d_model). A batch of sequences of vectors. Nothing in that signature says “language.” Text isn't the only grid that fits into this brick — images are just 2D text if you squint. Chop a picture into 16×16 patches, flatten each patch into a vector, and you've got (B, T, d_model) where T is “number of patches” and each patch is a “word.” Snap the same brick you just built on top of that and you have a vision transformer — the model that walked into computer vision and politely retired a decade of CNN architectures. Same brick. Different input. That's next.
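A taste of that reshape, as a sketch: a ViT-style 16×16 patchify of a toy image in pure numpy (a real vision transformer adds a learned linear projection of each patch plus position embeddings):

```python
import numpy as np

H = W = 32; C = 3; P = 16                  # toy image, ViT-style patch size
img = np.arange(H * W * C, dtype=float).reshape(H, W, C)

# carve into non-overlapping P×P patches, flatten each into one vector
patches = (img.reshape(H // P, P, W // P, P, C)
              .swapaxes(1, 2)              # group the two grid axes together
              .reshape(-1, P * P * C))     # (T, patch_dim)

print(patches.shape)                       # (4, 768): T=4 "words", 768-dim each
```

Feed that through an embedding projection and the brick you just built accepts it without a single change.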
