Backprop Ninja

Derive backward for a 2-layer MLP by hand — checked live against finite differences.

Hard
~15 min read
lesson 4 of 6

You've seen backpropagation derived in the abstract. You've watched the chain rule march through a multi-layer network on paper. Comfortable? Good. Now prove it.

Close the tab with the autograd library. Open a blank editor. Derive the gradients of a two-layer MLP by hand — every op, every shape, every sum. No loss.backward(). No autograd.grad. Just you, the chain rule, and a lab assistant standing behind you with a ruler.

The lab assistant is finite differences. You write an analytic gradient — your hypothesis about how the loss changes when you nudge one value. The assistant nudges that value by a tiny epsilon, runs the forward pass twice, subtracts, divides. If your formula matches their measurement to about 1e-6, you're right. If not, they raise an eyebrow and you redo the math. No hand-waving possible. The numbers either tie out or they don't.
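The assistant's entire method fits in a few lines. A minimal sketch on a one-dimensional function (the function and the 1e-8 tolerance here are illustrative, not the widget's):

```python
import numpy as np

def f(x):
    return np.tanh(x) ** 2                              # any smooth scalar function

x, eps = 0.7, 1e-5
analytic = 2 * np.tanh(x) * (1 - np.tanh(x) ** 2)       # your hypothesis (chain rule)
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)         # the assistant: nudge, run twice, subtract, divide
assert abs(analytic - numeric) < 1e-8                   # they tie out or they don't
```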

Andrej Karpathy calls this exercise becoming a backprop ninja, and he's right to make a big deal of it. The first time every gradient passes, you stop being afraid of the backward pass forever. Ten minutes if you're sharp, two hours if you're learning it honestly. Both are fine. What matters is that every line you write gets checked.

Chain rule (personified)
I am one sentence: dL/dx = (dL/dy)·(dy/dx). That's it. If you know what comes out of a box and you know what the box does, I tell you what went in. Compose me n times and I still take exactly one line.
Lab assistant (personified)
I don't care about your derivation. I nudge x by ε, I run forward twice, I subtract, I divide. My answer is close enough to the truth that if yours doesn't match mine, yours is wrong. I'm watching.

Here is the exact network the checker below runs against. Four examples, three inputs, a hidden layer of five, three output classes, cross-entropy loss. Tiny on purpose — the assistant has to finite-difference every parameter, which costs O(P · 2 · forward), and we want the audit to finish in under a second.

forward pass
z1  =  X @ W1 + b1         # (N, H)
h   =  tanh(z1)             # (N, H)
z2  =  h @ W2 + b2          # (N, K)
p   =  softmax(z2)          # (N, K)
L   =  −(1/N) · Σᵢ log p[i, yᵢ]

Four parameters to get gradients for: W1 (3, 5), b1 (5,), W2 (5, 3), b2 (3,). The forward cache hands you everything you need on the way down: X, z1, h, z2, p. No peeking at gradients — those are your job. The assistant's job is to catch you if you botch one.
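For reference, here is that forward pass written out as runnable NumPy. The shapes match the text (N=4, D=3, H=5, K=3), but the seed and data are illustrative stand-ins, not the widget's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, K = 4, 3, 5, 3                       # examples, inputs, hidden units, classes

X  = rng.normal(size=(N, D))
y  = rng.integers(0, K, size=N)
W1 = rng.normal(size=(D, H)) * 0.1
b1 = np.zeros(H)
W2 = rng.normal(size=(H, K)) * 0.1
b2 = np.zeros(K)

def forward(X, y, W1, b1, W2, b2):
    z1 = X @ W1 + b1                          # (N, H)
    h  = np.tanh(z1)                          # (N, H)
    z2 = h @ W2 + b2                          # (N, K)
    zs = z2 - z2.max(axis=1, keepdims=True)   # shift logits for numerical stability
    p  = np.exp(zs) / np.exp(zs).sum(axis=1, keepdims=True)  # softmax, (N, K)
    L  = -np.log(p[np.arange(N), y]).mean()   # mean cross-entropy
    cache = (X, z1, h, z2, p)
    return L, cache
```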

One derivation you absolutely must have memorised: softmax followed by negative-log-likelihood. If you try to chain through softmax with a generic Jacobian you will suffer — and the assistant will watch you suffer. The fused rule is three symbols:

softmax + NLL — combined gradient
∂L/∂z2  =  (p − Y) / N
(1.1)

where p is the softmax probabilities and Y is the one-hot of the labels. Three terms. That's the whole thing. Every other gradient in this lesson is mechanical after (1.1) — composition of building blocks you already have, each one verified numerically before you move on.
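You can put (1.1) itself in front of the assistant before trusting it. A sketch with made-up logits and labels: nudge each entry of z2, measure the loss change, and compare against (p − Y) / N elementwise:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 4, 3
z2 = rng.normal(size=(N, K))
y  = rng.integers(0, K, size=N)

def loss(z2):
    zs = z2 - z2.max(axis=1, keepdims=True)
    p = np.exp(zs) / np.exp(zs).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(N), y]).mean(), p

L, p = loss(z2)
Y = np.zeros((N, K)); Y[np.arange(N), y] = 1.0
analytic = (p - Y) / N                        # equation (1.1)

eps = 1e-5
numeric = np.zeros_like(z2)
for i in range(N):
    for k in range(K):
        zp, zm = z2.copy(), z2.copy()
        zp[i, k] += eps
        zm[i, k] -= eps
        numeric[i, k] = (loss(zp)[0] - loss(zm)[0]) / (2 * eps)

assert np.abs(analytic - numeric).max() < 1e-8
```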

building blocks you already have

y = x @ W  ←→  dL/dx = dL/dy @ W.T;  dL/dW = x.T @ dL/dy

matrix-multiply backward — transpose on the other side

y = x + b (broadcast)  ←→  dL/db = dL/dy.sum(axis=0)

bias backward — sum across whatever axis was broadcast

y = tanh(x)  ←→  dL/dx = dL/dy * (1 − y**2)

tanh derivative — reuse the forward output

L = −(1/N) · Σᵢ log softmax(z)[i, yᵢ]  ←→  dL/dz = (p − onehot(y)) / N

memorise this one (1.1) — chaining softmax + log is a trap

The checker is below. Edit the body of backward(), press check, and watch four red X's turn green one at a time. The widget is the assistant made visible: it runs your analytic code, then re-runs the forward pass 2·P times with each parameter nudged by ±ε, assembles a central-difference gradient, and compares. Pass = max absolute error under 1e-4. That threshold is generous. If your math is right, you'll tie out closer to 1e-9.
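Under the hood, the assistant's sweep over every parameter looks roughly like this (a sketch of the idea, not the widget's actual source; `loss_fn` is assumed to close over `param` and return a scalar):

```python
import numpy as np

def numerical_grad(loss_fn, param, eps=1e-5):
    """Central-difference gradient of loss_fn() w.r.t. param.

    Re-runs the forward pass 2 * param.size times -- the O(P * 2 * forward)
    audit from the text. loss_fn must close over param and return a scalar.
    """
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = param[idx]
        param[idx] = old + eps; lp = loss_fn()
        param[idx] = old - eps; lm = loss_fn()
        param[idx] = old                      # restore before the next nudge
        grad[idx] = (lp - lm) / (2 * eps)
        it.iternext()
    return grad
```

The verdict is then just `np.abs(analytic - numerical_grad(loss_fn, param)).max() < 1e-4` for each parameter.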

backprop_ninja.py · fill in backward()

Recommended order: dW2 and db2 first — those are cleanest, and they unblock everything else. Then dh as an intermediate (it doesn't need to be returned, but it's the bridge from the output layer back to the hidden one). Then dz1 = dh * (1 − h**2), then dW1, then db1. If one tensor passes and the next one fails, the bug lives after the one that passed — chain-rule errors propagate, they don't teleport. The assistant is effectively giving you a binary search over your own derivation.
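If you get stuck, or want something to diff against after your checks go green, here is one way the finished backward() can look. It assumes the cache layout (X, z1, h, z2, p) and labels y from the forward pass above, and follows the recommended order exactly:

```python
import numpy as np

def backward(cache, y, W2):
    X, z1, h, z2, p = cache
    N = X.shape[0]
    Y = np.zeros_like(p); Y[np.arange(N), y] = 1.0

    dz2 = (p - Y) / N              # (1.1) -- the only /N in the whole pass
    dW2 = h.T @ dz2                # matmul backward: transpose on the other side
    db2 = dz2.sum(axis=0)          # bias backward: sum over the batch axis
    dh  = dz2 @ W2.T               # the bridge back into the hidden layer
    dz1 = dh * (1 - h ** 2)        # tanh backward: reuse the forward output
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2
```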

This is the payoff moment the whole lesson is built around. You derive one op, the assistant numerically verifies it, you move on. Derive, check, tie out. Derive, check, tie out. By the time the fourth green check lands, you haven't just written backprop — you've audited it.

Gotchas

Shape mismatch (broadcasting bit you). dW1 must match W1. If the assistant reports wrong shape, you almost certainly forgot a transpose — X.T @ dz1, not X @ dz1. This is the classic place where the assistant catches a subtle bug you couldn't have talked yourself out of: the math looked right on paper, the numbers disagree, the transpose was the difference.

Wrong axis on the bias sum. b1 has shape (H,) but it got broadcast across the batch dimension during the forward pass, so its gradient is dz1.sum(axis=0) — sum over the examples, not over the hidden units. Get the axis wrong and the assistant reports the right magnitude on the wrong shape. This is the second-most-common bug in the whole exercise.
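A toy upstream gradient makes the axis choice concrete (the numbers here are arbitrary, just to show the shapes):

```python
import numpy as np

dz1 = np.arange(8.0).reshape(4, 2)    # (N=4, H=2) upstream gradient
db1 = dz1.sum(axis=0)                 # sum over examples -> shape (H,), matches b1
wrong = dz1.sum(axis=1)               # sum over hidden units -> shape (N,), wrong
assert db1.shape == (2,) and wrong.shape == (4,)
print(db1)                            # -> [12. 16.]
```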

Off by a factor of N. Every gradient already includes the /N because it's baked into (1.1). Don't divide by N again inside the matmul gradients; the divide only happens once, at the softmax step.

Scraping past the threshold. Central difference with ε = 1e-5 is accurate to about 1e-9 on smooth functions, so a correct gradient clears the 1e-4 bar with room to spare. If yours is at 1e-3, that's not rounding — that's a bug the assistant is being polite about.

tanh' reuses the output. The derivative of tanh(x) is 1 − tanh(x)**2, which is 1 − h**2 — you already have h in the cache. No need to pass z1 through again.
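One line of the assistant confirms the identity, so you never need z1 a second time:

```python
import numpy as np

# 1 - h**2 really is tanh'(z1): check it against a central difference.
z1 = np.linspace(-2, 2, 5)
h = np.tanh(z1)
eps = 1e-6
numeric = (np.tanh(z1 + eps) - np.tanh(z1 - eps)) / (2 * eps)
assert np.abs((1 - h ** 2) - numeric).max() < 1e-9
```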

Once every tensor passes — and it will pass, because the assistant doesn't grade on vibes — take a second and look at what you wrote. Maybe twelve lines of NumPy. A decade ago this was the entire backward pass of a production-adjacent classifier; today it's a warm-up. You just implemented, by hand, the object that PyTorch's autograd constructs dynamically for every model you've ever trained — and you verified it, op by op, against finite differences. Not “it compiled”; not “the loss went down”; actually verified.

The reason this is the standard rite of passage is that once you can write backprop for a two-layer MLP, the failure modes of bigger networks start to make sense. Vanishing gradients? You now know exactly which @ and which * is shrinking them. Exploding gradients? Same question, other direction. Shapes not matching? You can read a stack trace without fear, because you know what shape should be there. The ninja badge isn't that you did the math; it's that the numbers tied out and you know they did.

Ninja, part two

Swap tanh for ReLU in the forward pass. Work out the new dz1 (hint: the derivative of ReLU is a 0/1 mask you can read straight off z1). Everything else stays identical. Then run the check — if your new derivation is right, the assistant should tie out the same way it did before.
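The swap really is two lines. A sketch with illustrative data (the mask comes straight off z1, as the hint says):

```python
import numpy as np

z1 = np.array([[-1.0, 0.5], [2.0, -0.3]])   # pre-activations, made up
dh = np.ones_like(z1)                        # pretend upstream gradient

h   = np.maximum(z1, 0)                      # forward: ReLU instead of tanh
dz1 = dh * (z1 > 0)                          # backward: the 0/1 mask
assert dz1.tolist() == [[0.0, 1.0], [1.0, 0.0]]
```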

Bonus: add a third hidden layer and re-derive. You'll notice the pattern — every new hidden layer adds exactly two gradient lines to backward(). That pattern is what makes autograd possible in the first place.

What to carry forward. You can now write backprop for an MLP without looking anything up, and — more importantly — you can prove it. The chain rule is one line; the building blocks (matmul backward, bias backward, element-wise backward) are four. The softmax + NLL fusion is the only memorisation tax. Everything else is composition, and finite differences is the lab assistant who keeps you honest while you're composing.

Next up — MLP from Scratch. You've verified the math. Now bolt it into a training loop and make something actually learn. Forward, backward (the one you just wrote), gradient descent update, repeat. The assistant will quiet down — real training runs don't finite-difference every step — but the confidence you just earned is what lets you read your own loss curve and know whether your backward pass is the thing at fault.
