Multi-Layer Backpropagation

Chain rule across arbitrarily deep networks.

Hard
~15 min read
Lesson 3 of 6

One-neuron backprop was a single-manager operation. One weight misbehaved, one gradient told it how to behave better, you updated and moved on. Easy org chart. Short meeting.

A multi-layer network is a corporate hierarchy. The loss is a customer complaint that lands on the CEO's desk — the output layer, which is the only one that ever sees the right answer. The CEO was wrong. Rather than fire herself, she walks down the org chart handing out blame notes: this much of the mistake was because of your numbers. Her VPs take their share, break it down further for their managers, and so on. All the way down to the first hidden layer — the intern who read the raw input. Every weight in the company gets a blame note proportional to how much it contributed to the miss. Then everyone adjusts.

That's backprop through depth. The math is still the chain rule, just now with a subscript per layer. This lesson is where the recursion shows up — the pattern that lets you compute gradients through a network of arbitrary depth in the same asymptotic time as the forward pass. It's the reason deep learning is computationally feasible. It's also why training deep nets stalled for nearly two decades, between Minsky's 1969 book and Rumelhart, Hinton, and Williams' 1986 rediscovery. Once you know the recursion it feels obvious. Before, it genuinely isn't.

Set up the company. Number the layers ℓ = 1, 2, …, L. The intern is layer 1, the CEO is layer L. Every manager does the same job — take what the layer below reported, multiply by your own weights, add a bias, squash through a nonlinearity, pass it up:

a generic dense layer
z_ℓ   =   W_ℓ · a_(ℓ-1)  +  b_ℓ                 linear combo
a_ℓ   =   f(z_ℓ)                                activation

Inputs to the first layer:    a_0 = x
Outputs of the last layer:    ŷ = a_L
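Before the blame chain, a sanity check on the forward step. A minimal NumPy sketch of one generic dense layer — the 3 → 4 shape and the tanh nonlinearity here are illustrative assumptions, not part of the lesson's running example:

```python
import numpy as np

rng = np.random.default_rng(0)

# one generic dense layer: 3 inputs → 4 outputs (shapes chosen for illustration)
W = rng.normal(0, 0.5, size=(4, 3))
b = np.zeros(4)
a_prev = np.array([0.6, -0.4, 0.9])   # a_(l-1): the report from the layer below

z = W @ a_prev + b    # linear combo
a = np.tanh(z)        # activation — passed up as a_l

assert z.shape == (4,) and a.shape == (4,)
```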

Now follow one complaint backwards. Define δ_ℓ ≡ ∂L / ∂z_ℓ — the blame note for layer ℓ's pre-activation. It's the same quantity you computed for a single neuron last lesson, just with a subscript. Once a manager holds their δ, the gradients they need to actually fix anything are trivial — weights and biases fall out immediately:

parameter gradients from δ — identical at every layer
∂L/∂W_ℓ   =   δ_ℓ · a_(ℓ-1)ᵀ            outer product: (out, in) shape
∂L/∂b_ℓ   =   δ_ℓ                        same shape as b_ℓ
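The shapes are worth verifying once in code. A minimal sketch (hypothetical 3-in, 4-out layer) confirming the outer-product rule:

```python
import numpy as np

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(3,))   # a_(l-1), activations from the layer below
delta  = rng.normal(size=(4,))   # delta_l, blame at this layer's pre-activation

dW = np.outer(delta, a_prev)     # dL/dW_l = delta_l · a_(l-1)^T → (out, in) = (4, 3)
db = delta                       # dL/db_l = delta_l → same shape as b_l

assert dW.shape == (4, 3)
assert dW[1, 2] == delta[1] * a_prev[2]   # entry (i, j) is delta_i · a_j
```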

So the only real question is: how does a middle manager get their δ in the first place? Answer: from the manager directly above, who already has theirs. That's the punchline — the one equation that makes deep learning possible, and the rule that runs the whole corporate blame chain:

the backprop recurrence — memorize this
δ_ℓ   =   (W_(ℓ+1)ᵀ · δ_(ℓ+1))  ⊙  f'(z_ℓ)

Base case (at the output layer):
δ_L   =   ∂L / ∂a_L  ⊙  f'(z_L)

where  ⊙  is elementwise multiplication.

Read the recurrence as a conversation. The manager above hands you their blame note δ_(ℓ+1). You pull it back through the weights that connected you to them (W_(ℓ+1)ᵀ — the transpose is how the blame flows against the direction of the forward pass). Then you mask by your own slope — f'(z_ℓ) — because a neuron that was sitting in a flat part of its activation didn't move the needle, so it doesn't deserve much blame. That's your δ. Hand it to the manager below. Repeat until you reach the intern. Every step is cheap and local. Total cost: one matmul per layer — the same asymptotic cost as the forward pass.
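The whole blame chain fits in one short loop. Here's a sketch of the recurrence for an arbitrary stack of layers — the 3 → 4 → 3 → 1 shape, ReLU hidden layers, linear output, and squared-error loss are assumptions, chosen so the base case reduces to δ_L = a_L − y:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 3 → 4 → 3 → 1 net: ReLU hidden layers, linear output, squared-error loss
sizes = [3, 4, 3, 1]
Ws = [rng.normal(0, 0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    a, As, Zs = x, [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = z if l == len(Ws) - 1 else np.maximum(0, z)   # linear output layer
        Zs.append(z); As.append(a)
    return As, Zs

def backward(As, Zs, y):
    grads = []
    delta = As[-1] - y                    # base case: delta_L = dL/da_L (f' = 1, linear out)
    for l in reversed(range(len(Ws))):
        grads.append((np.outer(delta, As[l]), delta))      # (dL/dW_l, dL/db_l)
        if l > 0:
            # the recurrence: pull delta back through W^T, mask by ReLU'
            delta = (Ws[l].T @ delta) * (Zs[l - 1] > 0)
    return grads[::-1]                    # reorder so grads[l] matches Ws[l]

x, y = np.array([0.6, -0.4, 0.9]), np.array([1.0])
As, Zs = forward(x)
grads = backward(As, Zs, y)
assert all(gW.shape == W.shape for (gW, _), W in zip(grads, Ws))
```

One matmul per layer on the way down, exactly mirroring the one matmul per layer on the way up.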

δ (delta) (personified)
I am the blame note. If you tell me how wrong the CEO was, I can walk myself down every floor of the building and deliver the right share to every manager on the way. I'm the only reason any weight in the network knows which way to move. Without me, nobody downstairs even knows there was a complaint.

Click through the steps below. Each click either advances the forward pass one layer or walks the blame one more floor down the org chart. Watch the forward values fill in left to right (cyan) — that's the company doing its actual job. Then the pink δ bubbles appear right to left — that's the walk of shame. The gradient sitting on each neuron is the blame note the recurrence just handed it.

[interactive widget: multi-layer backprop — δ flows right to left. A 3 → 4 → 3 → 1 net (ReLU hidden, linear output) on input (0.60, −0.40, 0.90); forward values fill in cyan, backward gradients in pink; Δ at each node = ∂L/∂z for that neuron; a stepper walks the passes and reports the loss.]

There's a catch hiding in the recurrence, and it's the thing that drives the next two lessons. At every layer, the blame note gets multiplied by f'(z). For sigmoid and tanh, that derivative is at most 0.25, and usually smaller. Every floor the note passes through, it shrinks. After twenty floors — (0.25)²⁰ ≈ 10⁻¹² — the intern gets a blame note that reads “you were wrong by zero point zero zero zero zero zero zero zero zero zero zero zero one.” The intern shrugs and does nothing. Your early layers stop training.

Run the arithmetic the other way and the same mechanism blows up. If the weights are large, the chain-rule product of Jacobians compounds: the intern gets a blame note the size of a small country and updates themselves into orbit. The model is cooked.

This is the vanishing gradient problem and its evil twin, the exploding gradient problem — and it was why sigmoid + deep networks just didn't work in the 1990s. The fix is a stack of architectural and initialisation choices we'll unpack over the next few lessons: ReLU (derivative is 1 where active, so the note survives), careful weight initialisation (Xavier/He — two lessons from now, and the whole reason it's in the curriculum), and residual connections (a standalone equation: a_ℓ = f(W · a_(ℓ-1)) + a_(ℓ-1), which adds an identity path the blame note can use when the multiplicative one collapses).

[interactive widget: gradient magnitude — how much signal reaches layer k from the loss, on a log₁₀ scale, with curves for sigmoid, tanh, and ReLU; a depth slider sets L and a residual toggle resets the derivative gain.]

Move the depth slider. With L = 10, all three activations still have usable gradients — the org chart is short enough that the note arrives with signal. By L = 20, sigmoid + naive init has collapsed to 10⁻⁹. By L = 30 it's under float precision — the early layers literally cannot train; the blame note arrives as a rounding error. Now tick “residual connections.” Every curve flattens or recovers. Residual paths add an identity gradient term ∂a_ℓ/∂a_(ℓ-1) = I + (chain stuff), so depth stops compounding the vanishing factor. That's the entire mathematical justification for ResNet, and it's also why every modern transformer has one of those x + attention(x) shortcuts wired around each sublayer.
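You can reproduce the widget's story in a few lines. The sketch below pulls a unit-norm blame note back through a stack of random layers, using a flat 0.25 slope to stand in for sigmoid's f'(z) — the width, init scale, and fixed slope are all illustrative assumptions — and shows how an identity (residual) path keeps the note alive:

```python
import numpy as np

rng = np.random.default_rng(0)

def backward_gain(depth, width=32, residual=False):
    """Norm of a unit blame note after descending `depth` floors.

    Each floor applies the recurrence delta <- W^T delta * f'(z), with f'
    fixed at sigmoid's best case, 0.25. With `residual`, an identity path
    is added alongside the multiplicative one.
    """
    delta = np.ones(width) / np.sqrt(width)              # unit-norm note from the CEO
    for _ in range(depth):
        W = rng.normal(0, 1.0 / np.sqrt(width), size=(width, width))
        pulled = 0.25 * (W.T @ delta)                    # through the weights, masked
        delta = delta + pulled if residual else pulled   # residual: identity survives
    return np.linalg.norm(delta)

print(f"plain:    {backward_gain(20):.2e}")                 # collapses toward zero
print(f"residual: {backward_gain(20, residual=True):.2e}")  # stays O(1)
```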

Gotchas

Gradient clipping fixes the other failure mode. When the chain-rule product explodes instead of collapses — common with bad init or recurrent nets — each step sends the model off to infinity. Clip the gradient norm to some ceiling — torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) — before calling optimizer.step(). Cheap, almost always correct, done.
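A minimal sketch of where that clip call sits in a training step — the tiny model and random data here are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)

x = torch.randn(8, 3)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = 0.5 * (model(x) - y).pow(2).mean()
loss.backward()
# rescale the whole gradient vector so its global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```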

Don't confuse δ with ∂L/∂a. δ is the blame at the pre-activation. The gradient at the post-activation is ∂L/∂a_ℓ = W_(ℓ+1)ᵀ · δ_(ℓ+1) — the blame pulled back through the weights, but not yet masked by f'(z_ℓ). This distinction matters when you implement backprop by hand — off by one layer and nothing works.

Activation derivatives need the pre-activation. f'(z_ℓ), not f'(a_ℓ). For sigmoid there's a shortcut (σ'(z) = σ(z)(1-σ(z)) = a(1-a)) because the derivative can be written in terms of the output. For most other activations you have to stash the pre-activation on the forward pass. Forget, and your backward pass quietly lies — the worst kind of bug.
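A quick numeric check of that sigmoid shortcut — the closed form a(1 − a) agrees with a central finite difference at an arbitrary point:

```python
import math

def sigmoid(z): return 1 / (1 + math.exp(-z))

z = 0.7                     # arbitrary pre-activation
a = sigmoid(z)

eps = 1e-6
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
shortcut = a * (1 - a)      # sigma'(z) written in terms of the forward output

assert abs(finite_diff - shortcut) < 1e-8
```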

Three implementations, same company, three different management styles. Pure Python runs the recurrence with explicit nested loops — every blame note computed by hand, painful but unambiguous. NumPy compresses the per-layer step into three matrix operations. PyTorch lets you write only the forward pass and throws away the whole lesson — you declare what the company does, it figures out the blame chain for you.

layer 1 — pure python · mlp_backprop_scratch.py
python
import math, random

def sigmoid(z): return 1/(1+math.exp(-z))
def relu(z): return z if z > 0 else 0.0

random.seed(0)
# 3 → 4 → 1 net, sigmoid output, MSE loss
def init_matrix(inD, outD): return [[random.gauss(0, 0.5) for _ in range(inD)] for _ in range(outD)]
W1 = init_matrix(3, 4); b1 = [random.gauss(0, 0.1) for _ in range(4)]
W2 = init_matrix(4, 1); b2 = [0.0]

def train_step(x, y, lr=0.3):
    global W1, b1, W2, b2
    # Forward
    z1 = [sum(W1[i][j]*x[j] for j in range(3)) + b1[i] for i in range(4)]
    a1 = [relu(v) for v in z1]
    z2 = sum(W2[0][j]*a1[j] for j in range(4)) + b2[0]
    yhat = sigmoid(z2)
    loss = 0.5 * (yhat - y) ** 2

    # Backward
    d2 = (yhat - y) * yhat * (1 - yhat)              # δ at output
    dW2 = [[d2 * a1[j] for j in range(4)]]           # ∂L/∂W2 = δ · a1ᵀ
    db2 = [d2]

    d1 = [W2[0][j] * d2 * (1.0 if z1[j] > 0 else 0.0)
          for j in range(4)]                         # δ at hidden (ReLU')
    dW1 = [[d1[i] * x[j] for j in range(3)] for i in range(4)]
    db1 = list(d1)

    # Update
    for i in range(4):
        for j in range(3):
            W1[i][j] -= lr * dW1[i][j]
        b1[i] -= lr * db1[i]
    for j in range(4):
        W2[0][j] -= lr * dW2[0][j]
    b2[0] -= lr * db2[0]
    return loss

for step in range(1000):
    loss = train_step([0.6, -0.4, 0.9], 1.0)
print(f"final loss = {loss:.4f}")
stdout
final loss = 0.0183
layer 2 — numpy · mlp_backprop_numpy.py
python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, size=(4, 3))   # layer 1: 3 → 4
b1 = rng.normal(0, 0.1, size=(4,))
W2 = rng.normal(0, 0.5, size=(1, 4))   # layer 2: 4 → 1
b2 = np.zeros(1)

def sigmoid(z): return 1 / (1 + np.exp(-z))

def step(X, y, lr=0.3):
    global W1, b1, W2, b2
    # Forward (batched)
    z1 = X @ W1.T + b1                                 # (N, 4)
    a1 = np.maximum(0, z1)
    z2 = a1 @ W2.T + b2                                # (N, 1)
    yhat = sigmoid(z2).ravel()
    loss = 0.5 * np.mean((yhat - y) ** 2)

    # Backward — apply the recurrence in matrix form
    delta2 = ((yhat - y) * yhat * (1 - yhat)).reshape(-1, 1) / len(X)  # (N, 1)
    dW2 = delta2.T @ a1                                # (1, 4)
    db2 = delta2.sum(axis=0)                           # (1,)

    delta1 = (delta2 @ W2) * (z1 > 0)                  # (N, 4) · ReLU'
    dW1 = delta1.T @ X                                 # (4, 3)
    db1 = delta1.sum(axis=0)                           # (4,)

    # Update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    return loss
pure python → numpy
for i: for j: dW1[i][j] = d1[i] * x[j]   ←→   dW1 = delta1.T @ X
  nested loops collapse into one transpose-matmul
z1[j] > 0   ←→   (z1 > 0).astype(float)
  the ReLU derivative — an elementwise mask
per-example loop   ←→   batch · broadcast · done
  NumPy handles the N dimension for free

layer 3 — pytorch · mlp_backprop_pytorch.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 4)
        self.fc2 = nn.Linear(4, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))

model = MLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)

x = torch.tensor([[0.6, -0.4, 0.9]])
y = torch.tensor([[1.0]])

for step in range(1001):
    optimizer.zero_grad()
    yhat = model(x)
    loss = 0.5 * (yhat - y).pow(2).mean()
    loss.backward()                      # autograd runs the full backward recurrence
    optimizer.step()
    if step in (0, 200, 600):
        print(f"step {step}: loss = {loss.item():.4f}")
stdout
step 0: loss = 0.1762
step 200: loss = 0.0193
step 600: loss = 0.0024
numpy → pytorch
delta2 = (yhat - y) * yhat * (1 - yhat)   ←→   loss.backward()  # one call
  autograd walks the recurrence for every parameter
dW1 = delta1.T @ X ; dW2 = delta2.T @ a1   ←→   stored in .grad automatically
  you never write the outer product
transpose + matmul dance   ←→   model.parameters() — just the weights to update
  the actual gradient flow is invisible

Build a 2-layer MLP that solves XOR

Take the pure-Python two-layer code above. Set the architecture to 2 → 2 → 1 (two inputs, two ReLU hidden units, one sigmoid output). Train on XOR — [(0,0,0), (0,1,1), (1,0,1), (1,1,0)] — for 5000 steps at lr = 0.5. It should hit loss below 0.01. Print the hidden-layer weights at the end — interpret them. Do the two hidden neurons look like they learned to detect “exactly one input is 1”?

This is the smallest network that solves XOR. You're repeating Rumelhart, Hinton, and Williams' 1986 result on a laptop in a dozen lines of Python. That paper changed the course of AI.

What to carry forward. Multi-layer backprop is one equation repeated: δ_ℓ = (W_(ℓ+1)ᵀ · δ_(ℓ+1)) ⊙ f'(z_ℓ). A CEO with a complaint, an org chart walking blame downward, every manager's share proportional to how much they moved things. Parameter gradients fall out as outer products. Total cost equals the forward pass. And the dirty secret: the same product of Jacobians that makes the algorithm cheap is also what makes deep networks fragile — blame collapses to zero or explodes to infinity as the chain grows. The widget curves are that story made visible, and they're the whole reason gradient descent in a deep net isn't automatically easy.

Next up — Backprop Ninja. The theory holds. Now do the derivation yourself, on paper, for a two-layer MLP — every δ, every outer product, every elementwise mask — and we'll check your work with finite differences. If you got off by one transpose anywhere in this lesson, ninja will catch it. Bring a pencil.
