Backpropagation

The chain rule, made mechanical.

Medium · ~15 min read · lesson 2 of 6

You have a neuron. It takes an input, multiplies by weights, adds a bias, squashes the result through a non-linearity, spits out a prediction. You already know what to do next: nudge the weights to make the loss smaller. The only thing missing is the nudge direction — the partial derivative of the loss with respect to every parameter in the model.

For one neuron with three weights, you could derive those by hand in five minutes. For a network with a hundred million parameters, you cannot. You'd still be doing calculus when the sun goes out. What you need is an algorithm — mechanical, efficient, correct — that computes every derivative in a single sweep.

That algorithm is backpropagation, and the way to understand it is to remember the plot of Memento. The hero wakes up at the end of a day and something has gone wrong. He can't rewind. What he can do is walk through what happened in reverse, stopping at each moment to ask: how much did this step contribute to where I ended up? The answers pile up on Polaroid notes — small, local, one per moment. That's backprop. The forward pass is the day. The loss is the ending. The chain rule is how you read the day backwards.

Start small, because the trick is the same at every scale. Take one function composed of other functions — say L = sin((3x + 2)²). You want dL/dx. You know how to do this by hand: introduce names for the intermediates, take each local derivative, multiply them in a chain. That's the chain rule. It is the whole algorithm. Everything after this is bookkeeping.

chain rule, scalar form
Let u = 3x + 2,   v = u²,   L = sin v.

dL         dL     dv     du
──    =    ──  ·  ──  ·  ──
dx         dv     du     dx

         = cos v  ·  2u  ·  3

Three local derivatives — one per operation — multiplied together. Each factor is a Polaroid: the slope of this step with respect to the step before it. None of them know the whole story. They don't have to. Line the notes up in order, multiply, and you've reconstructed the slope of the entire composition.
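The Polaroids are easy to check as code. This sketch (at x = 0.4, the same value the walkthrough below starts from) multiplies the three local derivatives and compares the result against a brute-force finite-difference slope:

```python
import math

x = 0.4
u = 3 * x + 2            # affine
v = u ** 2               # square
L = math.sin(v)          # loss

# three local derivatives, one per operation
du_dx = 3
dv_du = 2 * u
dL_dv = math.cos(v)

dL_dx = dL_dv * dv_du * du_dx    # chain rule: ≈ -13.1656

# sanity check: a central finite difference agrees
h = 1e-6
numeric = (math.sin((3 * (x + h) + 2) ** 2)
           - math.sin((3 * (x - h) + 2) ** 2)) / (2 * h)
assert abs(dL_dx - numeric) < 1e-4
```

Same three factors as the equation above, same product; the finite difference knows nothing about the chain rule and lands on the same number anyway.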

Slide x below. Forward values (cyan) flow left to right — that's the day happening. Backward gradients (rose) accumulate right to left — that's you reading it backwards, Polaroid by Polaroid.

chain rule — a scalar walkthrough (at x = 0.4)
L = sin( (3x + 2)² )

forward values (left → right):
    x = 0.4000                 (input)
    u = 3x + 2  =  3.2000      (affine)
    v = u²      = 10.2400      (square)
    L = sin(v)  = -0.7279      (loss)

local derivative at each node:
    du/dx = 3
    dv/du = 2u    =  6.4000
    dL/dv = cos v = -0.6857

backward gradients (right → left):
    dL/dL = 1.0000
    dL/dv = -0.6857
    dL/du = dL/dv · dv/du = -4.3885
    dL/dx = -13.1656

the rule, restated:
    dL/dx = dL/dv · dv/du · du/dx
          = -0.6857 · 6.4000 · 3 = -13.1656
Chain rule (personified)
Take the derivative of the outside, times the derivative of the inside. Do it again, recursively, until you're at the variable you wanted. I am simple. I am local. I do not know how deep your network is and I do not care. A thousand tiny local derivatives multiplied together is still just multiplication.

Generalise. Most real functions aren't a simple composition — they're a graph. One intermediate value might feed into several later nodes; gradients from different paths have to be summed. The chain rule still works; you just have to keep the graph structure honest. That structure is the computation graph, and every deep-learning framework — PyTorch, TensorFlow, JAX — builds one invisibly every time you run a forward pass. They aren't doing anything clever. They're taking notes.
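The summing rule is the only genuinely new ingredient, and it fits in a few lines. In this toy graph (my example, not one from the widgets), x feeds two different paths that later meet, so its gradient is the sum of both paths' contributions:

```python
x = 2.0

# x feeds two paths: a = x² and b = 3x, which meet at L = a + b
a = x ** 2
b = 3 * x
L = a + b            # 10.0

# backward: each path carries its own chain of local derivatives,
# and the contributions arriving at the shared node x are SUMMED
dL_da = 1.0          # ∂(a + b)/∂a
dL_db = 1.0          # ∂(a + b)/∂b
dL_dx = dL_da * 2 * x + dL_db * 3    # 4 + 3 = 7

assert dL_dx == 7.0
```

Drop either path and you get the wrong slope; that sum is exactly what "keeping the graph structure honest" means.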

Click next step. The forward pass fills in values node by node, left to right — that's the day. Once the loss L exists, the backward pass starts from dL/dL = 1 and rewinds, each edge contributing its own Polaroid — the local derivative of what this node does to what flowed through it.

computation graph — forward values + backward gradients
L = (a·b + c) · d,   with a = 2, b = 3, c = -1, d = 4
[interactive widget, 8 steps: the forward pass fills values left → right, then the backward pass fills gradients right → left]
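You can replay that graph in plain Python: forward left to right, then backward right to left, one local derivative per edge. The multiply node hands each input the other input's value, because ∂(s·d)/∂s = d and ∂(s·d)/∂d = s; the add node passes gradients through unchanged.

```python
a, b, c, d = 2.0, 3.0, -1.0, 4.0

# forward pass: left to right
ab = a * b            # 6.0
s = ab + c            # 5.0
L = s * d             # 20.0

# backward pass: right to left, starting from dL/dL = 1
dL_dL = 1.0
dL_ds = dL_dL * d     # 4.0,  since ∂(s·d)/∂s = d
dL_dd = dL_dL * s     # 5.0,  since ∂(s·d)/∂d = s
dL_dab = dL_ds * 1.0  # 4.0,  since ∂(ab + c)/∂(ab) = 1
dL_dc = dL_ds * 1.0   # 4.0
dL_da = dL_dab * b    # 12.0
dL_db = dL_dab * a    # 8.0
```

That is the whole algorithm; frameworks differ only in how they record the graph, not in what they do with it.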

Time to do this on a neuron. One neuron, three weights, one bias, sigmoid activation, MSE loss. Forward pass on top; each backward line is one Polaroid, placed in the order you read them.

backprop through a single sigmoid neuron
Forward:      z  =  w·x + b
              ŷ  =  σ(z)
              L  =  ½ (ŷ − y)²

Backward:     dL/dŷ   =   ŷ − y
              dŷ/dz   =   σ(z) · (1 − σ(z))
              dL/dz   =   (dL/dŷ) · (dŷ/dz)

              dL/dwᵢ  =   (dL/dz) · xᵢ
              dL/db   =    dL/dz

Read top to bottom. Each row uses the one above it — you cannot read page three of the ledger before page two has been written. The last two lines are the ones you actually update with: both are multiples of dL/dz, because the chain rule kept the shared intermediate around for you. That shared quantity has a name (δ, the node's error signal), and the rest of the deep-learning literature won't shut up about it.

Hit one step in the widget. Every number updates live — weights drift toward the target y=1, the loss collapses. The neuron is learning, which is a fancier way of saying: it's reading its own mistakes backwards and editing itself.

one full step of backprop + gradient descent
ŷ = σ(w·x + b),   L = ½(ŷ − y)²,   with x = [0.80, -0.40, 1.20], target y = 1.00, lr = 0.50

forward pass:
    x₁·w₁ =  0.80 ·  0.200 =  0.160
    x₂·w₂ = -0.40 · -0.100 =  0.040
    x₃·w₃ =  1.20 ·  0.300 =  0.360
    z = 0.6600
    ŷ = σ(z) = 0.6593
    L = 0.0581

backward pass:
    dL/dŷ  = ŷ − y            = -0.3407
    dŷ/dz  = σ(z)(1 − σ(z))   =  0.2246
    dL/dz  = (dL/dŷ)(dŷ/dz)   = -0.0765
    dL/dw₁ = (dL/dz)·x₁       = -0.0612
    dL/dw₂ = (dL/dz)·x₂       =  0.0306
    dL/dw₃ = (dL/dz)·x₃       = -0.0919
    dL/db  = dL/dz            = -0.0765

update (applied when you hit "one step"):
    w₁:  0.200 − 0.50 · -0.0612  →  0.231
    w₂: -0.100 − 0.50 ·  0.0306  → -0.115
    w₃:  0.300 − 0.50 · -0.0919  →  0.346
    b :  0.100 − 0.50 · -0.0765  →  0.138
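Every number in that table can be reproduced in a dozen lines of plain Python, same inputs, same learning rate:

```python
import math

x, w, b = [0.8, -0.4, 1.2], [0.2, -0.1, 0.3], 0.1
y, lr = 1.0, 0.5

# forward
z = sum(xi * wi for xi, wi in zip(x, w)) + b      # 0.6600
yhat = 1.0 / (1.0 + math.exp(-z))                 # 0.6593
loss = 0.5 * (yhat - y) ** 2                      # 0.0581

# backward
dL_dz = (yhat - y) * yhat * (1 - yhat)            # -0.0765
dL_dw = [dL_dz * xi for xi in x]                  # [-0.0612, 0.0306, -0.0919]
dL_db = dL_dz                                     # -0.0765

# update: w ← w − lr·∇w
w = [wi - lr * gi for wi, gi in zip(w, dL_dw)]    # [0.231, -0.115, 0.346]
b = b - lr * dL_db                                # 0.138
```

One forward pass, one backward pass, one update. Everything below is this, repeated.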
Gotchas

Forward pass first, always. Backprop reads intermediates (z, ŷ) that only exist after the forward pass wrote them down. No notes, no rewind. In PyTorch, loss doesn't even exist until a forward pass has produced it, and calling .backward() on a tensor autograd never recorded raises an error.

Gradients accumulate unless you zero them. By default PyTorch adds every backward() call's gradients onto .grad. That's deliberate — useful for accumulating gradients across micro-batches — and a footgun for everyone else. Always call optimizer.zero_grad() before your backward pass. Forgetting this is the second most common PyTorch bug, and the first most common one it took you two days to find.

“With respect to what” matters. PyTorch only computes gradients for tensors with requires_grad=True. If your weights don't have it, no gradient is stored and the optimizer sees zeros — silent failure, model doesn't learn, you blame the data. nn.Parameter and nn.Linear set this for you, but only inside an nn.Module.
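The accumulation gotcha is easy to reproduce in isolation: two backward passes through the same tiny graph, with no zeroing in between.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()
print(w.grad)        # tensor([2.])

(2 * w).sum().backward()
print(w.grad)        # tensor([4.]) : the second gradient was ADDED, not assigned

w.grad.zero_()       # what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor([0.])
```

If the second print surprises you, that's the footgun: .grad is a running total, not a snapshot.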

Three layers. Pure Python spells every multiplication out — there's nowhere for a bug to hide. NumPy vectorises it across a batch. PyTorch lets you write only the forward pass; autograd reads the day backwards for you.

layer 1 — pure python · one_neuron_backprop.py
python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def train_one_neuron(x, target, lr=0.5, steps=50):
    w = [0.2, -0.1, 0.3]
    b = 0.1
    for step in range(steps + 1):
        # Forward
        z = sum(xi * wi for xi, wi in zip(x, w)) + b
        yhat = sigmoid(z)
        loss = 0.5 * (yhat - target) ** 2

        # Backward
        dL_dyhat = yhat - target                # d(½(yhat-t)²)/dyhat
        dyhat_dz = yhat * (1 - yhat)            # σ'(z)
        dL_dz = dL_dyhat * dyhat_dz
        dL_dw = [dL_dz * xi for xi in x]        # one gradient per weight
        dL_db = dL_dz

        if step % 20 == 0:
            print(f"step {step}: loss={loss:.4f}")

        # Update
        w = [wi - lr * gi for wi, gi in zip(w, dL_dw)]
        b = b - lr * dL_db
    return w, b, yhat

w, b, yhat = train_one_neuron([0.8, -0.4, 1.2], target=1.0)
print(f"final yhat = {yhat:.4f}")
stdout
step 0: loss=0.0581
step 20: loss=0.0097
step 40: loss=0.0049
final yhat = 0.9122
layer 2 — numpy · one_neuron_backprop_numpy.py
python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def train_one_neuron(X, y, lr=0.5, steps=50):
    """X: (N, D)   y: (N,)"""
    N, D = X.shape
    w = np.array([0.2, -0.1, 0.3])
    b = 0.1
    for _ in range(steps):
        # Forward — whole batch at once
        z = X @ w + b                           # (N,)
        yhat = sigmoid(z)                       # (N,)

        # Backward
        dL_dyhat = (yhat - y) / N               # (N,)  — mean over batch
        dyhat_dz = yhat * (1 - yhat)            # (N,)
        dL_dz = dL_dyhat * dyhat_dz             # (N,)

        dL_dw = X.T @ dL_dz                     # (D,)  — chain rule for free
        dL_db = dL_dz.sum()

        w = w - lr * dL_dw
        b = b - lr * dL_db
    return w, b
pure python → numpy

dL_dw = [dL_dz * xi for xi in x]  ←→  dL_dw = X.T @ dL_dz
    the outer loop over weights becomes a matrix transpose

one example  ←→  a whole batch, one call
    gradients average across the batch automatically

sigmoid grad = yhat * (1 - yhat)  ←→  same — NumPy broadcasts over the batch
    the local derivative formula is unchanged
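If `X.T @ dL_dz` feels like a magic trick, expand it: it is the pure-Python per-weight loop with an extra sum over the batch. A two-example check (the numbers here are made up for illustration):

```python
import numpy as np

X = np.array([[0.8, -0.4, 1.2],      # example 0
              [0.5,  0.1, -0.3]])    # example 1
dL_dz = np.array([-0.0765, 0.0421])  # one upstream gradient per example

# loop form: for each weight i, sum each example's contribution
loop = [sum(dL_dz[n] * X[n, i] for n in range(2)) for i in range(3)]

# matrix form: the same sums, one call
mat = X.T @ dL_dz

assert np.allclose(loop, mat)
```

Same arithmetic, different bookkeeping; the transpose just lines the sums up.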

layer 3 — pytorch · one_neuron_backprop_pytorch.py
python
import torch
import torch.nn as nn

class OneNeuron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

model = OneNeuron()
# Hand-set the same starting params as in layer 1 for demo parity.
with torch.no_grad():
    model.fc.weight.copy_(torch.tensor([[0.2, -0.1, 0.3]]))
    model.fc.bias.copy_(torch.tensor([0.1]))

x = torch.tensor([[0.8, -0.4, 1.2]])
y = torch.tensor([[1.0]])
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(51):
    optimizer.zero_grad()            # zero the gradient buffer
    yhat = model(x)                  # forward
    loss = 0.5 * (yhat - y).pow(2).mean()
    loss.backward()                  # autograd runs the chain rule
    optimizer.step()                 # apply w ← w - α·∇L
    if step % 20 == 0:
        print(f"step {step}: loss={loss.item():.4f}")

print(f"final yhat = {model(x).item():.4f}")
stdout
step 0: loss=0.0581
step 20: loss=0.0097
step 40: loss=0.0049
final yhat = 0.9131
numpy → pytorch

dL_dz = (yhat - y) * yhat * (1 - yhat)  ←→  loss.backward()
    autograd computes identical values from the forward expression

w = w - lr * dL_dw  ←→  optimizer.step()
    we already knew this — now the gradients are autograd-computed

grads reset implicitly each step  ←→  optimizer.zero_grad()
    PyTorch accumulates by default — be explicit about zeroing
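"Identical values" is checkable. On the walkthrough's exact inputs, autograd's gradients match the hand-derived formulas to floating-point precision:

```python
import torch

x = torch.tensor([0.8, -0.4, 1.2])
w = torch.tensor([0.2, -0.1, 0.3], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y = 1.0

yhat = torch.sigmoid(w @ x + b)
loss = 0.5 * (yhat - y) ** 2
loss.backward()

# hand-derived chain: dL/dz = (ŷ − y) · σ(z)(1 − σ(z)), dL/dwᵢ = dL/dz · xᵢ
dL_dz = ((yhat - y) * yhat * (1 - yhat)).detach()

assert torch.allclose(w.grad, dL_dz * x)
assert torch.allclose(b.grad, dL_dz)
```

Autograd isn't doing different math; it's doing your math, mechanically.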

Backprop a 2-input XOR neuron, step by step

Using the pure-Python version above, train a neuron on the XOR dataset [(0,0,0), (0,1,1), (1,0,1), (1,1,0)]. Run 2000 steps. Print the loss every 200. It will not drop below about 0.125 — that's the ceiling one neuron can reach on XOR (75% accuracy). The chain rule is doing its job perfectly; the model is fundamentally too shallow. Backprop can't save a model that lacks the capacity to express the answer.
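A starting harness, if you want one (a sketch: the initial weights, and averaging the loss and gradients over the four points, are my choices, not prescribed above):

```python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w, b, lr = [0.1, -0.1], 0.0, 0.5

for step in range(2001):
    grad_w, grad_b, loss = [0.0, 0.0], 0.0, 0.0
    for x, y in data:                 # accumulate mean loss + mean gradients
        z = w[0] * x[0] + w[1] * x[1] + b
        yhat = sigmoid(z)
        loss += 0.5 * (yhat - y) ** 2 / len(data)
        dL_dz = (yhat - y) * yhat * (1 - yhat) / len(data)
        grad_w[0] += dL_dz * x[0]
        grad_w[1] += dL_dz * x[1]
        grad_b += dL_dz
    if step % 200 == 0:
        print(f"step {step}: loss={loss:.4f}")
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b
```

Watch the printed losses flatline near 0.125 no matter how long you run it — that plateau is the point of the exercise.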

Bonus: stack a hidden layer of two neurons and an output neuron, and redo it from scratch with pure-Python backprop. It gets ugly fast. Every line of the next lesson exists because this exercise becomes unbearable at more than one layer.

What to carry forward. Backprop is the chain rule, applied backwards through a computation graph. Forward pass writes the ledger: values and intermediates. Backward pass reads it in reverse, multiplying local derivatives edge by edge. Every parameter gradient turns out to be δ at its node times a trivial local quantity, and computing every δ costs about the same as the forward pass itself — linear in the size of the graph. That efficiency is why deep learning is computationally practical: estimating each parameter's gradient separately, say by finite differences, would take one extra forward pass per parameter, millions of forward passes per update.

Next up — Multi-Layer Backpropagation. One layer, one walk backwards. Real networks are ten, twenty, a hundred layers deep. Every additional layer is another page in the ledger — and the recursion “δ at layer ℓ equals δ at layer ℓ+1 times the local derivative” is the core equation of deep learning, the thing that makes arbitrarily deep networks trainable in the first place.
