Backpropagation

The chain rule, made mechanical.

Medium · ~15 min read · lesson 2 of 6

You have a neuron. It takes an input, multiplies by weights, adds a bias, squashes the result through a non-linearity, spits out a prediction. You already know what to do next: nudge the weights to make the loss smaller. The only thing missing is the nudge direction — the partial derivative of the loss with respect to every parameter in the model.

For one neuron with three weights, you could derive those by hand in five minutes. For a network with a hundred million parameters, you cannot. You'd still be doing calculus when the sun goes out. What you need is an algorithm — mechanical, efficient, correct — that computes every derivative in a single sweep.

That algorithm is backpropagation, and the way to understand it is to remember the plot of Memento. The hero wakes up at the end of a day and something has gone wrong. He can't rewind. What he can do is walk through what happened in reverse, stopping at each moment to ask: how much did this step contribute to where I ended up? The answers pile up on Polaroid notes — small, local, one per moment. That's backprop. The forward pass is the day. The loss is the ending. The chain rule is how you read the day backwards.

Start small, because the trick is the same at every scale. Take one function composed of other functions — say L = sin((3x + 2)²). You want dL/dx. You know how to do this by hand: introduce names for the intermediates, take each local derivative, multiply them in a chain. That's the chain rule. It is the whole algorithm. Everything after this is bookkeeping.

chain rule, scalar form
Let u = 3x + 2,   v = u²,   L = sin v.

dL         dL     dv     du
──    =    ──  ·  ──  ·  ──
dx         dv     du     dx

         = cos v  ·  2u  ·  3

Three local derivatives — one per operation — multiplied together. Each factor is a Polaroid: the slope of this step with respect to the step before it. None of them know the whole story. They don't have to. Line the notes up in order, multiply, and you've reconstructed the slope of the entire composition.
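The Polaroids are easy to check as code. This sketch (at x = 0.4, the same value the walkthrough below starts from) multiplies the three local derivatives and compares the result against a brute-force finite-difference slope:

```python
import math

x = 0.4
u = 3 * x + 2            # affine
v = u ** 2               # square
L = math.sin(v)          # loss

# three local derivatives, one per operation
du_dx = 3
dv_du = 2 * u
dL_dv = math.cos(v)

dL_dx = dL_dv * dv_du * du_dx    # chain rule: ≈ -13.1656

# sanity check: a central finite difference agrees
h = 1e-6
numeric = (math.sin((3 * (x + h) + 2) ** 2)
           - math.sin((3 * (x - h) + 2) ** 2)) / (2 * h)
assert abs(dL_dx - numeric) < 1e-4
```

Same three factors as the equation above, same product; the finite difference knows nothing about the chain rule and lands on the same number anyway.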

Slide x below. Forward values (cyan) flow left to right — that's the day happening. Backward gradients (rose) accumulate right to left — that's you reading it backwards, Polaroid by Polaroid.

chain rule — a scalar walkthrough (at x = 0.4)
L = sin( (3x + 2)² )

forward values (left → right):
    x = 0.4000                 (input)
    u = 3x + 2  =  3.2000      (affine)
    v = u²      = 10.2400      (square)
    L = sin(v)  = -0.7279      (loss)

local derivative at each node:
    du/dx = 3
    dv/du = 2u    =  6.4000
    dL/dv = cos v = -0.6857

backward gradients (right → left):
    dL/dL = 1.0000
    dL/dv = -0.6857
    dL/du = dL/dv · dv/du = -4.3885
    dL/dx = -13.1656

the rule, restated:
    dL/dx = dL/dv · dv/du · du/dx
          = -0.6857 · 6.4000 · 3 = -13.1656
Chain rule (personified)
Take the derivative of the outside, times the derivative of the inside. Do it again, recursively, until you're at the variable you wanted. I am simple. I am local. I do not know how deep your network is and I do not care. A thousand tiny local derivatives multiplied together is still just multiplication.

Generalise. Most real functions aren't a simple composition — they're a graph. One intermediate value might feed into several later nodes; gradients from different paths have to be summed. The chain rule still works; you just have to keep the graph structure honest. That structure is the computation graph, and every deep-learning framework — PyTorch, TensorFlow, JAX — builds one invisibly every time you run a forward pass. They aren't doing anything clever. They're taking notes.
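The summing rule is the only genuinely new ingredient, and it fits in a few lines. In this toy graph (my example, not one from the widgets), x feeds two different paths that later meet, so its gradient is the sum of both paths' contributions:

```python
x = 2.0

# x feeds two paths: a = x² and b = 3x, which meet at L = a + b
a = x ** 2
b = 3 * x
L = a + b            # 10.0

# backward: each path carries its own chain of local derivatives,
# and the contributions arriving at the shared node x are SUMMED
dL_da = 1.0          # ∂(a + b)/∂a
dL_db = 1.0          # ∂(a + b)/∂b
dL_dx = dL_da * 2 * x + dL_db * 3    # 4 + 3 = 7

assert dL_dx == 7.0
```

Drop either path and you get the wrong slope; that sum is exactly what "keeping the graph structure honest" means.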

Click next step. The forward pass fills in values node by node, left to right — that's the day. Once the loss L exists, the backward pass starts from dL/dL = 1 and rewinds, each edge contributing its own Polaroid — the local derivative of what this node does to what flowed through it.

computation graph — forward values + backward gradients
L = (a·b + c) · d,   with a = 2, b = 3, c = -1, d = 4
[interactive widget, 8 steps: the forward pass fills values left → right, then the backward pass fills gradients right → left]
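You can replay that graph in plain Python: forward left to right, then backward right to left, one local derivative per edge. The multiply node hands each input the other input's value, because ∂(s·d)/∂s = d and ∂(s·d)/∂d = s; the add node passes gradients through unchanged.

```python
a, b, c, d = 2.0, 3.0, -1.0, 4.0

# forward pass: left to right
ab = a * b            # 6.0
s = ab + c            # 5.0
L = s * d             # 20.0

# backward pass: right to left, starting from dL/dL = 1
dL_dL = 1.0
dL_ds = dL_dL * d     # 4.0,  since ∂(s·d)/∂s = d
dL_dd = dL_dL * s     # 5.0,  since ∂(s·d)/∂d = s
dL_dab = dL_ds * 1.0  # 4.0,  since ∂(ab + c)/∂(ab) = 1
dL_dc = dL_ds * 1.0   # 4.0
dL_da = dL_dab * b    # 12.0
dL_db = dL_dab * a    # 8.0
```

That is the whole algorithm; frameworks differ only in how they record the graph, not in what they do with it.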

Time to do this on a neuron. One neuron, three weights, one bias, sigmoid activation, MSE loss. Forward pass on top; each backward line is one Polaroid, placed in the order you read them.

backprop through a single sigmoid neuron
Forward:      z  =  w·x + b
              ŷ  =  σ(z)
              L  =  ½ (ŷ − y)²

Backward:     dL/dŷ   =   ŷ − y
              dŷ/dz   =   σ(z) · (1 − σ(z))
              dL/dz   =   (dL/dŷ) · (dŷ/dz)

              dL/dwᵢ  =   (dL/dz) · xᵢ
              dL/db   =    dL/dz

Read top to bottom. Each row uses the one above it — you cannot read page three of the ledger before page two has been written. The last two lines are the ones you actually update with: both are multiples of dL/dz, because the chain rule kept the shared intermediate around for you. That shared quantity has a name (δ, the node's error signal), and the rest of the deep-learning literature won't shut up about it.

Hit one step in the widget. Every number updates live — weights drift toward the target y=1, the loss collapses. The neuron is learning, which is a fancier way of saying: it's reading its own mistakes backwards and editing itself.

one full step of backprop + gradient descent
ŷ = σ(w·x + b),   L = ½(ŷ − y)²,   with x = [0.80, -0.40, 1.20], target y = 1.00, lr = 0.50

forward pass:
    x₁·w₁ =  0.80 ·  0.200 =  0.160
    x₂·w₂ = -0.40 · -0.100 =  0.040
    x₃·w₃ =  1.20 ·  0.300 =  0.360
    z = 0.6600
    ŷ = σ(z) = 0.6593
    L = 0.0581

backward pass:
    dL/dŷ  = ŷ − y            = -0.3407
    dŷ/dz  = σ(z)(1 − σ(z))   =  0.2246
    dL/dz  = (dL/dŷ)(dŷ/dz)   = -0.0765
    dL/dw₁ = (dL/dz)·x₁       = -0.0612
    dL/dw₂ = (dL/dz)·x₂       =  0.0306
    dL/dw₃ = (dL/dz)·x₃       = -0.0919
    dL/db  = dL/dz            = -0.0765

update (applied when you hit "one step"):
    w₁:  0.200 − 0.50 · -0.0612  →  0.231
    w₂: -0.100 − 0.50 ·  0.0306  → -0.115
    w₃:  0.300 − 0.50 · -0.0919  →  0.346
    b :  0.100 − 0.50 · -0.0765  →  0.138
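Every number in that table can be reproduced in a dozen lines of plain Python, same inputs, same learning rate:

```python
import math

x, w, b = [0.8, -0.4, 1.2], [0.2, -0.1, 0.3], 0.1
y, lr = 1.0, 0.5

# forward
z = sum(xi * wi for xi, wi in zip(x, w)) + b      # 0.6600
yhat = 1.0 / (1.0 + math.exp(-z))                 # 0.6593
loss = 0.5 * (yhat - y) ** 2                      # 0.0581

# backward
dL_dz = (yhat - y) * yhat * (1 - yhat)            # -0.0765
dL_dw = [dL_dz * xi for xi in x]                  # [-0.0612, 0.0306, -0.0919]
dL_db = dL_dz                                     # -0.0765

# update: w ← w − lr·∇w
w = [wi - lr * gi for wi, gi in zip(w, dL_dw)]    # [0.231, -0.115, 0.346]
b = b - lr * dL_db                                # 0.138
```

One forward pass, one backward pass, one update. Everything below is this, repeated.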
Gotchas

Forward pass first, always. Backprop reads intermediates (z, ŷ) that only exist after the forward pass wrote them down. No notes, no rewind. In PyTorch, loss doesn't even exist until a forward pass has produced it, and calling .backward() on a tensor autograd never recorded raises an error.

Gradients accumulate unless you zero them. By default PyTorch adds every backward() call's gradients onto .grad. That's deliberate — useful for accumulating gradients across micro-batches — and a footgun for everyone else. Always call optimizer.zero_grad() before your backward pass. Forgetting this is the second most common PyTorch bug, and the first most common one it took you two days to find.

“With respect to what” matters. PyTorch only computes gradients for tensors with requires_grad=True. If your weights don't have it, no gradient is stored and the optimizer sees zeros — silent failure, model doesn't learn, you blame the data. nn.Parameter and nn.Linear set this for you, but only inside an nn.Module.
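The accumulation gotcha is easy to reproduce in isolation: two backward passes through the same tiny graph, with no zeroing in between.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()
print(w.grad)        # tensor([2.])

(2 * w).sum().backward()
print(w.grad)        # tensor([4.]) : the second gradient was ADDED, not assigned

w.grad.zero_()       # what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor([0.])
```

If the second print surprises you, that's the footgun: .grad is a running total, not a snapshot.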

Three layers. Pure Python spells every multiplication out — there's nowhere for a bug to hide. NumPy vectorises it across a batch. PyTorch lets you write only the forward pass; autograd reads the day backwards for you.

layer 1 — pure python · one_neuron_backprop.py
python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def train_one_neuron(x, target, lr=0.5, steps=50):
    w = [0.2, -0.1, 0.3]
    b = 0.1
    for step in range(steps + 1):
        # Forward
        z = sum(xi * wi for xi, wi in zip(x, w)) + b
        yhat = sigmoid(z)
        loss = 0.5 * (yhat - target) ** 2

        # Backward
        dL_dyhat = yhat - target                # d(½(yhat-t)²)/dyhat
        dyhat_dz = yhat * (1 - yhat)            # σ'(z)
        dL_dz = dL_dyhat * dyhat_dz
        dL_dw = [dL_dz * xi for xi in x]        # one gradient per weight
        dL_db = dL_dz

        if step % 20 == 0:
            print(f"step {step}: loss={loss:.4f}")

        # Update
        w = [wi - lr * gi for wi, gi in zip(w, dL_dw)]
        b = b - lr * dL_db
    return w, b, yhat

w, b, yhat = train_one_neuron([0.8, -0.4, 1.2], target=1.0)
print(f"final yhat = {yhat:.4f}")
stdout
step 0: loss=0.0581
step 20: loss=0.0097
step 40: loss=0.0049
final yhat = 0.9122
layer 2 — numpy · one_neuron_backprop_numpy.py
python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def train_one_neuron(X, y, lr=0.5, steps=50):
    """X: (N, D)   y: (N,)"""
    N, D = X.shape
    w = np.array([0.2, -0.1, 0.3])
    b = 0.1
    for _ in range(steps):
        # Forward — whole batch at once
        z = X @ w + b                           # (N,)
        yhat = sigmoid(z)                       # (N,)

        # Backward
        dL_dyhat = (yhat - y) / N               # (N,)  — mean over batch
        dyhat_dz = yhat * (1 - yhat)            # (N,)
        dL_dz = dL_dyhat * dyhat_dz             # (N,)

        dL_dw = X.T @ dL_dz                     # (D,)  — chain rule for free
        dL_db = dL_dz.sum()

        w = w - lr * dL_dw
        b = b - lr * dL_db
    return w, b
pure python → numpy

dL_dw = [dL_dz * xi for xi in x]  ←→  dL_dw = X.T @ dL_dz
    the outer loop over weights becomes a matrix transpose

one example  ←→  a whole batch, one call
    gradients average across the batch automatically

sigmoid grad = yhat * (1 - yhat)  ←→  same — NumPy broadcasts over the batch
    the local derivative formula is unchanged
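If `X.T @ dL_dz` feels like a magic trick, expand it: it is the pure-Python per-weight loop with an extra sum over the batch. A two-example check (the numbers here are made up for illustration):

```python
import numpy as np

X = np.array([[0.8, -0.4, 1.2],      # example 0
              [0.5,  0.1, -0.3]])    # example 1
dL_dz = np.array([-0.0765, 0.0421])  # one upstream gradient per example

# loop form: for each weight i, sum each example's contribution
loop = [sum(dL_dz[n] * X[n, i] for n in range(2)) for i in range(3)]

# matrix form: the same sums, one call
mat = X.T @ dL_dz

assert np.allclose(loop, mat)
```

Same arithmetic, different bookkeeping; the transpose just lines the sums up.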

layer 3 — pytorch · one_neuron_backprop_pytorch.py
python
import torch
import torch.nn as nn

class OneNeuron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

model = OneNeuron()
# Hand-set the same starting params as in layer 1 for demo parity.
with torch.no_grad():
    model.fc.weight.copy_(torch.tensor([[0.2, -0.1, 0.3]]))
    model.fc.bias.copy_(torch.tensor([0.1]))

x = torch.tensor([[0.8, -0.4, 1.2]])
y = torch.tensor([[1.0]])
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(51):
    optimizer.zero_grad()            # zero the gradient buffer
    yhat = model(x)                  # forward
    loss = 0.5 * (yhat - y).pow(2).mean()
    loss.backward()                  # autograd runs the chain rule
    optimizer.step()                 # apply w ← w - α·∇L
    if step % 20 == 0:
        print(f"step {step}: loss={loss.item():.4f}")

print(f"final yhat = {model(x).item():.4f}")
stdout
step 0: loss=0.0581
step 20: loss=0.0097
step 40: loss=0.0049
final yhat = 0.9131
numpy → pytorch

dL_dz = (yhat - y) * yhat * (1 - yhat)  ←→  loss.backward()
    autograd computes identical values from the forward expression

w = w - lr * dL_dw  ←→  optimizer.step()
    we already knew this — now the gradients are autograd-computed

grads reset implicitly each step  ←→  optimizer.zero_grad()
    PyTorch accumulates by default — be explicit about zeroing
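"Identical values" is checkable. On the walkthrough's exact inputs, autograd's gradients match the hand-derived formulas to floating-point precision:

```python
import torch

x = torch.tensor([0.8, -0.4, 1.2])
w = torch.tensor([0.2, -0.1, 0.3], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y = 1.0

yhat = torch.sigmoid(w @ x + b)
loss = 0.5 * (yhat - y) ** 2
loss.backward()

# hand-derived chain: dL/dz = (ŷ − y) · σ(z)(1 − σ(z)), dL/dwᵢ = dL/dz · xᵢ
dL_dz = ((yhat - y) * yhat * (1 - yhat)).detach()

assert torch.allclose(w.grad, dL_dz * x)
assert torch.allclose(b.grad, dL_dz)
```

Autograd isn't doing different math; it's doing your math, mechanically.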

Backprop a 2-input XOR neuron, step by step

Using the pure-Python version above, train a neuron on the XOR dataset [(0,0,0), (0,1,1), (1,0,1), (1,1,0)]. Run 2000 steps. Print the loss every 200. It will not drop below about 0.125 — that's the ceiling one neuron can reach on XOR (75% accuracy). The chain rule is doing its job perfectly; the model is fundamentally too shallow. Backprop can't save a model that lacks the capacity to express the answer.
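A starting harness, if you want one (a sketch: the initial weights, and averaging the loss and gradients over the four points, are my choices, not prescribed above):

```python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w, b, lr = [0.1, -0.1], 0.0, 0.5

for step in range(2001):
    grad_w, grad_b, loss = [0.0, 0.0], 0.0, 0.0
    for x, y in data:                 # accumulate mean loss + mean gradients
        z = w[0] * x[0] + w[1] * x[1] + b
        yhat = sigmoid(z)
        loss += 0.5 * (yhat - y) ** 2 / len(data)
        dL_dz = (yhat - y) * yhat * (1 - yhat) / len(data)
        grad_w[0] += dL_dz * x[0]
        grad_w[1] += dL_dz * x[1]
        grad_b += dL_dz
    if step % 200 == 0:
        print(f"step {step}: loss={loss:.4f}")
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b
```

Watch the printed losses flatline near 0.125 no matter how long you run it — that plateau is the point of the exercise.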

Bonus: stack a hidden layer of two neurons and an output neuron, and redo it from scratch with pure-Python backprop. It gets ugly fast. Every line of the next lesson exists because this exercise becomes unbearable at more than one layer.

What to carry forward. Backprop is the chain rule, applied backwards through a computation graph. Forward pass writes the ledger: values and intermediates. Backward pass reads it in reverse, multiplying local derivatives edge by edge. Every parameter gradient turns out to be δ at its node times a trivial local quantity, and computing every δ costs about the same as the forward pass itself — linear in the size of the graph. That efficiency is why deep learning is computationally practical: estimating each parameter's gradient separately, say by finite differences, would take one extra forward pass per parameter, millions of forward passes per update.

Next up — Multi-Layer Backpropagation. One layer, one walk backwards. Real networks are ten, twenty, a hundred layers deep. Every additional layer is another page in the ledger — and the recursion “δ at layer ℓ equals δ at layer ℓ+1 times the local derivative” is the core equation of deep learning, the thing that makes arbitrarily deep networks trainable in the first place.
