Backprop Through Time

Unroll the loop to compute gradients across a sequence.

Hard
~15 min read
lesson 2 of 5

Picture the unrolled movie walked backwards, frame by frame. You recorded an RNN chewing through a sequence — one hidden state per tick, left to right — and now you're scrubbing the tape in reverse. Pause. Rewind one frame. Ask: how did this moment push the ending to be wrong? Write that down. Rewind another frame. Same question. Keep going until the tape runs out. That's Backpropagation Through Time, and it is genuinely that plain — a backwards scrub through the movie, one frame of gradient per rewind.

Here's the move that makes the whole thing legal. Unroll the RNN and it stops being recurrent. It turns into a deep feedforward net with a single peculiarity: every frame of the movie shares the same weight matrix. Backprop doesn't need a new rulebook for time. It runs its usual backward recurrence across the unrolled graph, and at the end we remember that all the copies of W were really the same W and sum their gradients. Done.

The algorithm has a name — BPTT — and a history. Werbos described it in the 70s, named it in his 1990 paper, and every modern sequence model (RNN, LSTM, Transformer) still does some version of the same backwards scrub under the hood. The mechanics are easy. The interesting part is that walking the movie in reverse is an unstable act: multiply the same Jacobian dozens of times and the signal either vanishes to zero or explodes to infinity. This lesson gets the mechanics clean and foreshadows that instability — the next lesson looks it in the face.

First, the forward tape. At every time step the RNN mixes the previous hidden state with the current input, squashes, and emits:

the RNN cell — the same W at every t
h_t   =   tanh( W_h · h_(t-1)  +  W_x · x_t  +  b )
y_t   =   W_y · h_t  +  b_y

L     =   Σ_t  L_t(y_t, target_t)            sum loss over the sequence
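As a quick sanity check, the cell equations above fit in a few lines of numpy. A sketch (sizes and weight scales here are illustrative, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 3, 2                          # hidden size, input size (illustrative)
W_h = rng.normal(0, 0.3, (H, H))     # the same W_h reused at every tick
W_x = rng.normal(0, 0.3, (H, D))
W_y = rng.normal(0, 0.3, (H, H))
b, b_y = np.zeros(H), np.zeros(H)

def cell(h_prev, x_t):
    """One tick: mix previous state with input, squash, emit."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

h = np.zeros(H)                      # h_0
for x_t in rng.normal(size=(5, D)):  # five ticks, identical weights each time
    h, y = cell(h, x_t)
```

Note that tanh keeps every component of the hidden state inside (-1, 1), which is what makes the derivative bound |tanh'| ≤ 1 later in the lesson bite.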

Now unroll — lay the frames end to end. For T = 3 steps you get three copies of the cell stacked vertically, hidden-state edges threading from frame to frame, a loss sitting at the top (or one at every step, summed). Here's the detail that matters: every copy of W_h is the same parameter. When you ask for ∂L/∂W_h, you're asking for the derivative with respect to a single quantity that appears in every frame of the unrolled movie. The multivariate chain rule handles it with a rule so simple it sounds like cheating: sum the contributions from every frame where the weight shows up.

the chain rule across time — sum over all time steps
∂L/∂W_h   =   Σ_t   ∂L_t / ∂W_h                    total over steps

where each term expands (by chain rule through the chain of hidden states):

∂L_t / ∂W_h   =   Σ_{k ≤ t}   (∂L_t/∂h_t) · (∂h_t/∂h_k) · (∂h_k/∂W_h)_local

                  └─────────┘    └────────┘     └──────────────┘
                  gradient at    Jacobian       local derivative at step k,
                  the output     product        treating h_(k-1) as constant
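The sum rule is easy to verify with autograd. A sketch (PyTorch, illustrative sizes, with h.sum() standing in for the loss): run the same unrolled net once with a single shared W_h, then once with an untied copy of it per frame. The per-copy gradients sum to exactly the shared-weight gradient.

```python
import torch

torch.manual_seed(0)
H, T = 3, 4
W0 = torch.randn(H, H)
xs = [torch.randn(H) for _ in range(T)]

# One shared parameter, used in every frame
W_shared = W0.clone().requires_grad_()
h = torch.zeros(H)
for x in xs:
    h = torch.tanh(W_shared @ h + x)
h.sum().backward()                     # stand-in loss

# Untied: a separate leaf tensor per frame, same values
copies = [W0.clone().requires_grad_() for _ in range(T)]
h = torch.zeros(H)
for W_t, x in zip(copies, xs):
    h = torch.tanh(W_t @ h + x)
h.sum().backward()

summed = sum(c.grad for c in copies)   # Σ_t per-frame contribution
print(torch.allclose(W_shared.grad, summed, atol=1e-5))  # True
```

This is the "sounds like cheating" rule made concrete: autograd accumulates the shared-weight gradient for free, because each use of W_shared adds its contribution into the same .grad buffer.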

Stare at that middle term, ∂h_t/∂h_k. It's a product of per-step Jacobians: ∂h_t/∂h_(t-1) · ∂h_(t-1)/∂h_(t-2) · … · ∂h_(k+1)/∂h_k. Each one is a matrix. Multiply T - k of them together — one matrix per frame you rewound through — and that product is where every training pathology you'll ever see comes from. Park it. We'll come back.

Click through the backward pass below. Each click is one rewind of the movie — the loss gradient starts at the final frame T, walks back one frame to T-1 through the shared W_h, picks up that frame's local contribution to ∂L/∂W_h, and accumulates. When you reach t = 0 the readout at the bottom is the total gradient — the sum of every frame's contribution to the weight the whole movie shares.

backprop through time — click step to walk backward
[interactive demo: a 5-step unrolled tanh RNN with loss L = ½(h_T − y)², target y = 0.5. Each press of "step" rewinds one frame, moving the ∂L/∂h cursor (starting at ∂L/∂h_T = h_T − y) and updating the running ∂L/∂W_h accumulator until t = 0.]
Time step t (personified)
I'm one frame of the unrolled movie, somewhere in the middle. I look identical to every other frame — same weights, same activation. When the gradient rewinds to me, I do two things: I stash my local contribution to ∂L/∂W_h, and I hand the upstream gradient to frame t-1, modulated by my Jacobian. I don't know how far back I am. I just pass the signal along, and hope it survives the trip.

Zoom in on that Jacobian product. For an RNN with tanh activation,

the per-step Jacobian — and the product that kills everything
∂h_t/∂h_(t-1)   =   diag(tanh'(z_t))  ·  W_h         where z_t = W_h · h_(t-1) + W_x · x_t + b

Chain it across the whole sequence:

∂h_T/∂h_0   =   Π_{t=1..T}   diag(tanh'(z_t)) · W_h

Take the spectral norm (largest singular value):

‖ ∂h_T/∂h_0 ‖   ≤   Π_{t=1..T}   ‖diag(tanh'(z_t))‖ · ‖W_h‖

                   ≤   (1)^T  ·  σ_max(W_h)^T     since |tanh'| ≤ 1
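The per-step formula is worth checking numerically once. A sketch (numpy, illustrative sizes, with c standing in for W_x · x_t + b): build the analytic Jacobian diag(tanh'(z)) · W_h and compare it to central differences.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4
W_h = rng.normal(0, 0.5, (H, H))
h_prev = rng.normal(size=H)
c = rng.normal(size=H)                 # stands in for W_x·x_t + b

def step(h):
    return np.tanh(W_h @ h + c)

# Analytic: diag(tanh'(z)) · W_h, with tanh'(z) = 1 - tanh(z)^2
z = W_h @ h_prev + c
J = np.diag(1 - np.tanh(z) ** 2) @ W_h

# Numeric: central differences, one column per component of h_prev
eps = 1e-6
cols = [(step(h_prev + eps * e) - step(h_prev - eps * e)) / (2 * eps)
        for e in np.eye(H)]
J_num = np.stack(cols, axis=1)

print(np.allclose(J, J_num, atol=1e-7))  # True
```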

Two regimes, no middle ground. If σ_max(W_h) < 1, the product decays geometrically with each rewind — 0.9^100 ≈ 10⁻⁵ by the time you've walked the movie a hundred frames back. The gradient at t = 0 is a rumor of a rumor. Your RNN literally cannot learn long-range dependencies because the signal telling early weights how to change has been scrubbed into numerical noise. If σ_max(W_h) > 1, the product explodes — 1.1^100 ≈ 14000 — and one gradient step launches your model into the next dimension. This is Pascanu et al.'s 2013 result, and it's the whole reason LSTMs, GRUs, gradient clipping, and eventually transformers exist.
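Both regimes take a few lines to reproduce. A sketch under two simplifying assumptions: tanh' is held at its upper bound of 1, and W is symmetrized so its spectral radius equals its spectral norm (a generic non-symmetric W would decay even faster):

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 8, 100
A = rng.normal(size=(H, H))
W = (A + A.T) / 2                    # symmetric: spectral radius = spectral norm
W /= np.linalg.norm(W, 2)            # spectral norm exactly 1

norms = {}
for sigma in (0.9, 1.1):
    P = np.linalg.matrix_power(sigma * W, T)   # Π of 100 identical Jacobians
    norms[sigma] = np.linalg.norm(P, 2)

print(norms)                         # ≈ {0.9: 2.7e-05, 1.1: 1.4e+04}
```

Nine orders of magnitude separate the two products, and the only thing that changed was a 20% rescale of the same matrix.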

|∂L/∂h_t| — gradient magnitude backpropagating through an unrolled RNN
[interactive demo: a plot of |grad_t| ≈ (σ · ⟨tanh'⟩)^(N − t) against t, with a slider for the sequence length N. At the settings shown, |∂L/∂h_0| reads 2.0e-6: firmly in the vanishing regime.]

Slide the sequence-length knob. With N = 10, the gradient magnitude at h_0 is still a visible fraction of its value at h_T — trainable. By N = 30 it's orders of magnitude smaller. By N = 100 it's effectively zero. The curve is exactly the σ^t decay predicted above. The Jacobian product is doing what matrix products always do — latching onto the largest singular direction and either amplifying or annihilating it as the movie rewinds frame by frame.
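The same curve falls out of an actual backward pass. A numpy sketch (illustrative sizes, random inputs, an all-ones vector standing in for ∂L/∂h_N, and W_h rescaled to spectral norm 0.9):

```python
import numpy as np

rng = np.random.default_rng(2)
H, N = 8, 100
W_h = rng.normal(size=(H, H))
W_h *= 0.9 / np.linalg.norm(W_h, 2)           # σ_max(W_h) = 0.9: vanishing regime

# Forward: stash every pre-activation for the backward pass
hs, zs = [np.zeros(H)], []
for _ in range(N):
    z = W_h @ hs[-1] + rng.normal(size=H)
    zs.append(z)
    hs.append(np.tanh(z))

# Backward: dL/dh_(t-1) = (∂h_t/∂h_(t-1))ᵀ · dL/dh_t
dh = np.ones(H)                               # stand-in for ∂L/∂h_N
mags = [np.linalg.norm(dh)]
for t in range(N - 1, -1, -1):
    dh = W_h.T @ ((1 - np.tanh(zs[t]) ** 2) * dh)
    mags.append(np.linalg.norm(dh))

print(f"|dL/dh_N| = {mags[0]:.2f}, |dL/dh_0| = {mags[-1]:.2e}")
```

Since each rewind multiplies the gradient norm by at most 0.9, the magnitude at h_0 is guaranteed to be below 0.9^100 ≈ 2.7e-5 of its starting value, and in practice it is smaller still.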

Gradient product (personified)
I am Π ∂h_t/∂h_(t-1). I am a sequence of matrix multiplications, and I am ruthless. If the weights want me to shrink, I shrink to zero faster than you can plot me. If they want me to grow, I blow up past float32 by step 40. There is no stable middle. I am the reason your RNN cannot remember what happened 50 steps ago, and I am the reason LSTMs were invented to go around me.

Three implementations, each shorter than the last. Pure Python unrolls a three-frame movie by hand and writes every derivative out — you can see the sum-over-time-steps literally, one loop iteration per rewind. NumPy batches across the batch dimension and compresses each frame into three matrix ops. PyTorch hands the whole backwards scrub to autograd and adds the one line that keeps real RNN training from exploding: clip_grad_norm_.

layer 1 — pure python · bptt_scratch.py
python
import math

def tanh(z): return math.tanh(z)
def dtanh(z): return 1 - math.tanh(z) ** 2

# 2-dim hidden, 1-dim input, 3 time steps
W_h = [[0.5, -0.3], [-0.2, 0.4]]
W_x = [[0.6], [-0.1]]
b   = [0.0, 0.0]

x = [[1.0], [0.5], [-0.8]]     # inputs at t=1,2,3
target = [0.3, -0.2]           # target for h_3

def matvec(M, v):
    return [sum(M[i][j]*v[j] for j in range(len(v))) for i in range(len(M))]

def add(a, b): return [a[i] + b[i] for i in range(len(a))]

# ── Forward: unroll 3 steps, stash h and z for each ──
h = [[0.0, 0.0]]               # h_0 = zeros
z = [None]                     # z_0 is undefined
for t in range(3):
    z_t = add(add(matvec(W_h, h[t]), matvec(W_x, x[t])), b)
    h_t = [tanh(v) for v in z_t]
    z.append(z_t); h.append(h_t)

loss = 0.5 * sum((h[3][i] - target[i]) ** 2 for i in range(2))

# ── Backward: BPTT for 3 steps ──
# dL/dh at each t, propagated backward through time
dh_next = [h[3][i] - target[i] for i in range(2)]       # δ at t=T
dW_h = [[0.0, 0.0], [0.0, 0.0]]                         # accumulator — sum over t

for t in [3, 2, 1]:                                     # walk backward
    # through tanh:  dz_t = dh_t ⊙ tanh'(z_t)
    dz = [dh_next[i] * dtanh(z[t][i]) for i in range(2)]

    # contribution to W_h at this step:  dz ⊗ h_(t-1)
    for i in range(2):
        for j in range(2):
            dW_h[i][j] += dz[i] * h[t - 1][j]           # SUM over time

    # propagate to h_(t-1):  dh_(t-1) = W_hᵀ · dz
    dh_next = [sum(W_h[k][i] * dz[k] for k in range(2)) for i in range(2)]

print("dW_h from scratch:")
for row in dW_h: print(f"[{row[0]:+.4f} {row[1]:+.4f}]")
stdout
dW_h from scratch:
[-0.3286 +0.1031]
[+0.1385 -0.0347]
layer 2 — numpy batched · bptt_numpy.py
python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                       # hidden size, input size
T, N = 10, 32                     # sequence length, batch size
W_h = rng.normal(0, 0.3, (H, H))
W_x = rng.normal(0, 0.3, (H, D))
b   = np.zeros(H)

def forward(X):                   # X: (T, N, D)
    h = np.zeros((T + 1, N, H))
    z = np.zeros((T + 1, N, H))
    for t in range(T):
        z[t + 1] = h[t] @ W_h.T + X[t] @ W_x.T + b
        h[t + 1] = np.tanh(z[t + 1])
    return h, z

def bptt(X, target):              # target: (N, H) — compare to h_T
    h, z = forward(X)
    loss = 0.5 * ((h[T] - target) ** 2).sum(axis=1).mean()   # sum over features, mean over batch

    dW_h = np.zeros_like(W_h)
    dW_x = np.zeros_like(W_x)
    db   = np.zeros_like(b)

    dh = (h[T] - target) / N                                # δ at t=T  — shape (N, H)
    for t in range(T, 0, -1):                               # walk backward in time
        dz = dh * (1 - np.tanh(z[t]) ** 2)                  # through tanh        (N, H)
        dW_h += dz.T @ h[t - 1]                             # accumulate — SUM over t
        dW_x += dz.T @ X[t - 1]
        db   += dz.sum(axis=0)
        dh    = dz @ W_h                                    # propagate to h_(t-1)
    return loss, dW_h, dW_x, db
pure python → numpy
for i … for j: dW_h[i][j] += dz[i] * h[t-1][j]  ←→  dW_h += dz.T @ h[t-1]
    the outer product across the batch — one line
for t in [3, 2, 1]: …  ←→  for t in range(T, 0, -1): …
    the backward-time loop is the same shape, just longer
one example at a time  ←→  N examples vectorized — dz.T @ h[t-1] handles it
    batching lives in the matmul, not the loop
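The same check works without autograd: compare the hand-rolled gradient to central differences on the loss. A self-contained sketch (small illustrative sizes; it redefines a minimal forward and backward rather than importing the layer-2 code):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 3, 2, 5
W_h = rng.normal(0, 0.4, (H, H))
W_x = rng.normal(0, 0.4, (H, D))
X = rng.normal(size=(T, D))
target = rng.normal(size=H)

def loss_of(W):
    """Forward pass with recurrent weights W, returning the scalar loss."""
    h = np.zeros(H)
    for t in range(T):
        h = np.tanh(W @ h + W_x @ X[t])
    return 0.5 * np.sum((h - target) ** 2)

# Analytic gradient via BPTT
hs, zs = [np.zeros(H)], []
for t in range(T):
    z = W_h @ hs[-1] + W_x @ X[t]
    zs.append(z); hs.append(np.tanh(z))
dW = np.zeros_like(W_h)
dh = hs[-1] - target
for t in range(T - 1, -1, -1):
    dz = dh * (1 - np.tanh(zs[t]) ** 2)
    dW += np.outer(dz, hs[t])               # SUM over time
    dh = W_h.T @ dz

# Numeric gradient, entry by entry
eps, dW_num = 1e-6, np.zeros_like(W_h)
for i in range(H):
    for j in range(H):
        Wp, Wm = W_h.copy(), W_h.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        dW_num[i, j] = (loss_of(Wp) - loss_of(Wm)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6))  # True
```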

layer 3 — pytorch + gradient clipping · bptt_pytorch.py
python
import torch, torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, D=3, H=4):
        super().__init__()
        self.cell = nn.RNNCell(D, H, nonlinearity='tanh')
        self.out  = nn.Linear(H, H)

    def forward(self, x_seq, h0):
        h = h0
        for t in range(x_seq.size(0)):
            h = self.cell(x_seq[t], h)                 # every t reuses the same weights
        return self.out(h)

model = SimpleRNN()
opt   = torch.optim.SGD(model.parameters(), lr=0.1)

T, N, D, H = 20, 32, 3, 4
x = torch.randn(T, N, D)
y = torch.randn(N, H)
h0 = torch.zeros(N, H)

for step in range(101):
    opt.zero_grad()
    h0 = h0.detach()                                    # cut the graph between batches (illustrative here; essential whenever h carries over)
    yhat = model(x, h0)
    loss = 0.5 * (yhat - y).pow(2).mean()
    loss.backward()                                     # autograd runs BPTT for the full T steps

    # Gradient clipping — required for RNN training
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    opt.step()
    if step in (0, 50, 100):
        print(f"step {step:3d}  loss={loss.item():.3f}  grad_norm(pre-clip)={grad_norm:.2f}")
stdout
step  0  loss=1.021  grad_norm(pre-clip)=4.82
step 50  loss=0.173  grad_norm(pre-clip)=0.97
step 100 loss=0.041  grad_norm(pre-clip)=0.44
numpy → pytorch
for t in range(T, 0, -1): dW_h += dz.T @ h[t-1]  ←→  loss.backward()  # one call
    autograd replays the unrolled graph in reverse for you
manual sum over time steps  ←→  accumulated in .grad automatically
    shared-weight summation is free once the graph tracks it
(nothing — gradients explode silently)  ←→  clip_grad_norm_(params, max_norm=1.0)
    the one safety line. never train an RNN without it.

Gotchas

Forgetting to detach the hidden state between batches. If you carry h from one mini-batch to the next without calling h = h.detach(), autograd's graph keeps growing. By batch 100 you're backpropagating through the entire training history — memory blows up and the backward pass takes forever. detach() severs the graph while keeping the numerical value of the hidden state. Do it every batch boundary.

Truncation window too short. If the task has dependencies 200 steps apart but you only backprop 20 steps, the model has no gradient signal for that dependency. It can't learn it. Symptom: training loss plateaus at a value that corresponds to “learned the short-range structure, can't learn the long-range structure.”

Truncation window too long. If you backprop k = 2000 on a moderately sized model, you will run out of GPU memory. Activations from every time step in the window have to be cached, so activation memory grows roughly as O(k · batch · hidden).

Skipping the clip. First training run diverges in 50 steps. You blame the learning rate, the init, the dataset. It's the clip. Add it. Move on.
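The first three gotchas meet in truncated BPTT, which is just the detach trick applied every k steps. A sketch (hypothetical window k = 20 and sizes; nn.RNNCell as in layer 3):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, H, N = 3, 4, 8
cell = nn.RNNCell(D, H)
opt = torch.optim.SGD(cell.parameters(), lr=0.05)

k = 20                                     # truncation window (assumption)
x = torch.randn(200, N, D)                 # one long sequence
y = torch.randn(200, N, H)

h = torch.zeros(N, H)
for start in range(0, x.size(0), k):
    h = h.detach()                         # sever the graph: backprop at most k steps
    opt.zero_grad()
    loss = 0.0
    for t in range(start, min(start + k, x.size(0))):
        h = cell(x[t], h)
        loss = loss + 0.5 * (h - y[t]).pow(2).mean()
    loss.backward()                        # BPTT over this window only
    torch.nn.utils.clip_grad_norm_(cell.parameters(), max_norm=1.0)
    opt.step()
```

The hidden state's value flows across window boundaries, but the gradient does not; that is the entire trade the gotchas above are negotiating.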

Verify your hand-rolled BPTT against autograd

Take the pure-Python code above, shrink it to T = 2 (two time steps), and compute dW_h by hand. Now build the same RNN in PyTorch with nn.RNNCell, feed the same inputs and targets, call loss.backward(), and read off cell.weight_hh.grad. The two matrices should match to within floating-point tolerance (to six decimal places if you run both in float64). If they don't, you forgot to sum a time-step contribution — go find the missing term. This is the exact test every autograd engine ships with.

Extension: flip h.detach() on and off in the PyTorch version and watch memory usage climb across batches. The bug is silent until you profile.

What to carry forward. BPTT is just backprop on the unrolled movie, plus one bookkeeping rule: sum the gradient contributions over every frame where the shared weight appears. The bad news — and the hook into the next lesson — is that ∂h_T/∂h_0 = Π ∂h_t/∂h_(t-1). Products of matrices misbehave. If the spectral norm of W_h is less than one, the product vanishes and long-range gradients die. If it's greater than one, the product explodes and training diverges. Truncated BPTT controls compute; gradient clipping controls explosions; neither controls vanishing.

Next up — The Vanishing Gradient Problem. We've named the failure mode. Now we stare at it: walking that long chain in reverse has a cost — the signal weakens with every frame you rewind through, and after a few dozen rewinds it's gone. We'll derive exactly how fast the gradient dies when it has to travel 100 frames back through a chain of tanh Jacobians, why every vanilla RNN suffers from it, and why the fix isn't a better optimizer but a different architecture — one with a gradient highway built directly into the cell. That architecture is the LSTM, and it's what the next few lessons build.
