Recurrent Neural Network

Hidden state, shared weights, sequential processing.

Medium
~15 min read · lesson 1 of 5

Picture a note-taker sitting at a desk with exactly one sticky note. A stream of words is being read out loud, one at a time. The note-taker is not allowed to write a second sticky. They can't flip the sticky over. They can't ask for a bigger one. They are allowed to do exactly one thing: after each word, erase the sticky and rewrite it with a condensed summary of everything that matters so far.

That's an RNN. The whole idea. Before any math, before any code — that's the thing. Now let's figure out why we need one in the first place, because every network we've built so far has had a very specific allergy.

Everything we've built so far assumes one thing about its input: it has a fixed shape. An MLP expects exactly 784 pixels, or exactly 30 tabular features, or exactly whatever you wired its input layer to take. Change the length and the whole network shrugs and refuses.

Now try to feed it a sentence. “Hi.” is three characters. “I really loved this movie, can't wait to see it again.” is fifty-four. A speech waveform has hundreds of thousands of samples. A stock ticker never stops. Variable length is the default for anything time-shaped, and a feedforward net is architecturally allergic to it.

The recurrent neural network is the first architecture we'll meet that handles sequences natively. It solves the problem the way our note-taker solves it: instead of demanding the whole sequence in one giant input vector, read it one step at a time, and keep a little running summary between steps. That summary is the sticky note. The rule for rewriting the sticky is exactly one function — applied over and over, with the same weights. That's it. That's the whole idea. The rest of this lesson is the consequences.

Feedforward net (personified)
I expect an input of shape [B, 784]. If you give me [B, 912] I raise a shape error and crash. If you give me one example of length 3 and another of length 7 in the same batch, I raise a shape error and crash. I am perfect at images. I am allergic to sentences.

Here is the note-taker's rewrite rule, written as math. At each time step t, read the current word x_t and look at what's on the sticky from a moment ago, h_{t-1}. Mix them with two weight matrices, add a bias, and squash with a tanh. The output replaces whatever was on the sticky:

the RNN recurrence
h_t  =  tanh( W_x · x_t   +   W_h · h_{t-1}   +   b )

y_t  =  W_y · h_t   +   b_y                    (optional per-step output)

Two weight matrices, one bias, one nonlinearity. W_x maps the new word into sticky-note space. W_h maps the old sticky into the same space so the two can be added together. tanh keeps the sticky's entries in (-1, 1) so the running summary doesn't blow up as we iterate.

The striking thing isn't what's in the formula — it's what's not. There is no t subscript on W_x, W_h, or b. The same parameters are used at every time step. The rewrite rule doesn't change depending on which word we're on. That's the “recurrent” part — one function, called in a loop, threading the sticky through.

The cleanest way to see it is to unroll the loop. The picture below draws the same RNN cell once per time step, like a cartoon strip. Click a step and watch which sticky gets computed, and which x_t and h_{t-1} feed into it.

unrolled RNN — one cell, called T times
h_t = tanh(W_x·x_t + W_h·h_(t-1) + b)
[interactive figure: the same cell (W_x, W_h, b) drawn once per time step, threading the hidden state through: h_0 → h_1 = 0.24 → h_2 = 0.15 → h_3 = 0.59 → h_4 = 0.76 → h_5 = 0.88 → output]
click a time step to highlight the recurrence
selected step t = 3:  x_t = 0.80, h_(t-1) = 0.149
h_3 = tanh(0.85·0.80 + 0.7·0.149 − 0.1) ≈ 0.594

Pay attention to one detail: the little box that says W_x, W_h, b is the same box in every panel. We didn't duplicate the parameters — we reused them. The drawing makes the network look like a deep feedforward stack, and in an important sense it is one: an N-step RNN is a depth-N network where every layer happens to share weights. Hold onto that — we'll use exactly that insight next lesson when we train it.
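The highlighted step in the figure is small enough to check by hand. Here is a one-line sanity check using the toy scalar weights from the diagram (W_x = 0.85, W_h = 0.7, b = -0.1 are illustrative values from the figure, not learned ones):

```python
import math

# toy scalar weights from the unrolled-RNN figure (illustrative, not learned)
W_x, W_h, b = 0.85, 0.7, -0.1

x_t, h_prev = 0.80, 0.149              # inputs to the highlighted step t = 3
h_t = math.tanh(W_x * x_t + W_h * h_prev + b)
print(round(h_t, 3))                   # → 0.594
```

Same formula, scalar instead of vector: new input in, old sticky in, one tanh, new sticky out.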

Hidden state (personified)
I am a small vector — maybe 128 numbers, maybe 1024. I'm the only thing that survives from one time step to the next. Everything the network wants to remember — who the subject of the sentence was, whether we're inside a quote, what the last chord was — has to be encoded somewhere in me. I am the entire memory of the model. Treat me with respect.

Time to watch a real sticky note get rewritten. Feed the string "hello world" into a small RNN, character by character, and track the 16 sticky-note dimensions as a heatmap. Rows are dimensions, columns are time steps — each column is one rewrite, h_t for one character.

hidden state as heatmap — dims × time
char-RNN · H = 16 · tanh
h[dim, t] — amber = positive · cyan = negative
[interactive figure: 16 rows (hidden dimensions) × 11 columns, one column per character of "hello world"]
selected column t = 11, char 'd' — h_11 = h[:, 10]:
    dims 0–7:    0.85  -0.66  -1.00  -0.99   0.99   0.76  -0.99   0.53
    dims 8–15:   0.68  -0.39   0.97  -0.79   0.16  -0.53   0.38  -0.83
‖h_t‖ = 3.048

Some dimensions change sharply on a vowel and quiet down on a consonant. Some flip sign on the space. Some slowly drift upward over the whole sequence. Individually they're unreadable; collectively they're the note-taker's shorthand — the pattern of marks on a sticky that, to a fluent reader (the next layer), encodes everything worth carrying forward. Training is the process of teaching the RNN which shorthand is worth keeping.
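If you want to generate a dims × time matrix like that yourself, here is a minimal sketch using random, untrained weights. The exact values won't match the figure, but the shape and the (-1, 1) range will:

```python
import numpy as np
rng = np.random.default_rng(0)

text  = "hello world"
vocab = sorted(set(text))                  # 8 distinct characters
stoi  = {c: i for i, c in enumerate(vocab)}
D, H  = len(vocab), 16

W_x = rng.standard_normal((H, D)) * 0.5    # random, untrained weights
W_h = rng.standard_normal((H, H)) * 0.5
b   = np.zeros(H)

h, cols = np.zeros(H), []
for ch in text:                            # one sticky rewrite per character
    x = np.zeros(D); x[stoi[ch]] = 1.0     # one-hot the current character
    h = np.tanh(W_x @ x + W_h @ h + b)
    cols.append(h)

heat = np.stack(cols, axis=1)              # (H, T): rows = dims, cols = time
print(heat.shape)                          # (16, 11)
```

Each column is one h_t; plot `heat` with any heatmap tool and you get the figure above, minus the training that makes the patterns meaningful.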

Let's put a number on the weight-sharing claim. Suppose the sticky has size H = 128 and each word is encoded in D = 64 dims. For a sequence of length T = 100:

parameter count — shared vs. unshared
shared (an RNN):
    |W_x|  =  H · D      =   128 · 64   =    8,192
    |W_h|  =  H · H      =   128 · 128  =   16,384
    |b|    =  H          =                      128
    total                                    24,704           ← independent of T

unshared (one W per step, a straw man):
    T · (|W_x| + |W_h| + |b|)   =   100 · 24,704   =   2,470,400

ratio                                                    ≈  100×
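Those counts are worth a quick check in code:

```python
H, D, T = 128, 64, 100                 # hidden size, input size, sequence length

shared   = H * D + H * H + H           # W_x + W_h + b — independent of T
unshared = T * shared                  # a fresh copy of all three per step

print(shared, unshared, unshared // shared)    # 24704 2470400 100
```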

A factor of 100, linear in sequence length. That's not even the real win, though. The real win is generalisation.

If the pattern “the word not flips sentiment” shows up in training at position 3, you want the model to apply it at position 17 too. With a shared rewrite rule, you get that for free — the same W_x processes every token, so a trick learned at one position is automatically a trick at all positions. An unshared model would need its own little rewrite rule for every step, and the first time test data put “not” where training data didn't, it would shrug.

Shared weights (personified)
I am stateless. I am a pure function of (x_t, h_{t-1}). I don't know what time step it is, I don't care what time step it is. You taught me a pattern once; I'll apply it at step 4, step 40, and step 400 with the same vigor. That's the deal.

Three implementations of the same note-taker. Pure Python with an explicit loop, NumPy vectorised over a batch, PyTorch's nn.RNN in one call. Each is a rung up the abstraction ladder; each earns its keep in a different moment of your life.

layer 1 — pure python · rnn_scratch.py
python
import math, random
random.seed(0)

D, H = 8, 16                             # input dim, hidden dim
vocab = sorted(set("helo "))             # tiny char set
stoi  = {c: i for i, c in enumerate(vocab)}

def onehot(c):                           # D-long one-hot, D = len(vocab) padded to 8
    v = [0.0] * D
    v[stoi[c]] = 1.0
    return v

def randmat(r, c): return [[random.gauss(0, 0.1) for _ in range(c)] for _ in range(r)]
def matvec(M, v):  return [sum(M[i][j]*v[j] for j in range(len(v))) for i in range(len(M))]
def vecadd(a, b):  return [x + y for x, y in zip(a, b)]
def tanh(v):       return [math.tanh(x) for x in v]

W_x = randmat(H, D)                      # H×D
W_h = randmat(H, H)                      # H×H
b   = [0.0] * H

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return tanh(vecadd(vecadd(matvec(W_x, x_t), matvec(W_h, h_prev)), b))

h = [0.0] * H                            # h_0 — the initial state
for t, ch in enumerate("hello"):
    h = rnn_step(onehot(ch), h)
    print(f"step {t}  x='{ch}'  h_t[:4]={[round(v, 3) for v in h[:4]]}")
stdout
step 0  x='h'  h_t[:4]=[0.123, -0.441, 0.087, 0.612]
step 1  x='e'  h_t[:4]=[0.298, -0.201, 0.315, 0.481]
step 2  x='l'  h_t[:4]=[0.517, 0.082, 0.294, 0.203]
step 3  x='l'  h_t[:4]=[0.604, 0.221, 0.188, 0.015]
step 4  x='o'  h_t[:4]=[0.711, 0.354, -0.073, -0.218]

That h = rnn_step(onehot(ch), h) line is the note-taker erasing and rewriting their sticky. One function in, one function out. Now NumPy. Same math, but we broadcast over a whole batch of sequences at once — the shape every real RNN lives in.

layer 2 — numpy · rnn_numpy.py
python
import numpy as np
rng = np.random.default_rng(0)

B, T, D, H = 4, 5, 8, 16                 # batch, time, input, hidden

X   = rng.standard_normal((B, T, D)) * 0.5      # a batch of B sequences, each length T
W_x = rng.standard_normal((D, H)) * 0.1         # note the (D, H) shape — matches x @ W
W_h = rng.standard_normal((H, H)) * 0.1
b   = np.zeros(H)

def rnn_forward(X, W_x, W_h, b):
    B, T, _ = X.shape
    H = W_h.shape[0]
    h = np.zeros((B, H))                        # h_0 for every example in the batch
    H_seq = np.zeros((B, T, H))                 # save every h_t — useful for backprop later
    for t in range(T):
        # x_t: (B, D)   W_x: (D, H)   → (B, H)
        # h  : (B, H)   W_h: (H, H)   → (B, H)
        h = np.tanh(X[:, t] @ W_x + h @ W_h + b)
        H_seq[:, t] = h
    return H_seq, h                             # whole sequence + last step

H_seq, h_final = rnn_forward(X, W_x, W_h, b)
print("h_final shape:", h_final.shape)
print("h_final[0, :4]:", np.round(h_final[0, :4], 2))
stdout
h_final shape: (4, 16)     # (batch, hidden)
h_final[0, :4]: [ 0.71  0.35 -0.07 -0.22]
pure python → numpy
for ch in "hello": h = rnn_step(...)   ←→   for t in range(T): h = np.tanh(X[:, t] @ W_x + h @ W_h + b)

the time loop stays — we can't vectorise across time, only across batch

matvec(W_x, onehot(c))   ←→   X[:, t] @ W_x

one matmul replaces B × H × D scalar multiplies

h = [0.0] * H   ←→   h = np.zeros((B, H))

one h_0 per example in the batch, all at once

Notice the time loop survives the move to NumPy. You can parallelise a thousand note-takers side by side (the batch), but each one still has to rewrite their own sticky one word at a time — the current sticky literally doesn't exist until the previous one is done. In production you almost never write that loop by hand. PyTorch ships nn.RNN, which fuses the whole thing into a single optimised kernel.

layer 3 — pytorch · rnn_pytorch.py
python
import torch
import torch.nn as nn

B, T, D, H = 4, 5, 8, 16
x = torch.randn(B, T, D)                        # batch-first tensor

rnn = nn.RNN(
    input_size   = D,
    hidden_size  = H,
    num_layers   = 1,
    nonlinearity = "tanh",                      # or "relu" — tanh is the classic
    batch_first  = True,                        # expect (B, T, D), not (T, B, D)
)

h0  = torch.zeros(1, B, H)                      # (num_layers, B, H) — required shape
out, h_n = rnn(x, h0)                           # out: every h_t, h_n: last h_t

print("out shape:  ", out.shape)                # (B, T, H)
print("h_n shape:  ", h_n.shape)                # (1, B, H)
print("matches manual last step?", torch.allclose(out[:, -1], h_n[0]))
stdout
out shape:   torch.Size([4, 5, 16])     # (B, T, H)
h_n shape:   torch.Size([1, 4, 16])     # (num_layers, B, H)
matches manual last step? True
numpy → pytorch
for t in range(T): h = np.tanh(...)   ←→   out, h_n = rnn(x, h0)

the whole time loop collapses into one fused kernel

W_x, W_h, b = rng.standard_normal(...)   ←→   nn.RNN(input_size=D, hidden_size=H)

PyTorch manages params, initialization, autograd, and CUDA in one object

H_seq, h_final = rnn_forward(...)   ←→   out, h_n = rnn(x, h0)   # same two objects

out is all h_t; h_n is just the last — most code uses out for seq tasks and h_n for classification
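A good way to convince yourself that nn.RNN really is the NumPy loop in a fused kernel: pull out its parameters and replay the recurrence by hand. One wrinkle: PyTorch stores two biases per layer (bias_ih_l0 and bias_hh_l0) where our NumPy version had one; they are simply summed at each step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D, H = 4, 5, 8, 16
x   = torch.randn(B, T, D)
rnn = nn.RNN(input_size=D, hidden_size=H, batch_first=True)

out, h_n = rnn(x)                               # h_0 defaults to zeros

# replay the same recurrence with the layer's own parameters
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0   # (H, D) and (H, H)
b_ih, b_hh = rnn.bias_ih_l0,  rnn.bias_hh_l0      # two biases, summed each step
h = torch.zeros(B, H)
for t in range(T):
    h = torch.tanh(x[:, t] @ W_ih.T + h @ W_hh.T + b_ih + b_hh)

print(torch.allclose(h, h_n[0], atol=1e-5))     # → True
```

Same three matmuls per step, same tanh; the kernel just runs them without Python in the way.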

Gotchas

Forgetting h0: nn.RNN accepts None and will default h_0 to zeros, which is fine for the first batch of a new sequence. If you're continuing a sequence across batches (e.g. char-level LM with truncated BPTT), you must pass the previous h_n as the next h_0. Otherwise you've reset memory every batch and the model can't learn anything beyond T steps.

Batch-first vs. seq-first: PyTorch's default is (T, B, D), which is fast but reads like a Rorschach blot. Pass batch_first=True to get (B, T, D), which matches every other tensor in your codebase. Pick one and be consistent — mixing them silently broadcasts garbage.

Not detaching the hidden state: when you continue a sequence across batches, pass h.detach() as the next h_0, not raw h. If you don't, autograd keeps the graph alive all the way back to the start of training, and a few hundred steps in the gradient graph is hundreds of thousands of ops long. Memory balloons, then gradients explode. The fix is one method call.
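The second and third gotchas combine into one pattern worth sketching: carry h_n across chunks, detached. A minimal sketch of truncated BPTT with made-up data and the loss/optimizer step elided:

```python
import torch
import torch.nn as nn

B, T, D, H = 2, 10, 8, 16
rnn    = nn.RNN(D, H, batch_first=True)
stream = torch.randn(B, 4 * T, D)          # one long sequence, made-up data

h = None                                   # first chunk: h_0 defaults to zeros
for start in range(0, stream.size(1), T):
    chunk = stream[:, start:start + T]
    out, h_n = rnn(chunk, h)
    # ... compute loss on `out`, loss.backward(), optimizer.step() ...
    h = h_n.detach()                       # carry the memory, cut the graph
```

Without the `.detach()`, every chunk's graph stays chained to all previous chunks; with it, memory stays flat and gradients only flow back T steps.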

A Shakespeare-flavoured RNN

Grab the tiny-Shakespeare corpus (about 1MB of plain text, easy to find). Build a character vocabulary — roughly 65 characters. Train a one-layer nn.RNN with hidden_size=256 on the task of predicting the next character given the previous 100. Use cross-entropy loss on the per-step output projection W_y · h_t, batch size 32, Adam at 1e-3.

After each epoch, sample a 200-character passage: start from a random seed, feed the model one character at a time, sample the next one from the softmax output, and loop. Print the sample.
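The sampling loop is short enough to sketch here. This version uses an untrained model and a stand-in vocabulary, so the output is pure gibberish; swap in your trained rnn, your output head (the W_y projection), and the real character set:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = sorted(set("abcdefghij \n"))       # stand-in vocab; use the corpus's real one
V, H  = len(vocab), 256
itos  = dict(enumerate(vocab))

rnn  = nn.RNN(V, H, batch_first=True)      # untrained — samples will be gibberish
head = nn.Linear(H, V)                     # the per-step projection W_y · h_t

h, idx, out_chars = None, torch.tensor([0]), []
for _ in range(200):
    x = torch.zeros(1, 1, V)
    x[0, 0, idx] = 1.0                     # one-hot the current character
    o, h = rnn(x, h)                       # sticky note carries across iterations
    probs = torch.softmax(head(o[0, -1]), dim=-1)
    idx   = torch.multinomial(probs, 1)    # sample from the softmax, don't argmax
    out_chars.append(itos[idx.item()])

print("".join(out_chars))
```

Note the hidden state h threads through the loop exactly as in training; sampling is just the forward pass with the model's own guess fed back in as the next input.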

Expect nonsense at epoch 1. By epoch 3 you'll see real English words. By epoch 10 you'll have something that looks like Shakespeare — word-like spacing, quotation marks in the right places, character names showing up — while being completely, gloriously meaningless. This is the experiment that made Karpathy's 2015 post famous, and it still works. Remember to detach() the hidden state between batches, or you will find out what that gotcha was about.

What to carry forward. An RNN is a feedforward network run in a loop, with a single sticky note — the hidden state — threaded from step to step. The same rewrite rule is used at every step, which lets it handle sequences of any length and generalise patterns across positions. Its entire expressive power comes from the single recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b). Its entire API is three weights and a bias.

Next up — Backprop Through Time. The forward pass is easy — the note-taker rewrites their sticky once per word and we're done. The backward pass has to walk through every step of that unrolled sequence in reverse, applying the chain rule to the same shared weights at every time step. That's backprop-through-time, and that's where things get interesting.
