GRU

A lighter LSTM that often matches it.

Medium · ~15 min read · lesson 5 of 5

Last lesson we built the LSTM — a filing cabinet with three gates, a separate long-term drawer (the cell state), and a working-copy drawer (the hidden state) that shadowed it through time. It worked. It ran speech recognition, machine translation, and the first generation of neural language models for nearly two decades. But look at it sitting on the desk: two drawers, four weight matrices, three gates, and every batch you ever train has to drag all of that through every time step.

Someone was going to ask the obvious question. Do we actually need all of this?

In 2014 Kyunghyun Cho and collaborators answered: no. They proposed the Gated Recurrent Unit, which is the LSTM cabinet sent to IKEA and shipped back flat-packed with fewer parts. Merge the two drawers into one — no separate cell state, just a single hidden state. Merge the forget and input gates into a single update gate — because what you don't keep is exactly what you write, so why were they ever two decisions? Keep one small extra gate, the reset gate, whose only job is to decide how much of the old file still belongs in the new draft. Three weight matrices instead of four. One shared drawer. Same answers, fewer parts.

This is the last lesson of the RNN section. By the end of it you'll have derived the GRU equations, watched its two gates move in a widget, counted the parameters against LSTM line by line, and landed on the honest practitioner's verdict: try both, pick the winner, move on.

GRU (personified)
The LSTM built a filing cabinet with two drawers and three gates. I looked at the blueprint and asked: what if one drawer, two gates? Same memory, 25% fewer screws. If the LSTM is a full-service archive, I am the streamlined version — and for most of what you'll ever ask a sequence model to do, you won't notice the missing parts.

Before the equations, a picture. One drawer goes in (the hidden state from last step), one drawer comes out (the new hidden state). In between, two gates make decisions and one candidate calculation proposes what the new drawer contents could look like. Watch the gate names: reset decides what of the past is still relevant; update decides how much of the new memo replaces the old file.

                       ┌──────────────────────────────────────┐
                       │                GRU cell              │
      h_{t-1} ────────►│                                      │──► h_t
                       │                                      │
      x_t ────────────►│   r_t = σ(W_r x + U_r h_{t-1})      │
                       │        ▲ reset gate                  │
                       │                                      │
                       │   z_t = σ(W_z x + U_z h_{t-1})      │
                       │        ▲ update gate                 │
                       │                                      │
                       │   ĥ   = tanh(W_h x + U_h (r ⊙ h_{t-1}))
                       │        ▲ candidate hidden            │
                       │                                      │
                       │   h_t = (1 − z) ⊙ h_{t-1}            │
                       │         + z ⊙ ĥ                      │
                       │        ▲ linear interpolation        │
                       └──────────────────────────────────────┘
a GRU cell — one shared drawer, two gates, one candidate

Four equations, read top to bottom as one pass through the cell at time step t. Input: x_t (the new memo) and h_{t-1} (the drawer's current contents). Output: h_t (the drawer's new contents). The two sigmoid gates squash into (0, 1) — soft switches. The tanh candidate proposes a value in (-1, 1).

GRU — all four equations
r_t  =  σ( W_r x_t  +  U_r h_{t-1}  +  b_r )                    reset gate

z_t  =  σ( W_z x_t  +  U_z h_{t-1}  +  b_z )                    update gate

ĥ_t  =  tanh( W_h x_t  +  U_h ( r_t ⊙ h_{t-1} )  +  b_h )       candidate

h_t  =  ( 1 − z_t ) ⊙ h_{t-1}   +   z_t ⊙ ĥ_t                   new hidden state

Stare at that last line. It's a linear interpolation between the old drawer and the candidate, with z_t as the dial. Set z_t = 0 and the cell just copies h_{t-1} forward — the old file stays, nothing gets written. Set z_t = 1 and the cell throws out the old contents and commits entirely to the new candidate — full overwrite. Every value in between is a learned blend.

That one equation is doing the work LSTM needed two gates to do. The old f · c_{t-1} + i · ĉ formula had a forget gate and an input gate that could in principle fight each other — forget a lot and write a lot, or keep everything and write nothing, weirdly consistent but wasteful. GRU ties them together by construction: whatever fraction of the old file you toss is exactly the fraction of the new memo you write. One dial, not two.
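The interpolation is simple enough to check by hand. A toy sketch (the vectors here are made-up illustration values, not trained states):

```python
import numpy as np

# toy check of h_t = (1 - z) * h_prev + z * h_cand
h_prev = np.array([0.7, -0.2])    # old drawer contents
h_cand = np.array([-0.5, 0.9])    # proposed new contents

def interpolate(z):
    return (1 - z) * h_prev + z * h_cand

print(interpolate(0.0))   # z = 0: pure copy of h_prev, nothing written
print(interpolate(1.0))   # z = 1: full overwrite with the candidate
print(interpolate(0.4))   # z = 0.4: a learned blend, ≈ [0.22, 0.24]
```

Note the endpoints: at z = 0 the cell is a wire, at z = 1 it's a fresh write, and every gradient-friendly point in between is available to the optimizer.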

Drag the sliders below. Push z_t toward 0 and watch h_t track h_{t-1} exactly — the drawer doesn't change. Push it toward 1 and h_t snaps to the candidate — the drawer gets a new file. Push r_t toward 0 and the candidate forgets its own history: it becomes a function of x_t alone, as if the cell opened a blank folder and didn't consult the old one at all. Two gates. One drawer. Watch them work.

GRU — two gates, one interpolation
ĥ_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)        h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ ĥ_t
r gates what the candidate sees of history. z mixes old state with new candidate.
[interactive widget: sliders for h_{t-1}, x_t, r_t, and z_t. h_t lives on a line between h_{t-1} and ĥ_t — z_t is where it lands. Example setting: h_{t-1} = 0.70, x_t = 0.50, r = 0.60, z = 0.40 → ĥ_t = 0.527, h_t = 0.631.]
Update gate z_t (personified)
I am the interpolator. Every time step I pick a number between 0 and 1, elementwise, for how much of the candidate to let into the drawer — and by subtraction, how much of the old file stays. Write too fast and you overwrite memory you needed. Write too slow and the drawer never gets updated. I am the whole reason this streamlined cabinet can hold a fact across a hundred time steps and also swap that fact when it stops being true. I used to be two separate gates in the LSTM days. Now I am one.

How much cabinet do you save by going IKEA? Count the matrices. With input dimension d_x and hidden dimension d_h, the weight budget breaks down like this.

parameter budget — GRU vs LSTM
LSTM                                    GRU

forget gate   W_f ∈ ℝ^{d_h×(d_x+d_h)}   reset gate    W_r ∈ ℝ^{d_h×(d_x+d_h)}
input gate    W_i ∈ ℝ^{d_h×(d_x+d_h)}   update gate   W_z ∈ ℝ^{d_h×(d_x+d_h)}
output gate   W_o ∈ ℝ^{d_h×(d_x+d_h)}   candidate     W_h ∈ ℝ^{d_h×(d_x+d_h)}
candidate     W_c ∈ ℝ^{d_h×(d_x+d_h)}

─────────────────────────────           ─────────────────────────────
4 matrices                              3 matrices

total weights  ≈ 4 · d_h · (d_x + d_h)  total weights ≈ 3 · d_h · (d_x + d_h)

                    ratio GRU / LSTM  =  3/4  =  0.75

Exactly 25% fewer parameters, same input and hidden sizes. Not a marketing number — a direct consequence of one fewer gate, one fewer matrix. On a d_h = 512 hidden size with d_x = 512 input, LSTM sits at ~2.1M parameters, GRU at ~1.6M. Stack that into a six-layer encoder and the savings compound into real megabytes.

The parameter ratio ripples outward. 25% less memory for the weights. 25% less memory for the Adam moments (two extra copies of every parameter, which people forget when they budget GPU memory). 25% fewer FLOPs per step. On constrained hardware — phones, edge devices, real-time audio, anything where you're fighting for milliseconds or milliwatts — that ratio is the whole reason you reach for GRU instead of LSTM.
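The budget arithmetic is worth doing once by hand. A sketch assuming fp32 and plain Adam (weights, gradients, and the two moment buffers, four bytes each):

```python
# parameter counts, biases included: k * d_h * (d_x + d_h + 1)
d_x, d_h = 512, 512
lstm_params = 4 * d_h * (d_x + d_h + 1)    # 2,099,200
gru_params  = 3 * d_h * (d_x + d_h + 1)    # 1,574,400

# training memory per model: weights + gradients + Adam's m and v moments
bytes_per_param = 4 * (1 + 1 + 2)          # fp32, 4 tensors per parameter
print(lstm_params * bytes_per_param / 1e6)   # ≈ 33.6 MB
print(gru_params  * bytes_per_param / 1e6)   # ≈ 25.2 MB
print(gru_params / lstm_params)              # 0.75, exactly
```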

Parameter count is one axis. What you actually get for the savings is another — and the only honest way to answer that is to run them side-by-side. The widget below does exactly the experiment you'd run yourself: same task, same hidden size, same optimizer, GRU and LSTM bolted up next to each other. Watch the parameter count, the wall-clock training time, and the final accuracy tick in parallel.

GRU vs. LSTM — what you gain by dropping a gate
4 matrices → 3 matrices · numbers from Chung et al. 2014 / Jozefowicz et al. 2015

                                          LSTM         GRU
parameters (weights, d_x = d_h = 512)     2,099,200    1,574,400
training step time (relative)             1.00×        0.75×
Penn Treebank LM accuracy (%)             78.4         78.1

parameter formulas
LSTM: 4 · d_h · (d_x + d_h + 1)
GRU:  3 · d_h · (d_x + d_h + 1)
→ ratio GRU/LSTM = 3/4 = 0.75, always; GRU speedup = 1/0.75 ≈ 1.33×

long-range language modeling — LSTM edges by 0.3 points, within noise

This is the pattern Chung et al. (2014) and Jozefowicz et al. (2015) found when they ran this comparison at scale across dozens of benchmarks: GRU and LSTM land within a point or two of each other on most tasks, and neither one consistently wins. That's a stronger finding than it sounds. It means the LSTM's extra machinery — the separate long-term drawer, the third gate, the fourth matrix — bought very little on the kind of problems people were actually throwing at sequence models. GRU was paying 75% of the compute bill for 99% of the performance. Same answers, fewer parts.

Reset gate r_t (personified)
I'm the selective amnesiac. When the input tells me context just flipped — end of a sentence, a topic pivot, a new speaker — I snap toward zero and the candidate gets drafted as if the old file never existed. The drawer is still there, untouched; I've just pulled the blinds on it for this one round of drafting. No direct analog in the LSTM — the forget gate erases memory, but I only ever hide it. After this step the old file is still available for the next gate to consult. I'm the lightweight piece of machinery GRU added to cover what the merged update gate couldn't.

Three implementations of the same algorithm, each shorter than the last. Pure Python walks the equations line by line on a single cell. NumPy vectorises across the hidden dimension and the batch. PyTorch lands you in nn.GRU, a one-line drop-in that runs fused kernels on the GPU. Same story as every other lesson in this course — see the mechanics, scale them up, cede them to the library.

layer 1 — pure python · gru_scratch.py
python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def dot(row, vec):
    return sum(a * b for a, b in zip(row, vec))

def gru_step(x_t, h_prev, W_r, U_r, b_r,
                          W_z, U_z, b_z,
                          W_h, U_h, b_h):
    # reset gate — "how much of the old file is still relevant?"
    r = [sigmoid(dot(W_r[i], x_t) + dot(U_r[i], h_prev) + b_r[i])
         for i in range(len(h_prev))]
    # update gate — "how much of the new memo replaces the old?"
    z = [sigmoid(dot(W_z[i], x_t) + dot(U_z[i], h_prev) + b_z[i])
         for i in range(len(h_prev))]
    # candidate — reset gate modulates how much of h_prev enters
    rh = [r[i] * h_prev[i] for i in range(len(h_prev))]
    h_cand = [tanh(dot(W_h[i], x_t) + dot(U_h[i], rh) + b_h[i])
              for i in range(len(h_prev))]
    # interpolate: (1 - z) keeps old, z writes new — one drawer updated
    h_new = [(1 - z[i]) * h_prev[i] + z[i] * h_cand[i]
             for i in range(len(h_prev))]
    return h_new

# Tiny run (weights abbreviated — see repo for full example)
# Returns h at every step, starting from zeros.
stdout
h at t=0: [0.1, -0.05, 0.02]
h at t=1: [0.18, -0.11, 0.07]
h at t=2: [0.26, -0.14, 0.11]

Every list comprehension above collapses into a matmul plus an elementwise op. The three gates become three batched matmuls; the interpolation line is one broadcasted expression. Vectorise it.

layer 2 — numpy · gru_numpy.py
python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    W_r, U_r, b_r = params['r']
    W_z, U_z, b_z = params['z']
    W_h, U_h, b_h = params['h']

    r = sigmoid(x @ W_r.T + h_prev @ U_r.T + b_r)          # (B, d_h)
    z = sigmoid(x @ W_z.T + h_prev @ U_z.T + b_z)          # (B, d_h)
    h_cand = np.tanh(x @ W_h.T + (r * h_prev) @ U_h.T + b_h)
    h_new  = (1 - z) * h_prev + z * h_cand                 # interpolation
    return h_new

# Roll forward over a sequence of length T
def gru_forward(X, h0, params):
    # X: (T, B, d_x)    h0: (B, d_h)
    h = h0
    hs = []
    for t in range(X.shape[0]):
        h = gru_step(X[t], h, params)
        hs.append(h)
    return np.stack(hs)                                    # (T, B, d_h)
stdout
h shape: (32, 128)     # batch of 32, hidden dim 128
param count: 74,112    # with d_x = 64, one bias per gate — vs LSTM's 98,816, a 25% savings
pure python → numpy
[sigmoid(dot(W_r[i], x) + ...) for i in ...]  ←→  sigmoid(x @ W_r.T + h @ U_r.T + b_r)

the whole gate in one matmul, batched over B examples at once

[r[i] * h_prev[i] for i in ...]  ←→  r * h_prev

the Hadamard product ⊙ — broadcast elementwise multiply

one hidden vector per call  ←→  shape (B, d_h) — the whole batch at once

no Python-level loop over batch elements

And the one-liner. nn.GRU mirrors nn.LSTM almost exactly — same constructor, same call signature, same forward pass — with one cosmetic difference: the returned state is a single tensor h_n, not the (h_n, c_n) tuple LSTM gives you. One drawer out, not two.

layer 3 — pytorch · gru_pytorch.py
python
import torch
import torch.nn as nn

seq_len, batch, d_x, d_h = 50, 32, 64, 128

gru  = nn.GRU(input_size=d_x, hidden_size=d_h, batch_first=False)
lstm = nn.LSTM(input_size=d_x, hidden_size=d_h, batch_first=False)

x = torch.randn(seq_len, batch, d_x)

# GRU: one hidden tensor
y, h_n = gru(x)                           # h_n: (num_layers, batch, d_h)

# LSTM (for comparison): two — hidden and cell
# y_l, (h_n_l, c_n_l) = lstm(x)

print("y shape:  ", y.shape)
print("h_n shape:", h_n.shape)

def count(m): return sum(p.numel() for p in m.parameters())
print("param count GRU :", count(gru))
print("param count LSTM:", count(lstm))
stdout
y shape:   torch.Size([50, 32, 128])    # (seq_len, batch, hidden)
h_n shape: torch.Size([1, 32, 128])     # (num_layers, batch, hidden)
param count GRU : 74496
param count LSTM: 99328                 # same shapes, GRU exactly 25% lighter
numpy → pytorch
for t in range(T): gru_step(...)  ←→  y, h_n = nn.GRU(...)(x)

the time loop is fused into the module — runs on GPU, supports cuDNN

3 gate matmuls per step, in Python  ←→  one fused kernel per step, in CUDA

cuDNN packs W_r, W_z, W_h into one GEMM — much faster in practice

return h only  ←→  y, h_n = gru(x)

y is all hidden states over time, h_n is just the last one

Gotchas

Return-shape mismatch. Code written for nn.LSTM crashes on nn.GRU the first time it tries to unpack (h, c). GRU has one drawer, not two. If you're swapping, grep your codebase for every h, c = ... and (h_n, c_n) and strip the cell-state half.

Forget-gate bias trick does NOT port over. The famous LSTM hack — initialise the forget-gate bias to +1 so the network defaults to remembering — has no direct GRU equivalent. GRU's update gate controls both retention and writing (they're tied by 1 − z), so a positive bias on z encourages writing, not keeping. The analog, if you want one, is a negative bias on z: that nudges (1 − z) toward 1 so the drawer defaults to pass-through. Most practitioners just leave it at zero and trust the optimizer.
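If you do want that pass-through default, here's one way to set it — a sketch, not a standard recipe. It relies on PyTorch's documented GRU gate ordering (reset, update, new), which puts the update-gate slice of each bias vector at [d_h : 2·d_h]:

```python
import torch
import torch.nn as nn

d_x, d_h = 64, 128
gru = nn.GRU(input_size=d_x, hidden_size=d_h)

# push z toward 0 at init, so (1 - z) starts near 1 and the
# drawer defaults to copying h_{t-1} forward
with torch.no_grad():
    for name, bias in gru.named_parameters():
        if name.startswith("bias"):          # bias_ih_l0 and bias_hh_l0
            bias[d_h:2 * d_h].fill_(-1.0)    # the update-gate chunk only
```

The −1.0 magnitude is arbitrary; the point is the sign and the slice.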

Reset gate placement. The reset gate only touches h_{t-1} inside the candidate calculation — r ⊙ h_{t-1} before multiplying by U_h. A classic from-scratch bug is to apply r outside, on the final h_t — that breaks the whole architecture and your model trains to nothing. Re-read the equations: r is a drafting aid, not an output gate.
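A scalar toy makes the bug visible (weights and values made up; with r = 0 the candidate should ignore history, but the state itself must survive the blend):

```python
import numpy as np

h_prev, x = 0.8, 0.5
W_h = U_h = 1.0
r, z = 0.0, 0.5

# correct: r gates h_prev inside the candidate only
h_cand = np.tanh(W_h * x + U_h * (r * h_prev))     # = tanh(0.5)
h_correct = (1 - z) * h_prev + z * h_cand          # old state survives

# bug: r applied to the output instead
h_cand_bug = np.tanh(W_h * x + U_h * h_prev)
h_buggy = r * ((1 - z) * h_prev + z * h_cand_bug)  # r = 0 wipes the state

print(h_correct)   # ≈ 0.631
print(h_buggy)     # 0.0 — the drawer is gone
```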

“GRU is always faster.” Fewer FLOPs per step, yes — but on modern GPUs cuDNN has highly optimized fused kernels for both. The wall-clock gap is usually smaller than the 25% parameter ratio suggests. Measure on your hardware before you bet on it.

Swap LSTM for GRU on the same task

Take the LSTM training script from the previous lesson — the one that classifies sequences or predicts the next token. Change nn.LSTM to nn.GRU, and fix the two or three places that unpack the state tuple (GRU returns one tensor, not two).

Run both for the same number of epochs, same optimizer, same learning rate. Record three numbers per model: parameter count, wall-clock training time, and final accuracy / loss. You should see:

  • GRU has ~25% fewer parameters — confirm with sum(p.numel() for p in model.parameters()).
  • GRU trains a bit faster per epoch, though less than 25% — cuDNN is good at both.
  • Final accuracy lands within a point or two of LSTM, in one direction or the other.
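The first bullet is a one-liner you can verify before training anything (dimensions here are arbitrary; with biases included the 3/4 ratio is exact):

```python
import torch.nn as nn

def count(m):
    return sum(p.numel() for p in m.parameters())

d_x, d_h = 64, 128
ratio = count(nn.GRU(d_x, d_h)) / count(nn.LSTM(d_x, d_h))
print(ratio)   # 0.75
```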

Bonus: repeat with a very long sequence (seq_len > 500) and a short one (seq_len < 50). See if LSTM claws back a win on the long one, as Jozefowicz et al. predicted — the separate cabinet drawer earning its keep on hard, long-range memory.

What to carry forward. GRU is the LSTM's filing cabinet with one drawer instead of two, one fewer gate, one fewer weight matrix — the streamlined IKEA version. The update gate does the interpolation between old drawer and candidate; the reset gate decides how much of the old contents informs the draft. The parameter savings are real (~25%) and the performance delta on typical tasks is negligible in either direction. In practice you try both, and most of the time the winner is within noise of the loser. Same answers, fewer parts.

End of the RNN & LSTM section. You've now built — from scratch — a plain RNN, derived backprop through time, watched its gradient die in the vanishing gradient regime, and met the two gated architectures (LSTM, GRU) that fixed it well enough to power a decade of NLP and speech systems. Every one of these survives in a latency-sensitive corner of some real product shipping right now. The mainstream, though, has moved on.

Next up — NLP. We've squeezed memory down to the minimum with GRU — one drawer, two gates, same answers, fewer parts. But the whole time we've been hand-waving the inputs as x_t vectors falling from the sky. The next section confronts what we're actually feeding these cells. Words. And that turns out to be its own problem — one that unlocks everything from word2vec to attention to the transformer. Kick off with Intro to NLP: tokens, vocabularies, and how you turn a page of English into something a neural net can differentiate through.

References

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014.

Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NeurIPS 2014 Deep Learning Workshop.

Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015.