LSTM
Gates, cell state, and the first real fix for long memory.
Last lesson the RNN's memory got whispered down a long chain of clerks, and by clerk fifty the message was gibberish. The vanishing gradient wasn't a training hiccup — every step in BPTT multiplied the signal by W · tanh'(·), and both factors made it smaller. No learning rate fixes that. The architecture itself has to change.
So picture a different office. Instead of one clerk trying to carry every memory in their head, put a filing cabinet in the room. A long drawer running alongside the whole hallway of clerks. Give the cabinet three gates — three small employees who together decide what gets written, what gets kept, and what gets read aloud to the next room. That cabinet, riding quietly past every clerk, is the thing Sepp Hochreiter and Jürgen Schmidhuber added to RNNs in 1997. They called it the Long Short-Term Memory network.
The LSTM was the dominant sequence model for two decades — speech recognition, handwriting, machine translation, the first neural language models that worked at scale. Google Translate was an LSTM stack as late as 2016. Every speech-to-text system on every phone until about 2018 ran one of these. The reason is the filing cabinet: a dedicated drawer where information can sit unchanged across hundreds of time steps, with three gates controlling what goes in, what stays, and what comes out.
The vanilla RNN tried to hold every memory in its hands while running the hallway. I installed a filing cabinet. A clerk files the important bits, a shredder trims the stale bits, and another clerk reads the relevant drawer aloud to the next room. The paper never gets squashed in transit — that's why my gradient survives the trip.
Before the equations, the floor plan. An LSTM cell takes three things in: the input at this time step x_t, the previous hidden state h_{t-1} (what the last room shouted over the wall), and the previous cell state c_{t-1} (the drawer rolling in from the last room). It produces two things out: the new drawer contents c_t and the new shouted summary h_t. Inside, four tiny networks stand at the cabinet — three gates and one candidate note to file.
┌───────────── c_t ──────────────►
│ (the cabinet drawer, rolling through)
c_{t-1} ──►──┤ × ──── + ────────┬─────────────►
│ │ │
f_t i_t·g_t │
▲ ▲ │
│ │ tanh
│ │ │
│ │ × ◄──── o_t
│ │ │
│ │ ▼
│ │ h_t ────────────►
│ │ (what the clerk reads aloud)
[ h_{t-1}, x_t ] ───┴─► W_f, W_i, W_g, W_o ─┘
f_t forget gate = the shredder (sigmoid)
i_t input gate = the intake clerk (sigmoid)
g_t candidate = the note they want to file (tanh)
o_t  output gate  = the clerk reading aloud      (sigmoid)

Four small networks, one per role, each reading the same concatenated input [h_{t-1}, x_t] — what the last room shouted, plus the new paperwork arriving at the door. Three of them are sigmoid gates producing numbers in (0, 1) — soft on/off switches, a dial from “fully closed” to “wide open.” The fourth is a tanh that proposes an actual value in (-1, 1) — the note the intake clerk wants to drop into the drawer.
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) ← shredder (0,1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) ← intake clerk (0,1)
g_t = tanh(W_g · [h_{t-1}, x_t] + b_g) ← the note to file (-1,1)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)     ← hallway reader (0,1)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t         ← update the drawer
h_t = o_t ⊙ tanh(c_t)                   ← read out to the hallway

Read c_t aloud as the office scene. “Take yesterday's drawer, run it past the shredder (keep what you want, shred the rest), then add the intake clerk's new note.” That plus sign is the entire trick. A vanilla RNN had a multiply where the filing cabinet has an add — and the derivative of an add is 1, so the signal rolls through cleanly.
And h_t: “Squash the drawer contents through tanh to bound them, then let the output clerk decide what to read aloud.” The cell state holds the whole file; the hidden state is a filtered, bounded excerpt — what the clerk chose to shout into the hallway. Two channels, two jobs.
Pull the gate sliders. Each gate is a single number in its allowed range — set them by hand, feed in an x_t and a h_{t-1}, and watch c_t and h_t get assembled piece by piece. The widget is the five lines of math above, made physical.
Three settings to try, because each one is a different office scene. Shredder off, intake closed (f_t = 1, i_t = 0): the drawer is perfectly preserved. Whatever was in c_{t-1} becomes c_t unchanged. Nobody touched the cabinet this step — a pure pass-through. Shredder on, intake open (f_t = 0, i_t = 1): completely dump the old drawer and refile with the new candidate note. Useful when the sequence hits a hard boundary — a sentence ended, a new clause begins. Output gate closed (o_t = 0): the hidden state is zero even though the drawer may be full of useful paper. The network is remembering but keeping quiet — not telling the hallway yet.
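The three scenes fall straight out of the two update equations. A minimal scalar sketch (hidden size 1, gate values set by hand rather than learned, so each scene can be staged exactly):

```python
import math

def cell_update(f, i, g, o, c_prev):
    """The two drawer equations with hand-set scalar gates (hidden size 1)."""
    c = f * c_prev + i * g       # shred, then file the new note
    h = o * math.tanh(c)         # read out to the hallway
    return c, h

# 1) shredder off, intake closed: pure pass-through
c, h = cell_update(f=1.0, i=0.0, g=0.9, o=1.0, c_prev=0.7)
print(c)        # 0.7 — the drawer is untouched

# 2) shredder on, intake open: dump and refile
c, h = cell_update(f=0.0, i=1.0, g=0.9, o=1.0, c_prev=0.7)
print(c)        # 0.9 — the drawer now holds the candidate note

# 3) output gate closed: remembering but keeping quiet
c, h = cell_update(f=1.0, i=0.0, g=0.9, o=0.0, c_prev=0.7)
print(c, h)     # 0.7 0.0 — memory intact, nothing read aloud
```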
I stand between the incoming drawer and the outgoing drawer with a paper shredder. A value of 1 means the page goes through untouched. A value of 0 means confetti. The rest of the network learns when I should fire. Default me to “leave it alone” on day one — I'll explain why in the gotchas.
I'm the clerk at the filing cabinet's front desk. The tanh next to me has already written a candidate note — my job is to decide how much of that note actually gets dropped into the drawer. Near 1, I file it. Near 0, I toss it in the recycling. Pair of scissors for situational commitment.
I read the drawer aloud to the next room. Not everything in the cabinet is relevant right now — so I cherry-pick. The cell state is the whole archive. The hidden state is me, clearing my throat and reading the bits that matter for this decision.
Here is the payoff. In the vanilla RNN the per-step gradient factor was ∂h_t / ∂h_{t-1} = diag(tanh') · W — a full matrix multiplied by a squashing derivative, the two ingredients in the telephone-chain decay. Now repeat that calculation for the filing cabinet.
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
∂c_t / ∂c_{t-1} = f_t   (+ small terms, since f_t, i_t, g_t depend on h_{t-1})
Over T steps, treating the gates as approximately piecewise-constant:
∂c_T / ∂c_0 ≈ f_T ⊙ f_{T-1} ⊙ ... ⊙ f_1

Compare the two. The RNN's per-step factor is a full matrix times a squashing derivative. The filing cabinet's per-step factor along the drawer is just a diagonal — the shredder — with nothing squashing on top of it. If the shredder stays near 1 (leave the paper alone), the gradient through the drawer survives essentially indefinitely. If the shredder drifts toward zero, yes, the gradient shrinks along that dimension — but only because the network chose to forget that piece. The vanilla RNN forgot by accident. The LSTM forgets on purpose.
The full gradient has more paths than the clean f_t product above — the gates themselves depend on h_{t-1}, which depends on c_{t-1}, so there are side channels. But the dominant term is the one written, and in practice it carries gradient over hundreds of steps where the vanilla RNN dies inside of twenty.
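The “approximately piecewise-constant” assumption is easy to check numerically: freeze the gates, roll the drawer update forward, and compare a finite-difference derivative of c_T with respect to c_0 against the product of forget gates. Sizes and gate ranges here are illustrative, not from the lesson's widget:

```python
import numpy as np

# Freeze the gates and check ∂c_T/∂c_0 against prod(f) by finite differences.
rng = np.random.default_rng(1)
T = 30
f = rng.uniform(0.9, 1.0, size=T)        # shredder near "leave it alone"
write = rng.uniform(-0.1, 0.1, size=T)   # the i_t ⊙ g_t contribution, held fixed

def roll_drawer(c0):
    """Roll c_t = f_t * c_{t-1} + (i_t * g_t) forward for T steps."""
    c = c0
    for t in range(T):
        c = f[t] * c + write[t]
    return c

eps = 1e-6
grad_fd = (roll_drawer(1.0 + eps) - roll_drawer(1.0 - eps)) / (2 * eps)
print(grad_fd, np.prod(f))   # the two agree: the drawer path carries prod(f)
```

With the gates frozen the drawer update is linear in c_0, so the finite difference recovers the product of forget gates essentially exactly; the side channels only appear once the gates are allowed to move.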
Let's watch the drawer hold a page across the whole hallway. Fifty time steps, one bit of information presented at step 5 with a “remember this” marker. The network's job is to keep that bit alive in the cabinet until step 50, when a recall cue asks for it. Scrub through time and watch the cell state.
Notice how the trained LSTM latches the bit. When the marker arrives at step 5, the intake clerk opens briefly — one file drop. The shredder then sits at 1 for the rest of the sequence (leave the drawer alone). The hallway reader stays quiet until step 50, when it finally reads the drawer out loud. The cell state at step 49 is effectively identical to its value at step 6. Fifty steps of perfect retention, trained with plain backprop through time. A vanilla RNN handed the same task cannot solve it — its gradient dies long before the loss can teach it what to hold.
Three passes at the filing cabinet, each shorter than the last. Pure Python is the equations line-for-line on a single time step — the cabinet drawn by hand. NumPy vectorises across the hidden dimension and collapses the four gates into one fused matmul. PyTorch drops the whole thing into nn.LSTM with packed sequences — the version you'll actually ship.
import math
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def tanh(x): return math.tanh(x)
def lstm_cell(x, h_prev, c_prev, W, b):
"""One step past the filing cabinet. Hidden size 2. W is a dict of 4 gate matrices."""
# concatenate last room's shout with the new paperwork at the door
z = [*h_prev, *x] # [h_{t-1}; x_t]
# three gates + one candidate note, one dot product each
def gate(Wg, bg, act):
return [act(sum(wij * zj for wij, zj in zip(row, z)) + b)
for row, b in zip(Wg, bg)]
f = gate(W["f"], b["f"], sigmoid) # shredder (forget gate)
i = gate(W["i"], b["i"], sigmoid) # intake clerk (input gate)
g = gate(W["g"], b["g"], tanh) # note to file (candidate)
o = gate(W["o"], b["o"], sigmoid) # hallway reader (output gate)
# drawer update — the two lines everything rides on
c = [fi * cp + ii * gi for fi, cp, ii, gi in zip(f, c_prev, i, g)]
h = [oi * tanh(ci) for oi, ci in zip(o, c)]
return h, c

t=1  c=[0.31, -0.12]  h=[0.22, -0.08]
t=2  c=[0.47, 0.05]   h=[0.33, 0.03]
Vectorise. The four gates collapse into one fused matmul of 4× the output size, split at the end. This is how every production LSTM kernel actually ships — one big matrix multiply beats four small ones on hardware every time.
import numpy as np
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def lstm_cell(x, h_prev, c_prev, W, b):
"""
Fused filing cabinet. W has shape (4h, d + h); b has shape (4h,).
We split the 4h-dimensional pre-activation into the three gates + candidate at once.
"""
z = np.concatenate([h_prev, x], axis=-1) # (d + h,)
pre = W @ z + b # (4h,)
f, i, g, o = np.split(pre, 4, axis=-1) # each (h,) — one slice per role
f = sigmoid(f) # shredder
i = sigmoid(i) # intake clerk
g = np.tanh(g) # candidate note
o = sigmoid(o) # hallway reader
c = f * c_prev + i * g # drawer update — the magic line
h = o * np.tanh(c)
return h, c
# rolling the cabinet through the whole hallway
def lstm_forward(X, W, b, h_dim):
T = X.shape[0]
h = np.zeros(h_dim); c = np.zeros(h_dim)
H = np.zeros((T, h_dim))
for t in range(T):
h, c = lstm_cell(X[t], h, c, W, b)
H[t] = h
return H

Pure Python → NumPy, what changed:
- four separate gate() calls ←→ one fused matmul, np.split into four — hardware loves one big GEMM over four small ones
- list comprehensions for elementwise ops ←→ f * c_prev + i * g (broadcasted) — batch and hidden dims come free once you are in NumPy
- loops over hidden units inside the cell ←→ only the time loop remains; everything else is vectorised — the hallway stays sequential; the cabinet itself is parallel
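One way to convince yourself the fused layout is safe: a single (4h, d+h) matmul split into four slices gives exactly the same pre-activations as four separate (h, d+h) matmuls. A self-contained check with random weights (toy sizes, names illustrative):

```python
import numpy as np

# Fused vs. separate gate pre-activations: they must match.
rng = np.random.default_rng(0)
d, h = 3, 4
W = rng.standard_normal((4 * h, d + h))   # four gate matrices stacked row-wise
b = rng.standard_normal(4 * h)
z = rng.standard_normal(d + h)            # [h_{t-1}; x_t]

fused = np.split(W @ z + b, 4)            # one big GEMM, sliced afterwards
separate = [Wk @ z + bk
            for Wk, bk in zip(np.split(W, 4, axis=0), np.split(b, 4))]
for a, s in zip(fused, separate):
    assert np.allclose(a, s)              # identical results, one kernel launch
print("fused == separate")
```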
PyTorch packs all four roles, handles batching, runs on GPU, and — critically — supports packed sequences: the standard way to feed variable-length batches through the cabinet without wasting compute on padding tokens. The whole cell becomes a one-liner.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
class LstmTagger(nn.Module):
def __init__(self, vocab, embed=64, hidden=128, n_tags=10):
super().__init__()
self.embed = nn.Embedding(vocab, embed)
self.lstm = nn.LSTM(embed, hidden, batch_first=True) # all four roles, fused
self.head = nn.Linear(hidden, n_tags)
# GOTCHA PREVIEW: tell the shredder to leave the drawer alone on day one.
# PyTorch packs biases as [b_i, b_f, b_g, b_o]; the hidden:2*hidden slice is forget.
# Both bias_ih and bias_hh use this layout, so the loop sets an effective forget bias of 2.
for name, p in self.lstm.named_parameters():
if "bias" in name:
p.data.fill_(0.0)
n = p.shape[0] // 4
p.data[n:2 * n].fill_(1.0) # forget-gate bias = 1 → default to remember
def forward(self, x, lengths):
emb = self.embed(x) # (B, T, E)
packed = pack_padded_sequence(emb, lengths.cpu(),
batch_first=True, enforce_sorted=False)
out, (h_T, c_T) = self.lstm(packed) # packed -> packed
out, _ = pad_packed_sequence(out, batch_first=True) # (B, T, H)
return self.head(out), (h_T, c_T)
model = LstmTagger(vocab=5000)
x = torch.randint(0, 5000, (32, 7)) # (batch=32, max_len=7)
lengths = torch.tensor([7, 6, 6, 5, 5, 4, 3, 3, 2, 2] + [7]*22)
logits, (h_T, c_T) = model(x, lengths)
print("output shape: ", logits.shape)
print("final h_T: ", h_T.shape)
print("final c_T: ", c_T.shape)

output shape:  torch.Size([32, 7, 10])    # (batch, time, n_tags)
final h_T:     torch.Size([1, 32, 128])
final c_T:     torch.Size([1, 32, 128])
Hand-rolled NumPy → PyTorch, what changed:
- hand-written time loop ←→ nn.LSTM(..., batch_first=True) — CUDA-optimised fused kernel, orders of magnitude faster
- ragged sequences → mask or pad manually ←→ pack_padded_sequence + pad_packed_sequence — no compute wasted on padding positions
- autograd for backprop through time ←→ loss.backward() just works — the messy gradient chains from earlier are handled for you
A quick tour of cabinet variants you'll see in the wild. Most don't matter much in 2026, but they show up in papers and legacy codebases:
- Peephole connections. Gers & Schmidhuber (2000) let each gate peek directly at c_{t-1} — the gates read the drawer, not just the hallway shout. Small win on some tasks, extra parameters. Mostly gone.
- Coupled input-forget. Enforce i_t = 1 − f_t — the intake clerk and shredder share one dial. One fewer gate to learn, cheaper cell. Shows up in GRU-adjacent designs (foreshadowing).
- Bidirectional LSTMs. Run two cabinets — one rolling forward down the hallway, one rolling backward — and concatenate their hidden states. Standard for tagging tasks where the whole sequence is available at once (NER, POS, speech recognition). Not useful for streaming generation.
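For the bidirectional variant, a minimal shape check in the same PyTorch stack as above (sizes are arbitrary toy values) shows the doubled feature dimension on the output:

```python
import torch
import torch.nn as nn

# Two cabinets, one per direction, read-outs concatenated along features.
bi = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(4, 12, 8)                 # (batch, time, features)
out, (h_T, c_T) = bi(x)
print(out.shape)   # torch.Size([4, 12, 32]) — forward ++ backward hidden states
print(h_T.shape)   # torch.Size([2, 4, 16]) — one final state per direction
```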
Shredder bias at zero is a silent killer. A freshly-initialised LSTM with zero biases has σ(0) = 0.5 forget gates — which means every step the drawer loses half its contents. After twenty steps that's 0.5²⁰ ≈ 10⁻⁶: you just reintroduced the vanishing problem you paid four weight matrices to fix. The fix, from Jozefowicz et al. 2015, is to initialise the forget-gate bias to 1 (or higher). Then σ(1) ≈ 0.73 and the cabinet defaults to “leave the paper alone.” Not every framework does this by default — check yours.
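The numbers behind the gotcha, for a drawer nobody is actively writing to: the gate sits at its resting value σ(bias), and retention over T steps is that value raised to the T-th power. A quick table (bias values chosen for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

T = 20
for bias in (0.0, 1.0, 2.0):
    f = sigmoid(bias)       # resting forget-gate value with zero input
    print(f"bias={bias}: resting gate {f:.2f}, "
          f"retention after {T} steps {f ** T:.1e}")
```

At bias 0 the drawer keeps about one part in a million after 20 steps; at bias 1 retention improves by three orders of magnitude, which is the entire point of the Jozefowicz et al. trick.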
Packed vs padded sequences. PyTorch's nn.LSTM accepts both, but feeding padded tensors means the cabinet rolls through padding positions doing real compute — slower and, if you forget to mask the output, wrong. Use pack_padded_sequence for any batch with variable lengths.
Hidden state vs cell state confusion. h_t is what the hallway reader said aloud — what the next layer sees and what you decode from. c_t is the drawer itself — the memory channel that doesn't leave the cabinet directly. If you pass a sequence with hidden=None, PyTorch zeros both h_0 and c_0; if you carry state between chunks (stateful LSTM), carry both. Mixing them up is the single most common LSTM bug.
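Carrying state between chunks is where the two-channel confusion usually bites. A sketch of stateful, truncated-BPTT chunking with nn.LSTM, under the assumption that the stream is one long sequence split into fixed-size chunks:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
stream = torch.randn(2, 100, 8)        # one long sequence, processed in chunks
state = None                           # None → PyTorch zero-initialises BOTH h_0 and c_0
for chunk in stream.split(25, dim=1):  # four chunks of 25 steps
    out, state = lstm(chunk, state)
    # carry BOTH tensors forward, detached so BPTT stops at the chunk boundary
    state = tuple(s.detach() for s in state)
```

Dropping c_T and carrying only h_T silently empties the drawer at every chunk boundary, which is exactly the bug described above.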
Gradient clipping is still needed. The filing cabinet solves vanishing gradients. It does not solve exploding gradients. Clip the gradient norm at 1–5 before calling optimizer.step(), or watch one spike wipe your weights.
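A sketch of where the clipping call sits in a PyTorch training step, between backward() and the optimiser step (model and loss here are placeholders):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
opt = torch.optim.Adam(lstm.parameters())

out, _ = lstm(torch.randn(4, 25, 8))
loss = out.pow(2).mean()               # stand-in loss for the sketch

opt.zero_grad()
loss.backward()
# rescale the global gradient norm to at most 1.0; returns the pre-clip norm
pre_clip = torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=1.0)
opt.step()
```

Watching the returned pre-clip norm over training is also a cheap explosion detector: a sudden spike there is the event this gotcha is warning about.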
Generate sequences of length 100. The first 10 positions hold a random pattern of symbols (0–9), the next 89 are a blank filler, and the final position is a “recall” marker. Target at the marker: the first symbol of the pattern. The network has to remember position 0 across 99 time steps.
Train two models: nn.RNN and nn.LSTM, same hidden size, same optimiser, same step count. Plot the loss curves.
What you'll see: the vanilla RNN plateaus at chance — its gradient dies before the marker's signal can reach position 0 to teach it anything. The filing cabinet solves the task cleanly. This is not a subtle gap — it's 90% accuracy versus coin-flip — and it's exactly the failure Hochreiter and Schmidhuber set out to fix.
Bonus: remove the forget-bias-of-one trick from your LSTM. How much slower does it train? Sometimes it doesn't converge at all. One-line bug — now you'll spot it at a glance.
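The data generator for this exercise is small enough to sketch. Token ids are my choice, not fixed by the task description: symbols 0–9, blank = 10, recall marker = 11.

```python
import numpy as np

def make_recall_batch(batch=32, seq_len=100, pattern_len=10, vocab=10, seed=0):
    """Recall task: 10 pattern symbols, 89 blanks, one recall marker.
    Target at the marker: the symbol that sat at position 0."""
    rng = np.random.default_rng(seed)
    blank, marker = vocab, vocab + 1
    x = np.full((batch, seq_len), blank, dtype=np.int64)
    x[:, :pattern_len] = rng.integers(0, vocab, size=(batch, pattern_len))
    x[:, -1] = marker                 # the recall cue at the final step
    y = x[:, 0].copy()                # remember position 0 across ~99 steps
    return x, y

x, y = make_recall_batch()
print(x.shape, y.shape)    # (32, 100) (32,)
```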
What to carry forward. The LSTM is two pieces of furniture: a cell state that carries memory through time on an additive path — the filing cabinet drawer, with no matrix and no squashing derivative in its way, so the gradient survives — and three gates that decide what to shred, what to file, and what to read aloud. That addition is why the cabinet survives the whispered telephone chain the RNN couldn't: every clerk along the hallway can open the drawer without rewriting its contents. The idea outlived the architecture. Residual connections are the same move applied to depth. Transformer residual streams are the same move applied to layers. State-space models are the same move with continuous dynamics. All of them are descendants of one realisation: give gradient a highway with a derivative of 1.
Next up — GRU. Three gates and two states works — but look at the roles. The shredder and the intake clerk are almost mirror images: one says “let the old stuff through,” the other says “let the new stuff in.” Are they really doing distinct jobs, or could we collapse them into one dial? In 2014 Cho et al. tried exactly that. The Gated Recurrent Unit is the streamlined cabinet — two gates, one state, roughly the same empirical performance on most tasks. We'll derive it, compare parameter counts side-by-side, and figure out when the lighter version wins.
- [01]Hochreiter, Schmidhuber · Neural Computation 1997 — the original paper
- [02]Gers, Schmidhuber, Cummins · Neural Computation 2000 — introduced the forget gate
- [03]Jozefowicz, Zaremba, Sutskever · ICML 2015 — forget-bias-of-one and variant search
- [04]Andrej Karpathy · blog post, 2015 — the character-level LSTM that launched a thousand demos
- [05]Zhang, Lipton, Li, Smola · d2l.ai