Q-Learning
Learn a value function from experience, one update at a time.
Imagine you're mid-game on a giant candy-grabbing board. You're standing on some square, four moves are legal, and you want the one that eventually hands you the most candy — not right now, but by the end of the game. What you would kill for is a cheat sheet — a massive lookup table with one row per square and one column per move, and inside each cell a single number: “from square X, if I take action A and then play smart forever after, the total candy I eventually collect is this much.”
Call that number Q(s, a) and that table the Q-table. If you had it, the game would be trivial: look up the row for your current square, scan the columns, pick the cell with the biggest number, play that move. Repeat. You'd be optimal. You'd also be cheating, because of course you don't have the table — nobody handed it to you.
Q-learning is the algorithm that fills in the cheat sheet by playing. You start with every entry scribbled at zero. You take actions, land in new squares, bank rewards. Each transition nudges exactly one cell — one row, one column — a little closer to truth. Do this ten million times and the lookup table becomes the lookup table you wish you'd had at the start.
So we flip the question. Instead of planning with a known model, can we learn the optimal action-value function directly, by doing? That is the promise of model-free reinforcement learning, and the canonical algorithm is Q-learning — Chris Watkins' 1989 thesis, still one of the most widely used ideas in the field. It is four lines of code. It converges to the optimal policy. It does not need the model. It is the algorithm you implement first and the algorithm DeepMind scaled to Atari twenty-four years later.
You do not know how the world works. That is fine. Take an action. See what happens. Nudge the entry in your cheat sheet toward what you just experienced. Do this ten million times. At the end you have a policy that is provably optimal. I require no map, no oracle, no teacher — just honest samples, a little patience, and one well-behaved table.
Here's the cheat sheet in detail. Rows are states, columns are actions, and each cell holds your current estimate of Q*(s, a) — the expected discounted return if you take action a in state s and then behave optimally forever after. You start the table at zero (or anywhere, really — it all washes out). Each time you act in the world you see a transition (s, a, r, s'). Then you apply exactly one update — to exactly one cell in the table, the one in row s and column a. That surgical locality is the point.
Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]
          └─────┘         └─────────────────────┘
          old cell        TD target (what this entry should be now)

Read it from the inside. The bracket is a temporal-difference error — the gap between the old entry Q(s, a) and a fresher, data-informed estimate r + γ · max_a' Q(s', a'). The fresher number uses one real sample (the reward r) and one bootstrapped guess (the best entry in the next row, row s'). You drag the old cell a step α toward the new one. That is it. That is Q-learning, and it's the Bellman equation from the MDP lesson reshaped into a sampled, per-cell correction instead of a full planning sweep.
Three things are happening at once, and they are all load-bearing:
- Bootstrapping. We use the current cheat sheet to estimate future returns while we are still filling it in. This is circular but it works, because every update pulls each entry a little closer to truth, and the errors average out down the table.
- The max over the next row. We score the next state by the best cell in row s', regardless of what we actually did next. This is why Q-learning learns Q*, not Q of whatever noisy policy we happened to be using.
- Step size α. Small α = slow, stable, the cell converges smoothly. Large α = fast, bouncy. Typical values: 0.1, 0.01. Same intuition as any SGD-style method — the update rule even looks like gradient descent on a single-cell loss, because that's basically what it is.
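To see the arithmetic once, here is the update run by hand: a minimal sketch with made-up numbers (a 2-state, 2-action table, α = 0.1, γ = 0.9), not part of the gridworld below.

```python
alpha, gamma = 0.1, 0.9

# Toy cheat sheet: 2 rows (states) x 2 columns (actions), mostly blank.
Q = [[0.0, 0.0],
     [0.5, 2.0]]  # row 1 already has some value filled in

# One observed transition: action 0 in state 0 paid reward 1.0, landed in state 1.
s, a, r, s_next = 0, 0, 1.0, 1

td_target = r + gamma * max(Q[s_next])  # 1.0 + 0.9 * 2.0 = 2.8 (best cell in next row)
td_error = td_target - Q[s][a]          # 2.8 - 0.0 = 2.8
Q[s][a] += alpha * td_error             # cell (0, 0) moves from 0.0 to 0.28

print(round(Q[s][a], 4))  # 0.28, and every other cell is untouched
```

Note the bootstrap: the 2.0 being trusted here is itself just the current guess in row 1. If that guess is wrong, this update inherits some of the wrongness, and later updates wash it out.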
Here is a 5x5 gridworld with a goal cell (+1) and a pit (−1). The agent starts somewhere random, runs episodes, and you watch the cheat sheet fill in. Each cell on the grid shows the max Q-value at that state as a heatmap — think of it as peeking at the brightest entry in each row of the lookup table. Early episodes produce scattered, noisy estimates. Keep stepping. Watch the goal's value propagate backward, one cell per episode, until every row knows the way home.
α controls how much of the TD error we absorb into Q. γ controls how far the bootstrap peeks ahead. Watch the (s,a) cell drift toward the best neighbor of s'.
This backward propagation of value is the entire vibe of TD learning. The goal cell's row knows its value first (it collects a real reward). Its neighbors' rows learn about it through the max term — their update looks one step ahead, finds a brighter cell, and pulls their own entry upward. Their neighbors learn about those. Every episode nudges information one more cell outward from the source. After enough episodes, every row in the lookup table knows — through the chain of bootstrapped maxes — how to get to the goal.
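You can watch that outward crawl in a few lines. This is a toy sketch, not the lesson's gridworld: a 1-D corridor of four non-terminal states, a single "move right" action (so the Q-table collapses to one column), and a +1 reward only on entering the terminal square.

```python
alpha, gamma = 0.5, 0.9
V = [0.0] * 4  # one Q-cell per non-terminal state, since there is only one action

for episode in range(20):
    for s in range(4):  # walk the corridor left to right
        at_goal_next = (s + 1 == 4)
        r = 1.0 if at_goal_next else 0.0
        bootstrap = 0.0 if at_goal_next else V[s + 1]  # best (only) cell in the next row
        V[s] += alpha * (r + gamma * bootstrap - V[s])
    if episode == 0:
        print([round(v, 2) for v in V])  # [0.0, 0.0, 0.0, 0.5]: only the cell beside the goal moved

print([round(v, 2) for v in V])  # ≈ [0.73, 0.81, 0.9, 1.0]: value has crawled all the way back
```

Each sweep carries the goal's value exactly one more square leftward, and the final ≈ 0.9^k pattern is the discount stacking up with distance from the reward.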
I am the cheat sheet, assembled one honest guess at a time. Ask me Q(s, a) and I'll point you at row s, column a, and read out the number — how much return to expect if you take that entry and then behave like I told you to. I start out wrong about every cell. Every transition you feed me makes one of my entries slightly less wrong. Eventually argmax-ing my rows gives you the optimal policy, no planner required.
Here's the catch nobody warns you about. The cheat sheet has thousands of cells. If your behavior policy always picks the brightest entry in the current row, you will revisit the same handful of cells forever — and the rest of the table stays at zero, because you never visit those rows. You can't fill in an entry you never touch. The algorithm is mathematically obligated to sometimes guess.
That's the exploration vs exploitation trade-off. Exploit means “play the best cell in the current row” — cash in on what the table already claims. Explore means “pick a random column” — visit a cell you have low confidence in, collect a fresh sample, improve that entry. If we only exploit, the table freezes around whatever looked good first. If we only explore, we waste every step on flailing and the table is accurate but useless. The standard compromise is ε-greedy.
π(a | s) = { random action        with probability ε
           { argmax_a Q(s, a)     with probability 1 − ε
             └─ pick the brightest cell in row s ─┘

schedule: ε_t = max(ε_min, ε_0 · decay^t)   typical: ε_0 = 1.0, ε_min = 0.05

Flip a biased coin. With probability ε, do something random to find out if any untouched entry is better than what the table currently claims. With probability 1 − ε, cash in on the brightest cell in the row. Early in training ε should be high — the table is blank, you know nothing, so exploit nothing. Late in training, ε should be low — the table is mostly filled in and the brightest entries are probably right. The schedule is a hyperparameter and, as we are about to see, it matters a lot.
Three schedules, one task. Constant ε = 0.1 is a common textbook default — it learns fast initially, then plateaus because it keeps throwing away 10% of its actions on random noise forever. Decaying ε starts wide and narrows to a small floor — best of both worlds in practice. Decay to zero looks tempting (pure exploitation at the end!) but is dangerous: if the agent latches onto a suboptimal row of cells early, it can never escape because it stops exploring entirely — the entries in the unvisited rows remain forever at their initial zero. The practical answer almost always: decay to a small non-zero floor.
I am the knob that trades curiosity for confidence. Turn me up and your agent wanders into unexplored rows — useful when the cheat sheet is mostly blank. Turn me down and your agent commits to the brightest cell it knows — useful when the table is mostly filled in. Turn me to zero and your agent freezes its opinions forever; hope the unvisited cells weren't hiding anything good. The art of reinforcement learning is mostly tuning me over time.
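The three schedules from the comparison above can be written down in a few lines. This is a sketch with illustrative constants, not tuned values:

```python
def eps_constant(t, eps=0.1):
    return eps                            # forever spends 10% of actions on random noise

def eps_decay_floor(t, eps0=1.0, decay=0.995, floor=0.05):
    return max(floor, eps0 * decay ** t)  # wide early, narrow late, never zero

def eps_decay_to_zero(t, eps0=1.0, decay=0.995):
    return eps0 * decay ** t              # tempting, but can freeze a bad policy forever

for t in (0, 200, 1000):
    print(t, round(eps_decay_floor(t), 3))  # 1.0 at the start, ~0.367 at t=200, pinned at the 0.05 floor
```

The floor is the safety net: even a fully trained agent keeps a trickle of random actions, so a wrong entry in the table still has some chance of being revisited and corrected.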
Q-learning is famously short. Three layers, same algorithm, each one a little more production-ready than the last. Pure Python on a gridworld with a literal 2D list as the lookup table, NumPy with a replay buffer and the table as an array, PyTorch with a neural network standing in for the table — the last of which is DQN, and the bridge between the tabular toy and modern deep RL.
import random
# 5x5 gridworld; state = (row, col); 4 actions: up, down, left, right.
# Q is our cheat sheet: 25 rows (one per square), 4 columns (one per move).
N_STATES, N_ACTIONS = 25, 4
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)] # every cell starts at zero
alpha, gamma = 0.1, 0.95
eps, eps_min, eps_decay = 1.0, 0.05, 0.995
def step(s, a):   # pretend-environment; returns (s', r, done)
    ...           # mechanical gridworld transitions

for episode in range(1000):
    s = env_reset()
    done = False
    while not done:
        # ε-greedy action selection — explore a random column, or exploit the row's best cell
        if random.random() < eps:
            a = random.randrange(N_ACTIONS)                   # explore: random entry in row s
        else:
            a = max(range(N_ACTIONS), key=lambda a: Q[s][a])  # exploit: brightest cell
        s_next, r, done = step(s, a)
        # Q-learning update — the one line that matters.
        # Nudge the (s, a) cell toward "reward + γ * best cell in next row".
        td_target = r + (0 if done else gamma * max(Q[s_next]))
        Q[s][a] = Q[s][a] + alpha * (td_target - Q[s][a])
        s = s_next
    eps = max(eps_min, eps * eps_decay)  # decay exploration each episode

episode 0: return = -1.00 ε=0.90
episode 200: return = 1.00 ε=0.37
episode 500: return = 1.00 ε=0.06
learned policy:
→ → → → G
↑       ↑
↑ (pit) ↑
↑ ←     ↑
↑ ← ← ← ←

Now scale the two things that break in pure Python: batched updates and off-policy reuse of data. The idea of a replay buffer is simple — stash every transition you see in a ring buffer, and on each training step sample a random minibatch from it. You decorrelate consecutive samples (which are highly correlated in time), and you reuse every transition many times (which is how you actually learn efficiently from limited data). The cheat sheet becomes a NumPy array, and each step updates 64 cells at once.
import random
import numpy as np
from collections import deque

Q = np.zeros((N_STATES, N_ACTIONS))  # the cheat sheet, as an array
buffer = deque(maxlen=50_000)        # fixed-size replay memory
alpha, gamma, eps = 0.1, 0.95, 0.1   # eps held fixed here; decay it as in Layer 1 if you like

def update_batch(batch):
    s, a, r, s_next, done = batch
    # vectorised TD target — one line, all 64 transitions (64 cells) at once
    td_target = r + gamma * (1.0 - done) * Q[s_next].max(axis=1)  # max over each next row
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error  # elementwise cell updates, no Python loop

s = env.reset()  # same gridworld env as Layer 1
for t in range(100_000):
    # 1) interact with the env (behavior policy = ε-greedy on the table)
    a = np.argmax(Q[s]) if np.random.rand() > eps else np.random.randint(N_ACTIONS)
    s_next, r, done = env.step(a)
    buffer.append((s, a, r, s_next, float(done)))
    # 2) learn from a random minibatch of past experience
    if len(buffer) >= 64:
        batch = map(np.array, zip(*random.sample(buffer, 64)))
        update_batch(tuple(batch))
    s = env.reset() if done else s_next

Q[s][a] = Q[s][a] + α*(td - Q[s][a]) ←→ Q[s, a] += α * (td - Q[s, a]) — same cell update, vectorised over a whole minibatch of rows
update on-the-fly after every step ←→ buffer.append(...) then sample later — replay buffer = decorrelated, reusable transitions
max(Q[s_next]) ←→ Q[s_next].max(axis=1) — max over each next row — 64 lookups at once
The tabular version cannot scale past toy problems — the cheat sheet has one row per state, and state spaces in the real world are huge (or continuous). You literally cannot allocate a table with a row per Atari frame. The fix, due to Mnih et al.'s 2013 DQN paper, is obvious in retrospect: replace the lookup table with a neural net Q_θ(s, a) that generalizes across similar states instead of storing each entry by hand. The update stays structurally identical; only now the targets are fit by gradient descent on an MSE (or Huber) loss, and the bag of tricks gets bigger.
import torch, torch.nn as nn, torch.nn.functional as F
from collections import deque
import random, gymnasium as gym

# The "cheat sheet" is now a neural net: input a state, output one Q-value per action.
class QNet(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, n_out))
    def forward(self, s): return self.net(s)

env = gym.make("CartPole-v1")
q, q_tgt = QNet(4, 2), QNet(4, 2)  # online net + target net
q_tgt.load_state_dict(q.state_dict())
opt = torch.optim.Adam(q.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)
gamma, eps = 0.99, 1.0

for episode in range(600):
    s, _ = env.reset()
    done, ep_return = False, 0.0
    while not done:
        # ε-greedy on the online Q-net (the learnable cheat sheet)
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = int(q(torch.tensor(s, dtype=torch.float32)).argmax())
        s_next, r, term, trunc, _ = env.step(a)
        done = term or trunc
        buffer.append((s, a, r, s_next, float(done)))
        s, ep_return = s_next, ep_return + r
        # DQN learning step
        if len(buffer) >= 64:
            batch = random.sample(buffer, 64)
            s_, a_, r_, sn_, d_ = map(torch.tensor, zip(*batch))
            s_, sn_ = s_.float(), sn_.float()
            with torch.no_grad():
                td_target = r_ + gamma * (1 - d_) * q_tgt(sn_).max(1).values  # target net!
            td_pred = q(s_).gather(1, a_.unsqueeze(1)).squeeze(1)
            loss = F.smooth_l1_loss(td_pred, td_target)  # Huber loss
            opt.zero_grad(); loss.backward(); opt.step()
    # slow-moving target: hard copy every N episodes (or Polyak every step)
    if episode % 10 == 0:
        q_tgt.load_state_dict(q.state_dict())
    eps = max(0.05, eps * 0.995)

episode 50: return = 32.1 ε=0.74
episode 200: return = 141.6 ε=0.27
episode 500: return = 500.0 ε=0.05 ← CartPole solved
Q[s, a] (a table lookup) ←→ q(s).gather(1, a) (a forward pass) — function approximator replaces the lookup table — same semantics, infinite rows
Q[s, a] += α * td_error ←→ loss.backward(); opt.step() — gradient descent on (td_pred − td_target)² does the same cell nudge
max Q(s_next) ←→ q_tgt(sn_).max(1).values — target network — a stale copy of the cheat sheet to keep the target stable
one buffer + one Q-table ←→ buffer + online net + target net — two nets because bootstrapping off a moving target diverges
ε never decays: you will explore your way to a mediocre plateau forever. Symptom: learning curve rises, then stops climbing suspiciously early. Always decay ε, always to a small positive floor.
Target network never updates: Q-values drift to ∞ because the target and the prediction are the same network and the whole thing is one giant positive-feedback loop. Symptom: losses explode, returns collapse. Always copy q → q_tgt on a fixed schedule (or use Polyak averaging).
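The Polyak option mentioned above — a soft update every step instead of a hard copy every N episodes — is a three-line sketch; τ here is an assumed hyperparameter, with values around 0.005 common in the literature:

```python
import torch

def polyak_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    # θ_tgt ← τ·θ_online + (1 − τ)·θ_tgt, applied parameter by parameter.
    # tau = 0.005 is an assumed default, not a value from this lesson's code.
    with torch.no_grad():
        for p, p_tgt in zip(online.parameters(), target.parameters()):
            p_tgt.mul_(1.0 - tau).add_(tau * p)
```

Call it once per gradient step in place of the hard copy every 10 episodes; the target then trails the online net smoothly instead of jumping.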
Too-small replay buffer: the distribution of transitions in the buffer becomes dominated by the agent's recent (narrow) policy. Old skills get overwritten — catastrophic forgetting. Typical DQN buffer: 10⁵–10⁶ transitions.
Unnormalised rewards: a reward of +100 once per episode and 0 otherwise makes the TD-target range enormous and destabilises gradient descent. Clip rewards to [−1, 1] or scale them; the original DQN paper does this on all Atari games and it is not a detail.
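The clipping itself is one line. A sketch of the Atari-style squash, applied to each reward before the transition enters the buffer:

```python
def clip_reward(r, lo=-1.0, hi=1.0):
    # Squash the raw env reward into [lo, hi] before it enters the TD target,
    # so a single +100 jackpot cannot blow up the gradient scale.
    return max(lo, min(hi, r))

print(clip_reward(100.0), clip_reward(-7.5), clip_reward(0.3))  # 1.0 -1.0 0.3
```

Clipping loses information about relative magnitudes (a +100 and a +2 both become +1); scaling rewards by a constant, or normalising by a running statistic, preserves it at the cost of another hyperparameter.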
Copy the Layer-3 code, run it for 600 episodes on CartPole-v1, and plot the per-episode return. You should see a messy upward sweep — noisy for the first 100 episodes, climbing sharply around episode 150–300, plateauing at 500 (the max episode length for CartPole-v1) once the policy is solid.
Then plot ε on the same axis. Notice the correlation: returns start climbing seriously once ε has decayed below ~0.3, i.e. once the agent mostly exploits the entries it trusts. Run again with ε_min = 0.0; you'll often see returns freeze mid-climb because the policy commits too early and the unvisited cells never get touched.
Bonus: implement Double DQN (two-line change in the target calculation) and check that the max Q-values grow more slowly than plain DQN. That slow growth is the whole point.
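For reference, the Double DQN target can be sketched as a drop-in for the Layer-3 target computation, reusing its variable names (q, q_tgt, r_, sn_, d_, gamma):

```python
import torch

def double_dqn_target(q, q_tgt, r_, sn_, d_, gamma):
    # Plain DQN lets q_tgt both PICK the next action and SCORE it, so upward
    # noise in any column gets selected and then trusted: systematic overestimation.
    # Double DQN decouples the two steps: online net picks, target net scores.
    with torch.no_grad():
        a_star = q(sn_).argmax(1, keepdim=True)           # online net chooses a'
        q_eval = q_tgt(sn_).gather(1, a_star).squeeze(1)  # target net evaluates it
        return r_ + gamma * (1.0 - d_) * q_eval
```

Swap it in for the td_target line in the training loop; everything else stays the same.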
What to carry forward. Q-learning is a cheat sheet you fill in by playing — one row per state, one column per action, one cell updated per transition toward a Bellman-style target. It is off-policy by construction, provably convergent in the tabular case, and generalises to deep RL by swapping the lookup table for a neural net. The three DQN tricks — replay buffer, target network, Huber loss — are not cosmetic; each one patches a specific failure mode of naïve function-approximation Q-learning. ε-greedy is the behavior policy that makes the whole table fillable at all: no exploration, no visits, no samples, no entries. Decaying ε on a sensible schedule is the single most common tuning lever.
Next up — Policy Gradients. Q-learning learns a value table and extracts a policy by argmax-ing each row. Policy gradients skip the cheat sheet entirely and directly parameterise the policy itself as a neural net, optimising it by following the gradient of expected return. That buys us two huge things: it handles continuous action spaces natively (argmax over a continuous column is awkward), and it learns stochastic policies when they are optimal (Q-learning can only represent deterministic greedy policies). REINFORCE, actor-critic, PPO — everything modern — lives on the other side of that door.
- [01] Christopher J. C. H. Watkins, "Learning from Delayed Rewards" · PhD thesis, University of Cambridge — the original Q-learning · 1989
- [02] Watkins & Dayan, "Q-learning" · Machine Learning 8(3–4), 279–292 — the convergence proof · 1992
- [03] Mnih et al., "Playing Atari with Deep Reinforcement Learning" · arXiv preprint — the first DQN paper · 2013
- [04] Mnih et al., "Human-level control through deep reinforcement learning" · Nature 518, 529–533 — DQN on 49 Atari games · 2015
- [05] van Hasselt, Guez & Silver, "Deep Reinforcement Learning with Double Q-learning" · AAAI — the overestimation fix · 2016
- [06] Sutton & Barto, "Reinforcement Learning: An Introduction", 2nd ed. · MIT Press — the canonical TD-learning reference · 2018