Markov Decision Processes
States, actions, rewards, transitions — the RL contract.
Picture a board game. Your piece is standing on a square. You pick a move. The dice roll, and the game hands you a new square — maybe the one you aimed for, maybe a sideways one because the board is slippery — and sometimes the square you land on gives you a candy. That is the whole game. Square, move, dice, candy. Repeat forever, or until someone wins.
Here's the weird part. The game doesn't remember how you got to this square. It doesn't care whether you walked in straight from the start or ricocheted off five other squares first. All it sees is where your piece is right now. Your future depends on your present square and nothing else. That's what “Markov” means — the board game with no memory.
Every reinforcement learning algorithm you've heard of — the thing that beat Go, the thing that turned a base language model into ChatGPT — is, underneath, a strategy for this board game. The rest of this page is what the square, the move, the dice, and the candy become when the board has ten billion squares and the dice are stochastic physics.
Supervised learning assumes somebody else already did the hard part: labeled the data. You get (x, y) pairs and your job is to learn the map. Reinforcement learning throws that away. There are no labels. There is a board, a player, and a candy jar that clinks or doesn't depending on what you do. The only feedback is that clink. Figure it out.
The contract is almost shockingly simple. You see a state s — your square. You pick an action a — your move. The environment rolls the dice and replies with a reward r — the candy — and a new state s' — your new square. The loop repeats, possibly forever. That's it. Every piece of RL — Q-learning, policy gradients, PPO, AlphaGo, ChatGPT's RLHF stage — is a different way of squeezing useful behavior out of that tiny contract.
To reason about the loop we need a formalism. The field settled on the Markov Decision Process — a 5-tuple that pins down what the board is, what the moves are, how the dice roll, and what “doing well” means. This lesson is an honest build-up of that tuple, plus the one equation (Bellman's) that every RL algorithm in the next five lessons will solve, copy, or approximate.
I don't need labels. Give me a board, a move set, and a candy that clinks when things are good. I'll figure out the rest. It will take me a hundred million tries, but I will figure out the rest.
Before any math, the picture. Every RL textbook draws this diagram because it is genuinely the whole setup — the board game loop with the players relabeled.
┌──────────────┐ action aₜ ┌──────────────────┐
│ │ (your move) │ │
│ AGENT │ ─────────────▶ │ ENVIRONMENT │
│ (you) │ │ (the board │
│ │ ◀───────────── │ + the dice) │
└──────────────┘ rₜ₊₁, sₜ₊₁ └──────────────────┘
▲ (candy, new square) │
│ state sₜ (your square) │
└─────────────────────────────────────┘
time → s₀, a₀, r₁, s₁, a₁, r₂, s₂, a₂, r₃, s₃, …

The agent is whatever decides moves — you, a lookup table, a neural network. The environment is everything else: the board, the dice, the opponent, the physics. A trajectory is just the alternating sequence of squares, moves, and candies that the loop produces. The goal is to pick moves so that the candies along the trajectory add up to something large.
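The loop in the diagram fits in a few lines of Python. Here's a minimal sketch: `ToyEnv` and the coin-flip agent are invented for illustration (they're not from any library), but the shape of the loop is exactly what gym-style environments expose.

```python
import random

class ToyEnv:
    """A 2-square board: square 0 is the start, square 1 pays candy and ends the game."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # the 'dice': move 1 actually moves you right with prob 0.8, else you stay
        if a == 1 and random.random() < 0.8:
            self.s = 1
        r = 1.0 if self.s == 1 else 0.0   # candy only on square 1
        return self.s, r, self.s == 1     # (new square, candy, game over?)

env = ToyEnv()
s = env.reset()
trajectory, done = [], False
while not done:
    a = random.choice([0, 1])             # the agent: here, a coin flip
    s_next, r, done = env.step(a)
    trajectory.append((s, a, r, s_next))  # s, a, r, s' — the RL contract
    s = s_next
print(f"episode length: {len(trajectory)}, total candy: {sum(t[2] for t in trajectory)}")
```

The agent only ever sees `(s, a, r, s')` tuples; everything RL does is squeezed out of that stream.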
To make “add up to something large” precise we need to define what the game is. That's the five-tuple.
M = ⟨ S, A, P, R, γ ⟩

  S             set of states       every square your piece could be on
  A             set of actions      every move you could make
  P(s' | s, a)  transition prob.    the dice — chance of landing on s' from (s, a)
  R(s, a)       reward              the candy — scalar you get for doing a in s
  γ ∈ [0, 1)    discount factor     candy-today vs. candy-in-ten-turns
S is every square — they can be a tiny discrete set (9 squares on a tic-tac-toe board) or an enormous continuous one (every possible 224×224 RGB frame of an Atari game). A is the move set — four arrows on a grid, every possible torque on a robot arm. P is the dice: given a square and a move, it says probabilistically which square comes next. R is the candy jar: scalar in, nothing fancy. And γ is the strange one — a number between 0 and 1 that controls how much you care about candy you could collect ten turns from now.
One subtle constraint binds all of this together: the Markov property. The next square depends only on the current square and move, not on the full history of how you got here. Formally P(s' | s, a, s_{t-1}, a_{t-1}, …) = P(s' | s, a). The present screens off the past. The board game has no memory.
If the real world isn't Markov, engineer the state until it is. A single Atari frame isn't Markov — you can't tell which way the ball is moving from one still image — so DQN stacks the last 4 frames into the state. Poker isn't Markov from your cards alone — you also need the betting history — so put the betting history in the state. Rewrite the square until it carries every fact the future depends on. The trick to RL engineering is usually not the algorithm; it's the state representation.
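The frame-stacking trick can be sketched in a few lines. A minimal version, with a hypothetical `FrameStack` wrapper (the name and interface are illustrative, not DQN's actual code), using integers as stand-in "frames":

```python
from collections import deque

class FrameStack:
    """Make the state the last k observations instead of just the newest one.

    One frame can't encode velocity; a stack of frames can — that's the
    DQN-style trick for restoring the Markov property."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.k):      # pad the stack with the first frame
            self.frames.append(first_frame)
        return tuple(self.frames)

    def observe(self, frame):
        self.frames.append(frame)    # oldest frame falls off the back
        return tuple(self.frames)

stack = FrameStack(k=4)
state = stack.reset(0)
for ball_x in [1, 2, 3]:             # a 'ball' drifting right, one pixel per tick
    state = stack.observe(ball_x)
print(state)  # (0, 1, 2, 3) — the direction of motion is now in the state
```

The single latest frame `3` is ambiguous; the tuple `(0, 1, 2, 3)` is not — the future now depends only on the (stacked) present.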
Abstract enough. Let's make it concrete. Below is a literal board — a 5×5 grid with walls, a goal, and candies. Click a move; watch the dice roll you into a new square and the candy land on the scoreboard. This is a fully-specified MDP — you can read off S, A, P, and R from the UI directly.
A few things to notice. The square is just your (row, col) — nothing about past moves is carried forward, and nothing needs to be (this game is exactly Markov by construction). The move set is tiny: four directions. The dice are stochastic (with some slip probability you skid sideways instead of forward), and the candy is usually a small negative per step plus a big positive at the goal. That's how the board nudges you toward short paths — every wasted step costs you a sliver of candy.
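The "slippery dice" can be written down as an explicit distribution. A sketch, with a hypothetical `slip_transitions` helper (the name and the direction encoding are made up for illustration):

```python
def slip_transitions(intended, slip=0.1):
    """Distribution over the direction you actually move, given the one you aimed for.

    With prob 1 - slip you go where you aimed; with prob slip/2 each you
    skid to a perpendicular side — the 'slippery board' dice."""
    # direction encoding: 0 = up, 1 = right, 2 = down, 3 = left
    left_of  = (intended - 1) % 4
    right_of = (intended + 1) % 4
    return {intended: 1 - slip, left_of: slip / 2, right_of: slip / 2}

print(slip_transitions(1))  # aiming right → {1: 0.9, 0: 0.05, 2: 0.05}
```

Composing this with the grid layout (where each direction lands from each square) is exactly how you'd fill in the full P(s' | s, a) tensor.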
Every gridworld you'll read about in Sutton & Barto is a tiny variant of this. The reason the field keeps using them is that they're small enough to solve exactly — you can literally enumerate every square and every move — which makes them the ideal lab for intuition about algorithms that then scale up to Go and Dota.
You want to maximize total candy over a trajectory. But “total” is ambiguous — over what horizon? If the game runs forever and candies are bounded positive, the sum diverges. That's where γ comes in. A candy today is not worth the same as a candy in ten turns. The formal quantity you actually maximize is called the return:
G_t = r_{t+1} + γ · r_{t+2} + γ² · r_{t+3} + … = Σ_{k=0}^{∞} γᵏ · r_{t+k+1}
γ = 0 → myopic; only the next candy matters.
γ = 1 → infinite horizon; every candy weighted equally (sum may diverge).
γ = 0.9 → effective horizon ≈ 1 / (1 − γ) = 10 turns.
γ = 0.99 → effective horizon ≈ 100 turns.

Two things are doing work here. First, the math: any γ < 1 combined with bounded candy makes the infinite sum finite (geometric series). A potentially-undefined objective becomes a well-defined optimization problem — which is the whole point. Second, the intuition: γ is a dial for how much you care about candy you could collect later versus candy you could grab right now. A low γ produces a short-sighted player that pockets whatever candy is closest. A high γ produces a patient player willing to eat a small penalty now to unlock a big candy ten squares from here.
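You can feel the effective horizon numerically. A small sketch: with a constant candy of 1 per turn, G₀ approaches 1 / (1 − γ), which is exactly the horizon rule of thumb above.

```python
def discounted_return(rewards, gamma):
    """G_0 = Σ γᵏ · r_{k+1}, computed right-to-left (Horner's rule)."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [1.0] * 50                 # one candy every turn, 50 turns
for gamma in (0.0, 0.5, 0.9, 0.99):
    # with constant candy, G_0 → 1 / (1 − γ): the 'effective horizon'
    print(f"γ = {gamma:4.2f}  G_0 = {discounted_return(rewards, gamma):7.3f}")
```

At γ = 0.5 the total barely clears 2.0 — turns beyond the first couple contribute essentially nothing. At γ = 0.99 it reaches roughly 39.5, well short of the infinite-horizon limit of 100 only because the game stops at turn 50.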
I am the horizon dial — candy-today versus candy-in-ten-turns. Crank me to 0.99 and your player plans a hundred moves ahead. Crank me to 0.5 and it will sell its grandmother for a candy it can grab in two moves. Pick me badly and your player is either a lunatic or a goldfish.
Drag the γ slider. The plot shows a 50-turn trajectory with a fixed candy sequence; the bars show how much each turn's candy contributes to G₀ after discounting. At γ = 0.5 the bars past turn 10 are visually nothing — you effectively don't see them. At γ = 0.99 the 50th turn still contributes meaningfully. This is the horizon effect you'll feel in every algorithm from here on.
The thing we're learning is called a policy — your strategy, a rule for picking moves. Two flavors:
- Deterministic: π(s) = a. “On square s, make move a.”
- Stochastic: π(a | s) is a probability distribution over moves. “On square s, flip this weighted coin.”
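In code, the two flavors are a lookup versus a weighted coin flip. A minimal sketch (the dicts and the `act` helper are illustrative, not from any library):

```python
import random

# Deterministic policy: a plain lookup — square in, move out.
pi_det = {"s0": "right", "s1": "right"}

# Stochastic policy: square in, weighted coin out.
pi_sto = {"s0": {"right": 0.8, "left": 0.2},
          "s1": {"right": 1.0}}

def act(policy, s):
    rule = policy[s]
    if isinstance(rule, str):          # deterministic: just look it up
        return rule
    moves, probs = zip(*rule.items())  # stochastic: sample the weighted coin
    return random.choices(moves, weights=probs)[0]

print(act(pi_det, "s0"))  # always 'right'
print(act(pi_sto, "s0"))  # 'right' about 80% of the time
```

A deterministic policy is just the special case of a stochastic one where the coin has a single face; many algorithms (policy gradients especially) need the stochastic form to explore.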
To compare strategies we need a score. The score of being on square s while following policy π is the expected return from there — the expected candy total, discounted:
V^π(s) = E_π [ G_t | s_t = s ]
Q^π(s, a) = E_π [ G_t | s_t = s, a_t = a ]
V and Q relate by:
V^π(s) = Σ_a π(a | s) · Q^π(s, a)
Q^π(s, a) = R(s, a) + γ · Σ_{s'} P(s' | s, a) · V^π(s')

V is the square value — “how good is this square?” Q is the square-and-move value — “how good is this square if I move right first?” Most modern deep RL learns Q directly because the player can act greedily: a = argmax_a Q(s, a) — just look up every move from this square and take the best one.
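The V-from-Q relation is one line of NumPy. A toy example with made-up numbers (two squares, two moves; nothing to do with the gridworld above):

```python
import numpy as np

Q = np.array([[1.0, 3.0],   # Q(s0, ·) for the two moves
              [2.0, 0.0]])  # Q(s1, ·)
pi = np.array([[0.5, 0.5],  # π(· | s0) — a fair coin between the moves
               [1.0, 0.0]]) # π(· | s1) — deterministic, always move 0

V = (pi * Q).sum(axis=1)    # V(s) = Σ_a π(a | s) · Q(s, a)
greedy = Q.argmax(axis=1)   # acting greedily needs only Q — no model of P
print(V)       # [2. 2.]
print(greedy)  # [1 0]
```

Note what the greedy line does *not* need: the dice. That asymmetry between V and Q is why Q-learning can act without a model.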
Here's the recursion that holds everything together. The value of a square can be written in terms of the values of the squares one move ahead:
V^π(s) = Σ_a π(a | s) · [ R(s, a) + γ · Σ_{s'} P(s' | s, a) · V^π(s') ]
         └──────────┘     └─────┘       └────────────────────────────┘
         weight over      immediate     γ-discounted value of
         moves            candy         where we'll land next

Read it out loud. “The value of this square under strategy π is the average, over moves π might take, of the candy I pick up plus the discounted value of the square I'll land on.” The recursion folds the whole infinite-horizon return into a one-step relationship. That is enormous. It means we can solve for V by iterating a local update instead of computing an infinite sum.
Analogously there's a Bellman optimality equation, where instead of averaging over moves the player picks the best one:
V*(s) = max_a [ R(s, a) + γ · Σ_{s'} P(s' | s, a) · V*(s') ]

π*(s) = argmax_a [ R(s, a) + γ · Σ_{s'} P(s' | s, a) · V*(s') ]

An optimal policy π* is one that maximizes V^π(s) for every square s simultaneously. A remarkable theorem (due to Bellman, then Puterman tidied it up) says such a strategy exists for every finite MDP — and while there can be ties between multiple optimal strategies, the optimal value is unique. Put differently: there is a single right answer in value-space, and solving the Bellman optimality equation recovers it.
I turn an infinite sum into a one-step recursion. Give me the values of the squares you'll be on next, and I'll give you the value of the square you're on. Iterate me and I converge. Neural-network me and I become deep Q-learning. I am the backbone of this entire field.
Enough theory — let's run it. Same three-layer build as usual: pure Python first (a single Bellman backup on a 3-square toy), NumPy next (full value iteration on our 5×5 board), and finally a PyTorch sketch where the value function is a neural net trained by gradient descent (a preview of deep RL).
# A 3-square toy MDP: s0 ──right──▶ s1 ──right──▶ s2 (terminal, +10 candy)
# step-candy of -1 for every non-terminal move; γ = 0.9.
gamma = 0.9
states = ["s0", "s1", "s2"]
actions = {"s0": ["right"], "s1": ["right", "left"], "s2": ["stay"]}

# transition model: T[s][a] = (s', r) — where the dice land, what candy you get
T = {
    "s0": {"right": ("s1", -1.0)},
    "s1": {"right": ("s2", +10.0), "left": ("s0", -1.0)},
    "s2": {"stay": ("s2", 0.0)},  # absorbing
}

V = {s: 0.0 for s in states}  # initial guess

def bellman_backup(V):
    V_new = {}
    for s in states:
        # Bellman optimality: V*(s) = max_a [ candy + γ · V(next square) ]
        V_new[s] = max(r + gamma * V[s_next]
                       for (s_next, r) in (T[s][a] for a in actions[s]))
    return V_new

for i in range(21):
    V = bellman_backup(V)
    if i in (0, 1, 5, 20):
        print(f"iter {i:02d} V = {[round(V[s], 4) for s in states]}")

# extract the greedy strategy — on each square, pick the move with highest value
policy = {s: max(actions[s], key=lambda a: T[s][a][1] + gamma * V[T[s][a][0]])
          for s in states}
print("converged — optimal policy:", policy)

iter 00 V = [-1.0, 10.0, 0.0]
iter 01 V = [8.0, 10.0, 0.0]
iter 05 V = [8.0, 10.0, 0.0]
iter 20 V = [8.0, 10.0, 0.0]
converged — optimal policy: {'s0': 'right', 's1': 'right', 's2': 'stay'}

On a chain this tiny, the backup converges after two iterations: once the +10 reaches s1, one more sweep propagates -1 + 0.9 · 10 = 8 back to s0 and nothing changes after that.

Vectorize it. On a 5×5 board we have 25 squares and 4 moves. We can stuff the whole MDP into NumPy arrays — P is shape (S, A, S), R is (S, A) — and express one sweep of value iteration as a handful of tensor contractions.
import numpy as np

# 5×5 board: moves = {0: up, 1: right, 2: down, 3: left}
# candy +10 at the goal square (0, 4), step cost -0.04 elsewhere, wall at (3, 2).
S, A = 25, 4
gamma = 0.95

# build P[s, a, s'] and R[s, a] from the grid layout
P = np.zeros((S, A, S))
R = np.full((S, A), -0.04)
# ... (populate P, R from grid topology — omitted for brevity) ...

V = np.zeros(S)
for sweep in range(500):
    # Q(s, a) = candy(s, a) + γ · Σ_{s'} dice(s, a, s') · V(s')
    Q = R + gamma * P @ V          # shape (S, A)
    V_new = Q.max(axis=1)          # Bellman optimality
    delta = np.abs(V_new - V).max()
    V = V_new
    if delta < 1e-4:
        print(f"sweep {sweep:03d} Δ = {delta:.2e} ✓ converged")
        break
    if sweep % 10 == 0:
        print(f"sweep {sweep:03d} Δ = {delta:.2e}")

pi = Q.argmax(axis=1)              # greedy policy
V_grid = V.reshape(5, 5).round(2)
V_grid[3, 2] = -np.inf             # mask the wall square for display
print("V* grid:\n", V_grid)

sweep 000 Δ = 1.00e+01
sweep 010 Δ = 1.34e+00
sweep 050 Δ = 7.40e-03
…
sweep 112 Δ = 9.70e-05 ✓ converged
V* grid:
 [[ 5.32  6.58  8.1   9.83 10.  ]
  [ 4.27  5.32  6.58  8.1   9.83]
  [ 3.34  4.27  5.32  6.58  8.1 ]
  [ 2.51  3.34  -inf  5.32  6.58]
  [ 1.76  2.51  3.34  4.27  5.32]]
- for s in states: max over actions ←→ Q = R + γ · (P @ V); V = Q.max(axis=1) — one matmul replaces the nested Python loop; the entire Bellman sweep is 2 lines
- dict of transitions T[s][a] = (s', r) ←→ arrays P[s, a, s'] and R[s, a] — dense tensors; wasteful for sparse boards, but vectorizable and GPU-friendly
- extract policy with argmax in a loop ←→ pi = Q.argmax(axis=1) — one call, whole strategy
On the 5×5 board value iteration converges in ~100 sweeps. But what if the board has ~10¹⁷⁰ squares (Go) or continuous squares (a robot arm)? You can't store a table. Function approximation: replace the table with a neural network that maps square → value. The Bellman equation becomes a loss function, and you train by gradient descent on it. This is the entire premise of deep RL — and it's the bridge to the next few lessons.
import torch
import torch.nn as nn

# Neural value approximator: square features → scalar V(s).
class ValueNet(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

# Assume a replay of (s, r, s') transitions from a random policy,
# served by a sample_batch() helper (not shown).
gamma = 0.95
V_net = ValueNet(state_dim=25)               # one-hot square features
optim = torch.optim.Adam(V_net.parameters(), lr=3e-3)

for step in range(2001):
    s, r, s_next = sample_batch()            # from replay buffer
    with torch.no_grad():
        target = r + gamma * V_net(s_next)   # Bellman target — TD(0)
    pred = V_net(s)
    loss = ((pred - target) ** 2).mean()     # regression on Bellman residual
    optim.zero_grad(); loss.backward(); optim.step()
    if step % 500 == 0:
        print(f"step {step:04d} bellman loss = {loss.item():7.4f}")

step 0000 bellman loss = 23.8412
step 0500 bellman loss =  0.6194
step 2000 bellman loss =  0.0381

V_net(start square) ≈ 5.34   (tabular V* = 5.32)
- V = np.zeros(S)  # a table, one entry per square ←→ V_net = ValueNet(...)  # a parametric function — from a value per square to a function that generalizes across squares
- V_new = (R + γ P @ V).max(axis=1) ←→ target = r + γ · V_net(s_next); MSE(V_net(s), target) — same Bellman equation, now a regression target for gradient descent
- sweep until Δ < ε ←→ train until the loss plateaus — fixed-point iteration becomes stochastic optimization; welcome to deep RL
Undiscounted infinite returns: if you set γ = 1 on a continuing game with non-zero candy, G_t is divergent and nothing is optimal — the Bellman equation has no finite fixed point. Your loss will look fine for a while, then explode. Keep γ < 1.
Confusing V and Q: V(s) is “how good is this square,” Q(s, a) is “how good is this square if I make move a first.” Q-learning learns Q because you can act greedily from it without knowing the dice. V-learning needs a model of P to act.
Expected vs. realized return: V^π(s) = E[G_t] is an expectation. Any single rollout gives a single realized G_t, which can be wildly off the mean in a stochastic game. Never judge a strategy on one episode.
Unseeded environments: gym-style environments roll stochastic dice. Without seeding (env.seed(k) in classic Gym, env.reset(seed=k) in Gymnasium) plus np.random.seed(k), your experiments are non-reproducible — and RL is famously brittle enough that seed-to-seed variance can swallow real algorithmic improvements. Seed everything, log the seed, report the median across at least 5 seeds.
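The expected-vs-realized pitfall is easy to demonstrate, and so is the seeding habit. A sketch on a made-up continuing game (numbers invented for illustration): each step pays candy drawn from N(1, 2²), γ = 0.9, truncated at 100 steps where γᵏ is negligible. The true V is Σ γᵏ · E[r] = 1 / (1 − γ) = 10, but any single rollout can land far from it.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed everything, log the seed
gamma, horizon, n_episodes = 0.9, 100, 1000

discounts = gamma ** np.arange(horizon)
returns = np.array([
    (discounts * rng.normal(1.0, 2.0, size=horizon)).sum()
    for _ in range(n_episodes)
])
# V^π = E[G] = 10 here; single realized returns scatter widely around it
print(f"mean of {n_episodes} rollouts: {returns.mean():5.2f} (true V = 10)")
print(f"single-rollout spread: min {returns.min():5.2f}, max {returns.max():5.2f}")
```

The mean of a thousand rollouts sits close to 10; individual rollouts routinely miss by several candies in either direction. Judging a policy on one episode is judging a coin on one flip.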
Build the MDP: a 5×5 board with a goal square at (0, 4) giving candy +10 (terminal), a wall at (3, 2) that can't be entered, a step cost of -0.04 on every other square, and γ = 0.95. Moves are {up, right, down, left}; bumping into a wall or the board edge leaves your piece in place.
Fill in the (25, 4, 25) transition tensor, then run value iteration until max Δ < 1e-4. Reshape V* back to (5, 5) and render it as a heatmap (matplotlib imshow). Overlay the greedy strategy as arrows pointing out of each square using plt.quiver.
Bonus: add a 10% slip probability (the dice send you perpendicular to your intended direction with prob 0.05 each side). Watch the value function and strategy change — the optimal route now avoids corridors next to the cliff, because when the dice misbehave, playing close to the edge is risky even if it's shorter.
What to carry forward. An MDP is a 5-tuple ⟨S, A, P, R, γ⟩ — squares, moves, dice, candy, horizon dial — plus the Markov property, which says the game has no memory. Your goal is to maximize the discounted return G_t = Σ γᵏ r_{t+k+1}. A strategy π has a value function V^π, which satisfies the Bellman recursion — a one-step equation that folds up the whole infinite horizon. Solving or approximating this recursion is reinforcement learning. Value iteration does it exactly on a table; deep RL does it approximately with a neural net.
Next up — Q-Learning. We've been assuming we know how the dice roll — that P(s' | s, a) is handed to us. In real life you don't get to peek at the dice. Q-learning is the first RL algorithm that doesn't need P — it learns the value of every square-and-move pair purely from watching actual rolls happen, building up a kind of cheat sheet square by square. It's the step from planning to actually learning, and it's the direct ancestor of DQN.
- [01] Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.) · MIT Press, 2018 — the canonical RL textbook
- [02] Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming · Wiley, 1994 — the formal measure-theoretic treatment
- [03] Achiam, Spinning Up in Deep RL · OpenAI — the cleanest modern intro to the MDP formalism
- [04] Bellman, Dynamic Programming · Princeton University Press, 1957 — where the equation came from