REINFORCE

The cleanest policy-gradient algorithm — and its variance problem.

Hard
~15 min read
· lesson 4 of 6

Picture a patron at a casino. Not the movie version — no tuxedo, no James Bond. Just someone on a stool in front of a slot machine with a handful of arms, each one labeled with a different action. They pull an arm. Lights flash. The machine nudges them into a new state and coughs up (or swallows) some chips. They pull another. And another. They don't know which arm is best. They don't even know if they're playing the right kind of slot. They just pull, watch the chips move, and keep notes.

At the end of the night they close the tab, tally the whole evening into one number — total payout, good or bad — and turn around to the row of arms they pulled: pulling you paid off tonight, do more of that tomorrow. You over there, the one I pulled in the third round, same goes for you. Don't pull this one — that whole night was a loss. No coach at their shoulder telling them which arm was right at each moment. No instant replay. Just a whole-night tally, and a rule for redistributing that tally back to every arm they touched.

That's REINFORCE. It's the simplest thing that can possibly work in reinforcement learning. Run a policy, play a full episode, wait till the slot machine stops spinning, tally the return, and reinforce every action in proportion to how the night went. No critic, no target network, no replay buffer. Thirty lines of code. The patron walks out knowing slightly better which arms to pull tomorrow.

You have a policy π_θ(a|s) — a neural network that takes a state and spits out a distribution over actions (which arm to pull). You want to tune θ so the policy racks up more reward. The supervised move would be to ask: what was the right arm at each step? Run backprop on that. But in an MDP nobody tells you the right arm. You only find out — later, often much later — whether the whole night paid off.
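Concretely, here is a minimal sketch of such a policy, assuming a made-up 3-arm, 2-feature linear-softmax setup (not the CartPole network used later in this lesson):

```python
import math, random

# Hypothetical setup: 3 arms, 2 state features, theta is a 3x2 weight matrix.
theta = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]

def policy_probs(theta, s):
    """pi_theta(.|s): softmax over logits = theta @ s."""
    logits = [sum(w * x for w, x in zip(row, s)) for row in theta]
    m = max(logits)                               # stabilise the exponentials
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

probs = policy_probs(theta, [1.0, 0.5])           # a distribution over 3 arms
a = random.choices(range(3), weights=probs)[0]    # sample which arm to pull
```

Tuning θ means nudging these weights so that the sampled arms, on average, rack up more chips.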

REINFORCE is also the conceptual parent of every policy-gradient method you've ever heard of. PPO, TRPO, A2C, RLHF — they're all REINFORCE with better variance control or more stability. If you get the casino tracker in your bones, the rest of the family is a list of patches.

REINFORCE (personified)
I'm the patron at the slot machine. Sample a trajectory, wait till the episode's over, tally the payout, and weight each arm's log-probability by the return that followed. My gradients are unbiased and obscenely noisy. Everyone who came after me exists to reduce my variance.

Here's the whole algorithm on a single page. Everything else in this lesson is commentary on these three lines.

REINFORCE — the full update, one episode at a time
for each episode:

    1.  τ   =  (s₀, a₀, r₁, s₁, a₁, r₂, ... , s_{T-1}, a_{T-1}, r_T)
            ← sampled by running π_θ in the environment
               (the patron playing the slot machine for a full night)

    2.  G_t =  Σ_{k=t}^{T-1}  γ^{k-t} · r_{k+1}       ← return from step t onward
                                                        (the tally for arm-pull t
                                                         and every pull after it)

    3.  θ   ←  θ  +  α · Σ_{t=0}^{T-1}  ∇_θ log π_θ(a_t | s_t) · G_t
                                                     (do more of the arms whose
                                                      tally came out positive;
                                                      do less of the others)

Read that bottom line slowly, because it's the whole lesson. It's the policy gradient theorem turned into an update. For every (s_t, a_t) pair in the trajectory — every arm the patron pulled, and the state of the slot machine when they pulled it — you compute the gradient of log π_θ(a_t | s_t), the “if I wiggled θ, how would the log-probability of this arm-pull change?” score. Then you scale it by G_t, the tally of what actually happened after that pull. Sum across the trajectory. That's your gradient. It's gradient ascent — we're climbing expected return, not descending a loss — so the update adds rather than subtracts.

The reveal: if pulling an arm led to a positive tally, G_t is big and positive, so the update shoves log π_θ(a_t|s_t) up — the policy becomes more likely to pull that arm next time in that state. If the tally was tiny or negative, the update pushes it down. The casino never has to tell the patron which arm was the right one. The payout at the end of the night tells them how hard to reinforce the arm they happened to choose. That's the whole game. That's why this thing works at all.
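A one-step numeric sketch of that shove, assuming a toy stateless bandit where θ are the softmax logits themselves (so ∇_θ log π(a) = onehot(a) − π, the standard softmax identity; the numbers are made up):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

theta = [0.0, 0.0, 0.0]            # three arms, initially uniform
probs = softmax(theta)             # [1/3, 1/3, 1/3]

a, G, alpha = 1, 10.0, 0.1         # pulled arm 1; the night tallied +10 chips
grad_logp = [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]
theta = [t + alpha * G * g for t, g in zip(theta, grad_logp)]

new_probs = softmax(theta)         # arm 1 is now more probable, the others less
```

Run it with G = -10.0 and the shove reverses: the pulled arm becomes less probable.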

Watch it learn. The widget below runs REINFORCE on CartPole in your browser. Episode return (the length of time the pole stays up, capped at 500) is plotted against training episodes. A random policy — the patron pulling arms at random — lasts about 20 steps. A solved policy lasts 500. In between is a lot of noise, because in between is a lot of nights where the tally swung for reasons that had very little to do with the arm on any given pull.

REINFORCE training — noisy climb
simulated bandit · 500 episodes · variance is the feature
final mean    0.952
final σ       0.112
episode       500 / 500

Three things to notice. First, it does learn — episode return climbs from ~20 toward 500 over a few hundred episodes. The casino tracker is tuning itself. Second, the curve is brutally jagged. Two episodes in a row with nearly identical policies can return 50 and 400; REINFORCE just eats the variance. Third, the slope is impatient in the middle and flattens near the ceiling — classic RL learning dynamics. The cheap wins come first, the last few percent come slowly.

So REINFORCE works. Why does anyone use anything else? Because the gradient estimator is unbiased but has catastrophic variance. This is the part of the casino story that bites. One full night at the slot machine is one sample from a massive distribution over possible nights. Your policy is stochastic — it has to be, for exploration — and the slot machine is stochastic, and the patron played for a few hundred pulls, and the tally at the end is one draw from the joint distribution of all of it. The same patron with the same policy on the same slot machine can walk out up 400 chips on Monday and down 50 on Tuesday. REINFORCE just eats that.

It gets worse. A night with a great tally might have one brilliant arm-pull followed by a string of lucky random ones that would have failed on any other night. REINFORCE doesn't know which pull was the good one — it reinforces every arm pulled that night in proportion to the final tally. The signal is in there, but it's buried under “the patron got lucky seven times in a row on arms that usually lose.” You need many, many nights to average the luck out. That's the variance problem, and it's the entire reason the rest of policy-gradient research exists.
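You can feel the variance directly with a Monte Carlo sketch (the payout numbers are hypothetical): a fixed stochastic policy on a fixed machine, yet single-night tallies swing wildly around the true expectation.

```python
import random, statistics

random.seed(0)

# Hypothetical machine: arm 0 pays N(1.0, 5.0) chips per pull, arm 1 pays
# N(0.5, 5.0). A fixed policy pulls arm 0 with probability 0.6.
def play_night(pulls=100):
    total = 0.0
    for _ in range(pulls):
        a = 0 if random.random() < 0.6 else 1
        total += random.gauss(1.0 if a == 0 else 0.5, 5.0)
    return total

nights = [play_night() for _ in range(1000)]
mean = statistics.mean(nights)     # close to the true expectation of 80 chips
sd = statistics.stdev(nights)      # but a single night routinely misses by ~50
```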

the three standard variance-reduction moves
  (1)  baseline b(s_t):

       θ ← θ + α · Σ_t  ∇log π_θ(a_t|s_t) · ( G_t − b(s_t) )

       subtracting b(s) doesn't change the expectation (∇log π sums to zero
       in expectation) but shrinks variance dramatically. Usually b(s) = V_φ(s),
       a learned value function trained by regression on returns.


  (2)  normalized returns:

       Ĝ_t = ( G_t − mean(G) ) / ( std(G) + ε )

       z-score the returns within a batch. Cheap, no extra network.


  (3)  reward-to-go (already baked into G_t above):

       use ONLY the rewards after time t, not the whole trajectory's return.
       arm-pulls at step t have no causal effect on payouts BEFORE step t,
       so including them is just noise.

The baseline is the big one. Replace the raw tally G_t with G_t − b(s_t) and you're now reinforcing an arm-pull based on how much better it did than expected at that state. If b(s) is the value function V(s), the thing you're multiplying the log-prob by is the advantage A(s, a) = Q(s, a) − V(s) — literally “how much better is this arm than the average arm at this state on this slot machine.” That's the conceptual leap to actor-critic and to PPO.
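The claim that the baseline doesn't change the expectation can be checked exactly for a softmax policy, where ∇_θ log π(a) = onehot(a) − π: weight by π, sum over actions, and every component comes out zero for any baseline value. A numeric sketch with made-up logits:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

probs = softmax([0.3, -1.2, 0.8])          # any policy will do
b = 42.0                                   # any baseline value will do

# E_{a~pi}[ grad log pi(a) * b ], component i:
#   sum_a  probs[a] * (onehot(a)[i] - probs[i]) * b
expected = [sum(probs[a] * ((1.0 if i == a else 0.0) - probs[i]) * b
                for a in range(len(probs)))
            for i in range(len(probs))]    # every component is exactly zero
```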

Three training curves on CartPole, same seed: vanilla REINFORCE, REINFORCE with a learned baseline, and REINFORCE with a baseline plus normalized returns. Same algorithm at the core — but the second and third curves are meaningfully smoother and converge faster.

variance reduction — three flavours of REINFORCE
same seed · bands shrink as baseline → advantage

                           final return      episodes to 80%      gradient estimator
                           (↑ higher         of optimum           variance
                            better)          (↓ lower better)     (↓ lower better)
  REINFORCE                0.951             0 ep                 0.459 σ²
  REINFORCE + baseline     0.987             202 ep               0.104 σ²
  REINFORCE + advantage    0.990             183 ep               0.031 σ²

The gap isn't about the final score — all three can solve CartPole eventually. The gap is about how many episodes it takes and how confidently you can trust the learning curve. With a baseline, the gradient variance drops and the policy updates in a straighter line toward the optimum. Stack both tricks and you get the practical version of REINFORCE that shows up in real codebases, usually as a baseline comparison for PPO.

Reward-to-go (personified)
I am the edit that stopped paying out arm-pulls for things they couldn't possibly have caused. The reward you got on pull 3? Arm 7, pulled on pull 7, had nothing to do with it. Cross it out. You'd think this is obvious, but early policy-gradient code routinely multiplied log-probs by the full-trajectory tally. I cut the variance in half for free.

Three layers, as always. Pure Python to see the casino tracker's skeleton — run an episode, compute returns, nudge θ. NumPy to vectorise the returns across a batch of episodes. PyTorch to hand the gradient work to autograd via Categorical.log_prob. Same algorithm, three idioms, each shorter than the last.

layer 1 — pure python · reinforce_scratch.py
python
import random

# Assume a tiny environment and a policy π_θ parameterised by θ.
# For each state s, π(a|s) is e.g. a softmax over logits = θ @ features(s).

def run_episode(env, policy):
    traj = []
    s = env.reset()
    while True:
        probs  = policy.probs(s)                       # π(·|s) as a list
        a      = random.choices(range(len(probs)), weights=probs)[0]
        s2, r, done = env.step(a)
        traj.append((s, a, r))
        if done: return traj
        s = s2

def returns_from(traj, gamma=0.99):
    G, out = 0.0, []
    for (_, _, r) in reversed(traj):                   # reward-to-go, backwards
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_step(policy, traj, alpha=1e-2, gamma=0.99):
    Gs = returns_from(traj, gamma)
    for (s, a, _), G in zip(traj, Gs):
        grad_logp = policy.grad_log_prob(s, a)         # ∇_θ log π(a|s)
        policy.theta += alpha * G * grad_logp          # θ ← θ + α·G·∇log π
stdout
episode 0   return=17
episode 50  return=43
episode 100 return=78
episode 150 return=142
...

Thirty lines of arithmetic. No replay buffer, no target network, no importance weights, no clipping. The patron plays the night, tallies the payout backwards from the end (reward-to-go), and nudges θ once per arm-pull. Now vectorise — run N episodes in parallel, pack them into arrays, do the tallies in one sweep.

layer 2 — numpy · reinforce_numpy.py
python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """rewards: shape (T,)  →  returns: shape (T,)   reward-to-go."""
    T = len(rewards)
    G = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def normalize(x, eps=1e-8):
    return (x - x.mean()) / (x.std() + eps)            # variance-reduction trick

def reinforce_update(grad_log_probs, returns, baseline, alpha=1e-2):
    """
    grad_log_probs : (T, D)   — stacked ∇log π for each step
    returns        : (T,)     — G_t
    baseline       : (T,)     — V(s_t), or zeros for vanilla REINFORCE
    """
    advantages = normalize(returns - baseline)         # (G_t − b(s_t)), z-scored
    grad = (advantages[:, None] * grad_log_probs).sum(axis=0)
    return alpha * grad                                # Δθ
pure python → numpy

  for (s,a,_), G in zip(traj, Gs): θ += α·G·∇log π
      ←→  Δθ = α · (advantages[:,None] * grad_log_probs).sum(0)
      sum over timesteps becomes one matrix reduction

  returns computed in a Python for-loop
      ←→  reverse-scan once, writing into a preallocated array
      same O(T), but no per-step Python overhead

  raw G_t
      ←→  (G − b) / std(G)
      subtract baseline, then z-score: two one-liners of variance reduction

Now PyTorch. The gradient bookkeeping disappears — you just form the loss and call .backward(). The critical idiom is torch.distributions.Categorical (or Normal for continuous actions): you sample arm-pulls from it at rollout time, and at update time you ask it for log_prob(a), which is a differentiable tensor autograd can trace through.

layer 3 — pytorch · reinforce_pytorch.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))
    def forward(self, s):
        return Categorical(logits=self.net(s))          # π(·|s)

class Value(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)                  # V_φ(s)

def train_step(policy, value, opt_p, opt_v, states, actions, returns):
    # --- policy update ----------------------------------------
    dist     = policy(states)
    log_prob = dist.log_prob(actions)                   # differentiable
    baseline = value(states).detach()                   # DETACH: no grad into V here
    advantage = returns - baseline
    advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)

    loss_pi = -(log_prob * advantage).mean()            # minus: we're ascending
    opt_p.zero_grad(); loss_pi.backward(); opt_p.step()

    # --- value update (regression on returns) ------------------
    loss_v = F.mse_loss(value(states), returns)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
stdout
ep 0    return= 22.0
ep 100  return= 58.4   (avg-last-10)
ep 250  return=197.8
ep 500  return=498.6   solved
numpy → pytorch

  policy.grad_log_prob(s, a), hand-rolled
      ←→  Categorical(logits=net(s)).log_prob(a)
      autograd traces ∇log π for you

  Δθ = α · Σ_t advantage_t · ∇log π_t
      ←→  loss = -(log_prob * advantage).mean(); loss.backward()
      minus sign: PyTorch minimizes, policy gradient ascends

  baseline = V_φ(s) as a numpy array
      ←→  value(states).detach()
      detach so the policy loss doesn't flow gradients into V

Gotchas

Off-policy bug via stale log-probs: the trajectory must be sampled from the current policy, and the log_prob you use in the loss must come from the current policy as well. If you cache log-probs from rollout and then update θ multiple times, you're using stale log-probs and the gradient is biased. This is exactly the bug PPO solves with its importance ratio; vanilla REINFORCE just does one update per rollout.

Forgetting to detach the baseline: in the policy loss, advantage = returns − V_φ(s). If you don't .detach() the baseline, gradients from the policy loss flow into V_φ and confuse its objective (which is MSE against returns, not policy improvement). Detach.

Sign errors: policy gradient is gradient ascent (maximize return). PyTorch optimisers do gradient descent. Your loss must be −(log_prob * advantage).mean() — the minus sign is not optional. Half the “why isn't my agent learning” bugs are this.

Return computation direction: returns are cumulative from the end backwards — G_t = r_{t+1} + γ·G_{t+1}. The patron tallies the night by walking backwards from the last pull. If you accidentally sum forward or forget to reset at episode boundaries, the gradients are nonsense.
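A quick self-check for that last gotcha: the backward recurrence G_t = r_{t+1} + γ·G_{t+1} must reproduce the explicit sum Σ_k γ^{k−t}·r_{k+1} (the reward sequence here is made up):

```python
rewards = [1.0, 0.0, 2.0, 3.0]             # rewards[t] plays the role of r_{t+1}
gamma = 0.9

# backward recurrence, the way returns_from / compute_returns above do it
G_back, running = [], 0.0
for r in reversed(rewards):
    running = r + gamma * running
    G_back.append(running)
G_back.reverse()

# explicit definition, summed directly
G_sum = [sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
         for t in range(len(rewards))]
# the two agree; a forward sum or a missed episode-boundary reset would not
```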

Measure the baseline's worth

Run two agents on CartPole with the same seed and hyperparameters: (a) vanilla REINFORCE, (b) REINFORCE with a learned value baseline. Define “solved” as average return ≥ 475 over the last 100 episodes.

Record episodes-to-solve for each. Plot the two learning curves on the same axes. The baseline version should solve it in roughly half as many episodes, with visibly less jitter along the way.

Bonus: add normalized returns to (b) and measure again. Then disable reward-to-go (use the full-trajectory tally for every timestep) in (a) and watch training get dramatically worse — that's how much free variance reduction you were getting.
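For the bookkeeping, a small helper can decide when "solved" happened, assuming you collect per-episode returns in a list (the function name and shape are mine, not from any library):

```python
from collections import deque

def episodes_to_solve(returns, threshold=475.0, window=100):
    """First episode index at which the trailing-window mean crosses the bar,
    or None if it never does."""
    recent = deque(maxlen=window)
    for ep, ret in enumerate(returns):
        recent.append(ret)
        if len(recent) == window and sum(recent) / window >= threshold:
            return ep
    return None
```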

What to carry forward. REINFORCE is the base case of every policy-gradient method: play an episode at the slot machine, tally the payout, weight log-probabilities by returns, step. Its unbiased gradient comes with ruinous variance, which is why every modern variant adds a baseline, an advantage estimator, and sometimes a trust region. Understanding REINFORCE + baseline is most of what you need to understand PPO, and PPO is most of what you need to understand RLHF.

Next up — Actor-Critic. The casino tracker has one glaring problem: it has to wait until the end of the night to learn anything. What if the patron could lean over to a seasoned regular at the next stool and ask, mid-pull, “is this state usually a good spot?” That regular is the critic — a value function V_φ(s) trained alongside the policy, quietly estimating how good each state is so the policy doesn't have to finish the whole episode to get feedback. You get tighter advantage estimates, bootstrapped returns, and the door opens to A2C, A3C, and PPO at scale. The critic is what turns the casino tracker into something that can actually ship.
