Policy Gradients

Optimize the policy directly via gradient ascent on expected reward.

Hard · ~15 min read · lesson 3 of 6

Picture the policy as a dashboard. Every weight in the network is a knob. Your agent twists a particular set of those knobs and spits out a distribution over actions. It takes an action, the world answers with reward, the episode ends, and now the question staring you in the face is: which knob, if I turned it a hair, would make the payoff bigger next time?

That question has a name. It's called a policy gradient. It's a sensitivity dial — one reading per knob, telling you which direction of twist tends to pay off, and how strongly. You don't need the rulebook of the world. You don't need to differentiate through the environment. You just need to notice which direction of knob-twiddling correlates with more reward, averaged over a pile of rollouts. Every policy-based algorithm in reinforcement learning — vanilla REINFORCE, A2C, TRPO, PPO, the thing tuning ChatGPT at this very moment — is a refinement of that single sensitivity reading.

The MDP gave us the contract: states, actions, rewards. Q-learning answered the control problem through a cheat-sheet-shaped back door — learn the value of every state-action pair, then argmax. That works until your actions live in ℝⁿ — steering a joint to 0.314159 radians, throttling a thrust to 0.42 — and the argmax turns into an optimisation problem per step. It also fails when the optimal policy is stochastic (mix rock/paper/scissors randomly; any deterministic policy gets exploited). Value-based RL cannot, by construction, commit to anything other than the current best-scoring action.

Policy-based RL throws the cheat sheet out and walks up to the dashboard. Parameterise π_θ(a | s) as a neural net that reads a state and outputs a distribution over actions. Do gradient ascent on expected return. No value function, no argmax, no requirement that the environment be differentiable. This lesson derives the one line of calculus that makes that possible — the sensitivity dial — stares at it until it stops being magic, and then writes it three times in code.

The objective is embarrassingly clean: maximise the expected return under the policy.

the objective
J(θ)   =   E_{τ ~ π_θ} [ R(τ) ]

where τ = (s₀, a₀, r₀, s₁, a₁, r₁, …)   is a trajectory
      R(τ) = Σ_t γᵗ rₜ                   is its discounted return
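
To make R(τ) concrete, here's a tiny sketch — γ and the reward sequence are made up for illustration:

```python
# Discounted return R(τ) = Σ_t γᵗ rₜ for a toy 4-step trajectory.
# γ and the rewards here are invented purely to show the arithmetic.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 2.0]        # r₀, r₁, r₂, r₃

# Each reward is discounted by how long we waited for it.
R = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(R, 3))   # 1·1 + 0 + 0 + 2·0.9³ = 2.458
```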

And here is the result that launched the field — the policy gradient theorem. It says the gradient of this expectation has a form you can actually compute:

the policy gradient theorem (Sutton et al. 2000)
∇_θ J(θ)   =   E_{s, a ~ π_θ} [ ∇_θ log π_θ(a | s) · Q^π(s, a) ]

Stare at this for a second. It's the sensitivity reading we wanted. On the left, the gradient of expected return with respect to every knob on the dashboard — the thing that ordinarily requires you to differentiate through a sampling distribution and, worse, through the environment's transition dynamics. On the right, an expectation of a product of two quantities we can read off directly:

  • ∇_θ log π_θ(a | s) — which direction on the dashboard would make the policy more likely to pick this exact action in this exact state. We own the policy network; this is one backward pass through a softmax or Gaussian head.
  • Q^π(s, a) — how good the action actually was, the expected return from taking a in s and continuing under π. We don't own the environment, but we can estimate this from the rewards we collected.

No derivative of the transition model appears. The environment can be a black box, a physics simulator, a real robot, a chatbot with a reward model bolted on — it doesn't matter, because we never differentiate through it. The knob sensitivities live entirely on our side of the wall.
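
The piece we own — ∇_θ log π — can be checked directly. A quick sketch (the logits are made up) verifying the closed-form softmax score, e_a − probs, against finite differences:

```python
import numpy as np

# For a softmax head, ∇_logits log π(a) has a closed form: e_a − probs
# (indicator of the chosen action minus the probability vector).
# No environment involved — this is the half of the dial we own outright.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.5, -1.0, 2.0])   # made-up policy head output
a = 2                                  # the action we happened to take
probs = softmax(logits)

analytic = -probs.copy()
analytic[a] += 1.0                     # e_a − probs

# Finite-difference check of ∇_logits log π(a)
eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    bump = logits.copy(); bump[i] += eps
    numeric[i] = (np.log(softmax(bump)[a]) - np.log(probs[a])) / eps

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```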

policy gradient — derivation, one equation at a time
log-derivative trick · Monte-Carlo estimator

1 · objective
J(θ) = E_{τ ~ π_θ} [ R(τ) ]
goal: pick θ that maximises expected return over trajectories sampled from our policy.

2 · expand expectation
∇_θ J(θ) = ∇_θ ∫ π_θ(τ) · R(τ) dτ

3 · push gradient inside
∇_θ J(θ) = ∫ ∇_θ π_θ(τ) · R(τ) dτ

4 · log-derivative trick
∇_θ π_θ(τ) = π_θ(τ) · ∇_θ log π_θ(τ)

5 · expectation form
∇_θ J(θ) = E_{τ ~ π_θ} [ ∇_θ log π_θ(τ) · R(τ) ]

6 · REINFORCE estimator
∇̂_θ J(θ) ≈ (1/N) Σᵢ Σ_t ∇_θ log π_θ(aₜ | sₜ) · R(τᵢ)

Step through the derivation. The move that makes everything work is the log-derivative trick: we can't move ∇_θ inside the expectation directly, because the distribution we're averaging over itself depends on every knob on the dashboard. But the identity

log-derivative identity
∇_θ p_θ(x)   =   p_θ(x) · ∇_θ log p_θ(x)

lets us rewrite ∇_θ ∫ p_θ(x) f(x) dx as ∫ p_θ(x) · ∇_θ log p_θ(x) · f(x) dx, which is an expectation again — one we can estimate by sampling. That's the whole trick. Every score function estimator, every REINFORCE variant, every modern policy gradient algorithm is built on that one line of calculus. The ∇ log π × return reveal is the sensitivity dial in its final form: the direction-of-twist times how much reward that twist tends to produce.
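
You can watch the trick work numerically. A minimal sketch (the distribution and f are made up): estimate ∇_θ E[f(x)] by sampling ∇_θ log p_θ(x) · f(x), and compare against a closed form we can compute by hand:

```python
import math, random

# Score-function estimator on the smallest possible example:
# x ~ Bernoulli(p) with p = σ(θ), and f(x) = x.
# Then E[f] = p, so ∇_θ E[f] = p(1 − p) exactly — a ground truth to check.
random.seed(0)
theta = 0.5
p = 1 / (1 + math.exp(-theta))          # σ(θ)

# For a Bernoulli parameterised by a logit, ∇_θ log p_θ(x) = x − p.
N = 200_000
samples = (1 if random.random() < p else 0 for _ in range(N))
est = sum((x - p) * x for x in samples) / N

print(abs(est - p * (1 - p)) < 0.01)    # True — sampled estimate matches
```

The estimator never touches the derivative of the sampling process itself — exactly the property that lets the environment stay a black box.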

Log-prob gradient (personified)
I am the credit carrier. I am exactly one thing — the direction on the dashboard that would make this specific action in this specific state more likely. I don't know whether the action was good. I don't know whether the episode succeeded. Someone else multiplies me by the return. My only job is to point toward the knob-twist that reinforces this choice. Scale me up, the policy commits harder. Scale me negative, it backs off. I am the steering wheel; the return is the driver.

The theorem hands us a gradient in terms of Q^π — which we don't have. The simplest workaround: roll out a whole episode, measure the return that actually followed each step, and use that as a one-sample Monte Carlo estimate of Q^π. This is REINFORCE (Williams, 1992):

REINFORCE — Monte Carlo policy gradient
   Gₜ     =   Σ_{k=t..T} γ^{k-t} r_k               return from step t onward

∇_θ J(θ)  ≈   Σ_t  ∇_θ log π_θ(aₜ | sₜ) · Gₜ         sum over one trajectory

   θ     ←   θ  +  α · ∇_θ J(θ)                    ascent, not descent

Read that middle line slowly, because it's the reason REINFORCE is a specific instance of the general sensitivity dial and not its own new idea. The policy gradient theorem said: knob-direction times how-good-the-action-was, averaged. REINFORCE just picks the cheapest possible how-good-the-action-was estimator — the raw return you observed — and plugs it in. Swap that estimator and you get every other algorithm in the family. Replace G_t with G_t − V(s_t) and you get REINFORCE-with-baseline. Replace it with the TD estimate r_t + γV(s_{t+1}) and you're doing actor-critic. Replace it with a clipped importance-weighted advantage and you're running PPO. Same dashboard, same knob-sensitivity reading, different estimator for the scalar we multiply it by.
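
That swap can be written down in a few lines. A sketch with made-up rewards and hypothetical critic values V(s_t), showing the three scalars that would multiply the same ∇log π at t = 0:

```python
# Same ∇log π, different scalar — the estimator family side by side.
# The rewards and the critic's value guesses below are invented.
gamma = 0.99
rewards = [0.0, 1.0, 0.0, 2.0]
V       = [0.8, 0.9, 1.1, 1.6]          # hypothetical V(s_t) estimates

# return-to-go G_t via the backward recurrence
G, running = [0.0] * 4, 0.0
for t in reversed(range(4)):
    running = rewards[t] + gamma * running
    G[t] = running

reinforce_weight = G[0]                               # raw return → REINFORCE
baseline_weight  = G[0] - V[0]                        # G_t − V(s_t) → +baseline
td_weight        = rewards[0] + gamma * V[1] - V[0]   # r + γV(s') − V(s) → actor-critic

print(round(reinforce_weight, 3),
      round(baseline_weight, 3),
      round(td_weight, 3))              # 2.931 2.131 0.091
```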

Three things to notice about REINFORCE proper. First, it's an unbiased estimator of the true gradient — no function approximation for Q, just the actual return we observed. Second, it has famously high variance — a single episode's return is one draw from a long chain of random events, and the same policy can cough up wildly different numbers on back-to-back runs. The direction of the sensitivity dial is right on average, but its magnitude jitters like a needle in a hurricane. Third, it's embarrassingly simple: one forward pass to get log π, one backward pass scaled by G, done.
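
The variance is easy to feel. A sketch on a made-up two-arm bandit, comparing single-sample gradient estimates with and without a constant baseline (the technique later lessons formalise):

```python
import random

# How jittery is the dial? Single-sample REINFORCE estimates on an invented
# 2-arm bandit: π = [0.5, 0.5], arm 0 pays +1 w.p. 0.2, arm 1 w.p. 0.8.
# True gradient of E[r] w.r.t. arm 1's logit: p₁p₀(r̄₁ − r̄₀) = 0.25 · 0.6 = 0.15.
random.seed(0)

def one_sample(baseline):
    a = random.randrange(2)                       # sample from π = [.5, .5]
    r = 1.0 if random.random() < (0.8 if a == 1 else 0.2) else 0.0
    score = (1.0 if a == 1 else 0.0) - 0.5        # ∇log π w.r.t. arm 1's logit
    return score * (r - baseline)

stats = {}
for b in (0.0, 0.5):                              # no baseline vs b = E[r]
    g = [one_sample(b) for _ in range(50_000)]
    mean = sum(g) / len(g)
    std = (sum((x - mean) ** 2 for x in g) / len(g)) ** 0.5
    stats[b] = (mean, std)
    print(f"baseline={b}: mean≈{mean:.3f}  std≈{std:.3f}")
# the mean stays ≈0.15 either way (subtracting a constant is unbiased);
# the std drops from ≈0.32 to ≈0.20
```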

return-to-go → gradient weight: ∇_θ log π(a_t|s_t) · G_t

 t   r_t   G_t (return-to-go)   |∇log π|   ∇log π · G_t
 0    0          +2.57            0.90        +2.31
 1   -1          +2.85            1.10        +3.14
 2    0          +4.28            0.60        +2.57
 3   +2          +4.76            1.40        +6.66
 4   +1          +3.06            1.20        +3.67
 5    0          +2.29            0.80        +1.83
 6   -2          +2.54            1.50        +3.82
 7   +1          +5.05            0.90        +4.54
 8    0          +4.50            0.70        +3.15
 9   +5          +5.00            1.30        +6.50

Σ grad⁺ = 38.19 · Σ grad⁻ = 0.00 · net = 38.19

positive G_t pushes log π(a_t|s_t) up — the network makes that action more likely. negative G_t does the opposite. steps that land near a big future reward get amplified, while unrelated steps shrink toward zero.

Here's a trajectory laid out step by step. Watch the sensitivity reading at each timestep: the log-prob gradient of the action we took, scaled by the return that followed. Late actions get a small return (little time left to collect reward). Early actions get a big return (everything the agent did afterwards feeds into their credit). That asymmetry is exactly the credit assignment problem in RL — and REINFORCE's answer is blunt: every action gets credit for every reward that came after it, discounted by how long it had to wait.

If the final reward was +1, every action in the trajectory gets a positive twist-direction, proportional to its discounted share. If the final reward was −1, every action gets pushed down. This is unsubtle — even actions that were genuinely good early in a losing episode get suppressed — and it's why variance reduction matters so much in practice. The dial is pointing the right way on average, but any single reading shouts where it should whisper.

Return (personified)
I am the scalar that decides whether to turn the sensitivity reading up or down. I come from the environment, not from your network — I am a number you measured, not a quantity you differentiate. Detach me before you multiply. I am the judge, not the witness. Treat me as a target, carry me through the loss as a coefficient, and I'll tell your policy which of its recent choices deserve louder voices and which deserve the silent treatment.

Three implementations. A pure-Python REINFORCE on a 3-arm bandit so the moving parts are visible. A NumPy version on a tiny CartPole surrogate that shows how returns get computed in a loop. A PyTorch version with autograd, an entropy bonus, and a value-function baseline — this is what you'd actually write.

layer 1 — pure python · reinforce_bandit.py
python
import math, random

# 3-arm bandit — only one "state"; each arm pays +1 with its own probability.
true_probs = [0.2, 0.8, 0.5]

# Policy = softmax over 3 logits. θ = the three logits themselves.
theta = [0.0, 0.0, 0.0]
alpha = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    return [e / Z for e in exps]

for step in range(501):
    probs = softmax(theta)
    # Sample action from the policy.
    a = random.choices(range(3), weights=probs)[0]
    # Play it, observe the reward.
    r = 1.0 if random.random() < true_probs[a] else 0.0
    # ∇log π(a|·) for softmax:  e_a − probs
    # (indicator of the chosen action, minus the probability vector).
    grad_logp = [-p for p in probs]
    grad_logp[a] += 1.0
    # REINFORCE update: θ ← θ + α · G · ∇log π(a|·).  Here G = r (one-step).
    for i in range(3):
        theta[i] += alpha * r * grad_logp[i]
    if step in (0, 50, 200, 500):
        ps = [f"{p:.2f}" for p in probs]
        print(f"step {step:>3}: probs=[{' '.join(ps)}]  pulled={a}  reward={int(r)}")

print("converged on arm", max(range(3), key=lambda i: theta[i]),
      "(true best = arm 1 with p=0.8)")
stdout
step   0:  probs=[0.33 0.33 0.33]  pulled=1  reward=0
step  50:  probs=[0.25 0.41 0.34]  pulled=1  reward=1
step 200:  probs=[0.11 0.72 0.17]  pulled=1  reward=1
step 500:  probs=[0.03 0.93 0.04]  pulled=1  reward=1
converged on arm 1 (true best = arm 1 with p=0.8)

One state, one step, a softmax, a hand-rolled gradient. Three knobs on the dashboard, one sensitivity reading per step, one twist per sample. Scale this up: instead of three logits, a neural net. Instead of one step, a whole episode whose return we have to accumulate backward through time. NumPy.

layer 2 — numpy · reinforce_cartpole.py
python
import numpy as np
import gym  # or gymnasium

env = gym.make("CartPole-v1")
rng = np.random.default_rng(0)

# Simple 2-layer policy: obs(4) → 16 → 2 (softmax).
W1 = rng.standard_normal((4, 16)) * 0.1
W2 = rng.standard_normal((16, 2)) * 0.1
lr = 1e-2
gamma = 0.99

def forward(obs):
    h = np.tanh(obs @ W1)
    logits = h @ W2
    logits -= logits.max()
    p = np.exp(logits); p /= p.sum()
    return p, h

def compute_returns(rewards):
    """G_t = r_t + γ r_{t+1} + γ² r_{t+2} + …   backward recurrence."""
    G = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    # Normalise within batch — variance reduction, see callout below.
    G = (G - G.mean()) / (G.std() + 1e-8)
    return G

for ep in range(501):
    obs, _ = env.reset()
    obss, acts, rewards, hiddens = [], [], [], []
    done = False
    while not done:
        p, h = forward(obs)
        a = rng.choice(2, p=p)
        obs2, r, term, trunc, _ = env.step(a)
        done = term or trunc
        obss.append(obs); acts.append(a); rewards.append(r); hiddens.append(h)
        obs = obs2

    # Monte Carlo returns, normalised.
    G = compute_returns(np.array(rewards))

    # Gradient accumulation — one REINFORCE step per trajectory.
    dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
    for t in range(len(obss)):
        p, _ = forward(obss[t])
        dlogit = -p; dlogit[acts[t]] += 1.0           # ∇log π w.r.t. logits
        dlogit *= G[t]                                # scale by return
        dW2 += np.outer(hiddens[t], dlogit)           # ∇w.r.t. W2
        dh   = dlogit @ W2.T * (1 - hiddens[t]**2)    # backprop through tanh
        dW1 += np.outer(obss[t], dh)                  # ∇w.r.t. W1

    # Ascent (note the plus).
    W1 += lr * dW1
    W2 += lr * dW2

    if ep in (0, 50, 150, 300, 500):
        print(f"ep {ep:>3}: steps={len(rewards):>3}  return={sum(rewards):.2f}"
              + ("   (solved)" if sum(rewards) >= 195 else ""))
stdout
ep   0:  steps= 14  return=14.00
ep  50:  steps= 42  return=42.00
ep 150:  steps= 98  return=98.00
ep 300:  steps=196  return=196.00
ep 500:  steps=200  return=200.00   (solved)
pure python → numpy

scalar reward r on one step → compute_returns(rewards): backward γ-recurrence
  (multi-step trajectories need the whole return, not just r_t)

hand-written softmax over 3 logits → forward(obs): obs @ W1 → tanh → @ W2 → softmax
  (the policy becomes a 2-layer net; same log-prob gradient shape)

θ ← θ + α · r · ∇log π → dW1, dW2 accumulated over the trajectory, scaled by G_t
  (one update per episode, summed across timesteps)

Now PyTorch. Autograd does the backprop for us, we add a value-function baseline to cut variance, and we tack on an entropy bonus to keep the policy from collapsing too fast.

layer 3 — pytorch · reinforce_pytorch.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gym

env = gym.make("CartPole-v1")
gamma = 0.99

class ActorCritic(nn.Module):
    """Shared trunk → policy head (logits) and value head (scalar baseline)."""
    def __init__(self):
        super().__init__()
        self.trunk  = nn.Sequential(nn.Linear(4, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh())
        self.policy = nn.Linear(64, 2)
        self.value  = nn.Linear(64, 1)
    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy(h), self.value(h).squeeze(-1)

net = ActorCritic()
opt = torch.optim.Adam(net.parameters(), lr=3e-3)

for ep in range(501):
    obs, _ = env.reset()
    log_probs, values, rewards, entropies = [], [], [], []
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        logits, v = net(obs_t)
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        obs, r, term, trunc, _ = env.step(a.item())
        done = term or trunc
        log_probs.append(dist.log_prob(a))        # ∇log π is handled by autograd
        values.append(v)                          # baseline for variance reduction
        entropies.append(dist.entropy())          # bonus: encourage exploration
        rewards.append(r)

    # Monte Carlo returns (targets — no grad flows through these).
    G, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        G.insert(0, running)
    G = torch.tensor(G, dtype=torch.float32)
    G = (G - G.mean()) / (G.std() + 1e-8)          # per-batch normalisation

    log_probs = torch.stack(log_probs)
    values    = torch.stack(values)
    entropies = torch.stack(entropies)

    # Advantage = return − baseline.  Detach G; only V's own loss trains V.
    advantage = G - values.detach()
    policy_loss = -(log_probs * advantage).mean()  # minus, because Adam minimises
    value_loss  = F.mse_loss(values, G)            # critic regresses toward G
    entropy_bonus = entropies.mean()
    loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus

    opt.zero_grad(); loss.backward(); opt.step()

    if ep in (0, 100, 300, 500):
        print(f"ep {ep:>3}: return={sum(rewards):>6.1f}   "
              f"loss={loss.item():>6.2f}   ent={entropy_bonus.item():.2f}"
              + ("   (solved)" if sum(rewards) >= 195 else ""))
stdout
ep   0:  return= 21.0   loss=  1.07   ent=0.69
ep 100:  return= 48.2   loss=  0.41   ent=0.62
ep 300:  return=173.4   loss= -0.11   ent=0.39
ep 500:  return=198.7   loss= -0.08   ent=0.22   (solved)
numpy → pytorch

manual ∇log π (dlogit = -p; dlogit[a] += 1) → dist.log_prob(a) + loss.backward()
  (autograd traces the log-prob through the softmax)

scale grads by G_t, accumulate dW1/dW2 → -(log_probs * advantage).mean() then opt.step()
  (negate because PyTorch optimisers minimise; we want ascent)

no baseline — raw return → advantage = G - values.detach()
  (the critic cuts variance; detach so G is a pure target)

no exploration incentive → - 0.01 * entropies.mean()
  (a small bonus keeps the policy from collapsing prematurely)

Gotchas

Forgetting to detach returns. The return G_t is a target, not a differentiable quantity — you computed it from rewards the environment gave you. If you leave it attached to the computation graph (easy to do when G = something_involving_V), gradients flow back through it in directions you didn't plan. Always .detach() the thing that multiplies the log-prob.

Normalising returns across episodes vs within a batch. Per-batch normalisation (G − mean) / std helps a lot — it absorbs drift in the scale of returns as the policy improves. But normalising across all history destroys signal (good returns stop looking good once everything is good). Do it per update step.

Entropy coefficient too high. Set β = 0.1 by accident and your policy will refuse to commit to anything — you'll be training a glorified uniform distribution for the entire run. Watch the entropy curve: it should decrease over training, just not all the way to zero.

Using V(s) as a baseline without detaching. The actor's gradient uses advantage = G − V(s). If you leave V attached, the actor's loss will also try to push V around — usually in the wrong direction. Detach V in the policy loss; let the value loss train it separately.

Treating loss sign carelessly. The theorem gives us a gradient we want to ascend. PyTorch optimisers descend. Negate the policy loss. Forgetting this once means your agent actively learns to do worse — it's an almost-silent bug because the numbers look plausible. Your sensitivity dial is still pointing at the right knobs; you're just turning them the wrong direction.

Train REINFORCE on CartPole with and without a baseline

Start with the PyTorch snippet above, but strip out the value head and the value loss so it's vanilla REINFORCE — the policy gradient scaled by the raw normalised return G. Train for 500 episodes on CartPole-v1 and plot the return per episode.

Now add the value-function baseline back in. Same network, same hyperparams, same seed. Plot the return curve on the same axes. You should see two things: the baseline version reaches “solved” (200 steps) in fewer episodes, and the variance of the return curve is visibly lower. That gap is the baseline earning its keep — a steadier dial reading per update.

Bonus: run both variants across 10 seeds and plot the mean ± std. The baseline's contribution is mostly about variance reduction across seeds, not raw final performance — modern actor-critic methods exist because of this plot.

What to carry forward. Policy gradients are a sensitivity dial. For every knob on the policy network, ∇ log π × return tells you which direction of twist tends to produce more reward — averaged, unbiased, estimable from trajectories alone, no environment derivative required. Baselines and entropy bonuses are not optional sophistication; they are the difference between a method that works and a method that theoretically should. Everything downstream in this section — REINFORCE as a first-class algorithm, actor-critic, PPO — is this same dial, paired with a better way of estimating the scalar that scales it.

Next up — REINFORCE. We've derived the estimator and glued a demo version together here for pedagogy. The next lesson zooms all the way in on REINFORCE as a first-class algorithm: the training loop, batching across episodes, when to reset, how to read its jagged casino-tracker of a learning curve, and the specific hyperparameter traps that make it look broken when it isn't. From there it's a short step to actor-critic, PPO, and the algorithms actually running production RL today.
