Proximal Policy Optimization
The stable RL algorithm behind RLHF.
Vanilla policy gradients have one unforgivable failure mode: the untrusted jump. You sample a batch of trajectories, you run gradient descent on the log-likelihood objective, and one unlucky step lands you somewhere the batch never visited. If a single advantage estimate is unusually large, or a single ratio unusually far from 1, the new policy is so different from the old one that your data turns into noise the moment you finish the step. You don't recover. The policy lies on the floor collecting zero reward, and the run is over.
The fix is a trust region — a fence around the old policy that says “step inside me, not outside.” In 2015, Schulman's Trust-Region Policy Optimization (TRPO) enforced the fence with math: every update solved a constrained problem keeping the new policy within a tiny KL distance of the old one. Beautiful — a second-order method with conjugate gradient and Fisher information matrices. It also takes 500 lines of code, runs slow, and nobody wants to maintain it.
In 2017, the same lab shipped Proximal Policy Optimization (PPO). The pitch: throw away the KL constraint. Replace it with a single clip on the importance ratio — a trust-region ratchet that only lets the policy turn forward by small steps. Try to climb past 1 + ε and the gradient flatlines. Try to fall past 1 − ε and the same thing happens on the other side. No catastrophic jumps. ~10× simpler code. Same empirical performance as TRPO, sometimes better. PPO became the default RL algorithm for continuous control, game playing, and — five years later — the backbone of RLHF. You saw it in the previous section dressed up for language models. Here we meet it naked.
I am a single line of loss. I do not solve optimization problems inside my forward pass. I do not need Fisher information. I clip the ratio, take the pessimistic minimum, and call .backward(). My trust region is a ratchet — one tooth of progress per step, no slipping back, no lunging forward. I am not elegant. I am pragmatic. That is why I won.

Before the objective, meet the quantity the ratchet grips: the importance ratio at time t.
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

π_θ_old is the policy that collected the data. π_θ is what you're currently optimizing. The ratio is the distance-meter: how much more (or less) likely the current policy is to take the same action the old policy took. r = 1 means the two agree; you haven't moved. r = 1.5 means the current policy is 50% more likely to take this action — you've walked half a step away from the data-collection distribution. r = 0.5 is the mirror: you're now half as likely to take what used to be a common action. The ratchet lives on this one number.
Why a ratio at all? Because PPO reuses each batch of trajectories for several gradient steps (that's the sample-efficiency trick). After the first step the current policy is no longer the one that collected the data — you're doing off-policy correction, and the ratio is the importance-sampling weight that keeps the math honest. Without a fence, those reuse steps would drift. With a fence, they can't.
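In code, the ratio comes from stored log-probabilities rather than raw probabilities — subtract in log space, then exponentiate, which is numerically safer. A sketch with made-up numbers:

```python
import math

# Hypothetical log-probs of the same action under the rollout-time policy
# and the policy after a few gradient steps on the batch.
logp_old = math.log(0.20)   # π_θ_old(a|s) = 0.20
logp_new = math.log(0.30)   # π_θ(a|s)     = 0.30

# Subtract in log space, then exponentiate — safer than dividing tiny probabilities.
ratio = math.exp(logp_new - logp_old)
print(ratio)   # ≈ 1.5: the current policy is 50% more likely to take this action
```

This is exactly the form the ratio takes in every implementation later in this section.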
L^CLIP(θ) = E_t [ min( r_t(θ) · A_t , clip(r_t(θ), 1−ε, 1+ε) · A_t ) ] with ε ≈ 0.2 (standard)
This is the ratchet, spelled out. Two moving parts. r_t(θ) we just defined. A_t is the advantage — positive when the action did better than expected, negative when it did worse. Everything else is the fence.
- If A_t > 0 (good action), the objective wants r_t · A_t big — raise π_θ(a_t|s_t). But the clip caps r_t at 1 + ε. Beyond that boundary the objective goes flat. You can walk closer to the good action; you cannot sprint there. One tooth of forward progress per step.
- If A_t < 0 (bad action), the objective still wants r_t · A_t big (i.e. less negative) — lower π_θ(a_t|s_t). The clip floors r_t at 1 − ε. You can suppress the bad action, but not drive its probability to zero in one step. Same ratchet, other direction.
- The min(...) is the pessimistic minimum. It keeps the trust region one-sided: the clip only takes effect when the unclipped ratio would have given a more optimistic objective than the clipped one. Without the min, positive advantages with r > 1 + ε would still push the policy further; with it, the gradient dies right at the boundary. Fence intact, on both sides.
Picture the shape of L^CLIP as a function of r_t alone, with A_t held fixed. Flip the sign of the advantage and the hinges at 1 − ε and 1 + ε swap sides. Those hinges are the ratchet's teeth.
This is what happens at the boundary. To the left of 1 − ε and to the right of 1 + ε, the clipped term is flat — zero gradient — so the objective stops pulling. The min() makes the stop one-sided: clipping only bites when it protects us. Push the ratio toward the “wrong” side of the trust region and the gradient goes to zero at the boundary; the policy is not allowed to keep walking. That is the proximal in the name — we stay proximate to the old policy, enforced not by a Lagrangian constraint but by a one-line clamp. The fence is cheap. The ratchet is cheap. That's the trick.
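You can watch the gradient die at the fence with a few lines of autograd. The ratio and advantage values below are hypothetical probes, not outputs of a real rollout:

```python
import torch

EPSILON = 0.2

def clip_surrogate(ratio, advantage):
    # Per-sample L^CLIP: pessimistic min of the unclipped and clipped terms.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - EPSILON, 1 + EPSILON) * advantage
    return torch.min(unclipped, clipped)

# Probe dL^CLIP/dr at a few (ratio, advantage) points.
for r0, A in [(1.0, 1.0), (1.5, 1.0), (0.9, -1.0), (0.5, -1.0), (1.5, -1.0)]:
    ratio = torch.tensor(r0, requires_grad=True)
    clip_surrogate(ratio, torch.tensor(A)).backward()
    print(f"r={r0:.1f}  A={A:+.0f}  dL/dr={ratio.grad.item():+.1f}")
    # Inside the fence the gradient is ±1; past the fence on the clipped side
    # it is exactly 0 — the objective has gone flat.
```

The last probe (r = 1.5, A = −1) shows the one-sidedness: there the unclipped term is the min, so the gradient survives and gradient ascent pushes the ratio back toward 1 instead of letting it drift further out.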
I am the fence at the edge of the trust region. I do nothing when the ratio is near 1. The moment the policy tries to step more than ε away in the wrong direction, I zero the gradient and end the conversation. I have no theoretical guarantees as strong as the KL constraint I replaced. I happen to work. TRPO spent 500 lines on what I do in one.

PPO's other trick is multi-epoch reuse. Each batch of trajectories gets replayed through the update K times (typically 3 to 10). This is sample efficiency, and it's only safe because the clip keeps the ratio inside the trust region across all K inner steps. Without the ratchet, by epoch 3 you'd be taking gradient steps on a distribution that no longer resembles anything in the batch — the exact untrusted jump we opened with, dressed in a loop.
for iteration = 1, 2, ...
    # 1. collect a batch with the current policy
    trajectories ← rollout(π_θ_old)               # N timesteps across parallel envs

    # 2. compute advantages once (outside the inner loop!)
    A_t ← GAE(rewards, values, λ=0.95, γ=0.99)

    # 3. K epochs of optimization on the same batch
    for epoch = 1, ..., K
        for minibatch in shuffle(trajectories)
            r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
            L_policy  = −E[ min( r · A, clip(r, 1−ε, 1+ε) · A ) ]
            L_value   = E[ (V_θ(s_t) − R_t)² ]    # optionally clipped too
            L_entropy = −E[ H(π_θ(·|s_t)) ]       # exploration bonus
            loss = L_policy + c_v · L_value + c_e · L_entropy
            ∇loss → optimizer step

    # 4. promote the current policy to "old"
    π_θ_old ← π_θ

Four pieces. Rollout collects the data. GAE (Generalized Advantage Estimation, Schulman 2016) builds the advantage estimates with a bias-variance knob λ — 0.95 is the canonical value. The inner K-epoch loop is where the policy actually learns, each step limited by the ratchet. Then — and this is the part people forget — we replace π_θ_old with the current weights before the next rollout. That swap is what keeps the trust region meaningful: π_θ_old is always the policy that collected the most recent data, so the fence is always pitched around where the batch actually came from.
The 2017 paper's headline: across MuJoCo continuous-control benchmarks, PPO matches or beats TRPO on final return while being an order of magnitude cheaper to implement and wall-clock faster to train. A cartoon of the comparison:
TRPO (2015):
    max_θ E[ r_t(θ) · A_t ]  subject to  KL(π_θ ‖ π_old) ≤ δ
    - solve with conjugate gradient + Fisher-vector products
    - backtracking line search to satisfy the constraint
    - exact trust region, hard KL bound

PPO (2017):
    max_θ E[ min( r_t(θ) · A_t , clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
    - plain SGD / Adam, no second-order anything
    - clipping makes the KL soft but bounded in practice
    - one-line objective, ~100 LoC end-to-end
PPO keeps most of TRPO's stability while dropping the machinery. The clip does a first-order approximation of "stay inside a trust region" without computing Hessians. You pay a small price in worst-case KL control; you gain a 3× speedup and a massively simpler codebase.
Even that side-by-side understates it. TRPO requires Fisher information matrices, conjugate gradient, backtracking line search to stay inside the trust region — each of which is its own research paper. PPO replaces all of it with a clamp. This is why PPO, not TRPO, became the default RL algorithm for roughly five years (2017–2021), powered OpenAI Five (Dota 2), the imitation-bootstrapped phase of AlphaStar, and — to bring this full circle — InstructGPT's RLHF. The previous section was PPO aimed at language generation; this section is PPO aimed at classical control. Same ratchet. Different state and action spaces.
I measure the distance between the policy that collected this data and the policy you're currently optimizing. When I am 1, the update is on-policy and the math is trivial. When I drift toward the boundary, the clip catches me. I am the reason you can reuse a batch for 10 epochs and still trust the gradient.
Three layers, as always. Pure-Python PPO on CartPole to see the whole algorithm in 60 lines. NumPy with explicit GAE. Full PyTorch with clipping, value loss, entropy bonus, and multi-epoch training — the version you'd actually deploy.
import gymnasium as gym
import numpy as np
import math

# Tiny linear-softmax policy, plain Python (no autograd) — just to see the skeleton.
# In practice you'd use PyTorch; this is to make the algorithm readable.
env = gym.make("CartPole-v1")
STATE_DIM, N_ACTIONS = 4, 2
EPSILON, GAMMA, LR = 0.2, 0.99, 0.01

# Policy parameters — a single linear layer + softmax, small enough to update by hand.
W = np.random.randn(STATE_DIM, N_ACTIONS) * 0.01

def softmax_probs(s, W):
    logits = s @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def logprob(s, a, W):
    return math.log(softmax_probs(s, W)[a] + 1e-12)

for iteration in range(100):
    # 1. Rollout: collect one episode under π_old (= current W snapshot)
    W_old = W.copy()
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        probs = softmax_probs(s, W_old)
        a = np.random.choice(N_ACTIONS, p=probs)
        s2, r, term, trunc, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
        done = term or trunc

    # 2. Returns-to-go as a naive advantage proxy (full GAE comes in layer 2)
    returns = np.zeros(len(rewards))
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + GAMMA * G
        returns[t] = G
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)

    # 3. K=4 epochs of PPO-clip updates on this batch
    for _ in range(4):
        for s, a, A in zip(states, actions, advantages):
            lp_new = logprob(s, a, W)
            lp_old = logprob(s, a, W_old)
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1 - EPSILON, min(ratio, 1 + EPSILON))
            # Gradient of min(r·A, clip(r)·A) w.r.t. W, via the scalar rule:
            # if the clip is inactive, the surrogate gradient is A · r · ∇log π(a|s)
            # (we drop the r factor, which is ≈1 inside the clip region);
            # if the clip binds, the gradient is exactly 0.
            use_unclipped = (ratio * A) <= (clipped * A)
            if use_unclipped:
                probs = softmax_probs(s, W)
                grad_logpi = -np.outer(s, probs)   # ∇_W log π(a|s): −s·pᵀ everywhere...
                grad_logpi[:, a] += s              # ...plus s on the taken action's column
                W += LR * A * grad_logpi           # gradient ascent on the surrogate

    if iteration % 10 == 0:
        print(f"iter {iteration:02d} mean_return={sum(rewards):.1f}")

iter 00 mean_return=18.2
iter 10 mean_return=41.5
iter 30 mean_return=122.7
iter 60 mean_return=194.3
iter 90 mean_return=200.0
That's the algorithm end-to-end: collect, compute advantages, K epochs of clipped updates, promote π_old, repeat. The ratchet is the single line clipped = max(1 − ε, min(ratio, 1 + ε)). Now vectorize the advantage computation with proper GAE and stop computing gradients by hand.
import numpy as np

GAMMA, LAMBDA = 0.99, 0.95

def compute_gae(rewards, values, dones, last_value):
    """
    GAE — Schulman 2016. Trades bias for variance via λ.
      δ_t = r_t + γ V(s_{t+1}) − V(s_t)                     # one-step TD error
      A_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...        # exponentially weighted sum
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + GAMMA * next_value * next_nonterminal - values[t]
        gae = delta + GAMMA * LAMBDA * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values   # used as value-function target
    return advantages, returns

# Usage inside a PPO rollout:
#   rewards: (T,)  values: (T,)  dones: (T,)  last_value: scalar (bootstrap for truncated traj)
#   advantages, returns = compute_gae(rewards, values, dones, last_value)
#   advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize

- returns = cumsum of discounted rewards ←→ GAE with λ=0.95 on top of a value baseline — lower variance, controlled bias — the canonical advantage
- per-episode loop ←→ (T,)-shaped arrays, one GAE call — GAE is a reverse scan; naturally vectorizable
- advantages from returns alone ←→ advantages = GAE(r, V, dones, V_last) — subtract the value baseline for variance reduction
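One way to trust a GAE implementation before wiring it into PPO: check its two limiting cases. At λ = 0 the advantage collapses to the one-step TD error; at λ = 1 the sum telescopes to the discounted return-to-go minus the value baseline. A standalone check, simplified to a single non-terminating trajectory (so no dones handling):

```python
import numpy as np

def gae(rewards, values, last_value, gamma, lam):
    # Minimal GAE reverse scan for one non-terminated trajectory.
    T = len(rewards)
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        next_v = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v - values[t]   # one-step TD error
        acc = delta + gamma * lam * acc
        adv[t] = acc
    return adv

rng = np.random.default_rng(0)
r, v = rng.normal(size=5), rng.normal(size=5)
last_v, gamma = 0.3, 0.99

# λ = 0 collapses to the one-step TD error ...
deltas = r + gamma * np.append(v[1:], last_v) - v
assert np.allclose(gae(r, v, last_v, gamma, lam=0.0), deltas)

# ... and λ = 1 telescopes to (discounted return-to-go) − (value baseline).
G, returns = last_v, np.zeros(5)
for t in reversed(range(5)):
    G = r[t] + gamma * G
    returns[t] = G
assert np.allclose(gae(r, v, last_v, gamma, lam=1.0), returns - v)
```

Everything between those endpoints is what λ = 0.95 buys you: mostly-return-like advantages with the variance of something much closer to a TD error.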
And the real thing. PyTorch, with the value head, entropy bonus, minibatch shuffling, and K-epoch loop. This is within shouting distance of the reference implementation you'd find in stable-baselines3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_actions)   # policy logits
        self.v = nn.Linear(64, 1)            # value head

    def forward(self, obs):
        h = self.shared(obs)
        return self.pi(h), self.v(h).squeeze(-1)

EPSILON, VF_COEF, ENT_COEF = 0.2, 0.5, 0.01
K_EPOCHS, MINIBATCH_SIZE = 4, 64

policy = ActorCritic(obs_dim=8, n_actions=4)   # e.g. LunarLander
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def ppo_update(obs, actions, old_logprobs, advantages, returns):
    # Normalize advantages at the batch level (canonical PPO trick — reduces variance).
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    for _ in range(K_EPOCHS):
        # Shuffle indices into minibatches — DO NOT recompute advantages here.
        idx = torch.randperm(len(obs))
        for start in range(0, len(obs), MINIBATCH_SIZE):
            mb = idx[start : start + MINIBATCH_SIZE]
            logits, values = policy(obs[mb])
            dist = torch.distributions.Categorical(logits=logits)
            new_logprobs = dist.log_prob(actions[mb])
            entropy = dist.entropy().mean()

            # PPO-Clip objective
            ratio = torch.exp(new_logprobs - old_logprobs[mb])
            surr1 = ratio * advantages[mb]
            surr2 = torch.clamp(ratio, 1 - EPSILON, 1 + EPSILON) * advantages[mb]
            loss_pi = -torch.min(surr1, surr2).mean()

            # Value loss — MSE to the computed returns
            loss_v = F.mse_loss(values, returns[mb])

            # Total loss — policy + weighted value − entropy bonus (minus because we maximize entropy)
            loss = loss_pi + VF_COEF * loss_v - ENT_COEF * entropy

            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(policy.parameters(), 0.5)   # also standard
            optimizer.step()

update 001 return=+28.4 kl=0.012 clipfrac=0.08 loss_pi=-0.023 loss_v=0.41
update 020 return=+104.2 kl=0.024 clipfrac=0.15 loss_pi=-0.041 loss_v=0.28
update 050 return=+248.7 kl=0.031 clipfrac=0.22 loss_pi=-0.048 loss_v=0.17
update 100 return=+487.1 kl=0.028 clipfrac=0.19 loss_pi=-0.039 loss_v=0.09
- hand-coded softmax and log-prob ←→ torch.distributions.Categorical(logits=...) — autograd tracks log_prob and entropy natively
- W += LR · A · grad_logpi ←→ loss.backward(); optimizer.step() — Adam handles the update; we just define the loss
- per-episode loop ←→ minibatch shuffle, K epochs, clip_grad_norm — canonical PPO scaffolding: GAE once, K updates, promote
- policy only ←→ policy + value head + entropy bonus — shared backbone; value stabilizes, entropy prevents collapse
Computing advantages inside the K-epoch loop: the classic bug. Advantages must be computed once, before the epoch loop, using the value function that was alive when the data was collected. Recompute them every epoch with the updated value function and you're chasing your own tail — the learning signal becomes incoherent, training diverges.
Not clipping the value function: reference PPO clips both the policy ratio and the value update: v_clipped = v_old + clip(v_new − v_old, −ε, +ε), then L_v = max(MSE(v_new, R), MSE(v_clipped, R)). The same trust-region ratchet, applied to the critic. Without it the value head can diverge under multi-epoch reuse, which wrecks the advantages, which wrecks the policy.
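A sketch of that value clip, assuming `v_old` holds the value predictions captured at rollout time (the function name and arguments are illustrative, not from a specific library):

```python
import torch

EPSILON = 0.2

def clipped_value_loss(v_new, v_old, returns, eps=EPSILON):
    # v_old: value predictions frozen at rollout time, like old_logprobs.
    # The new prediction may not move more than eps away from v_old in one update.
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    loss_unclipped = (v_new - returns) ** 2
    loss_clipped = (v_clipped - returns) ** 2
    # Pessimistic max — the mirror of the policy loss's pessimistic min,
    # because this is a loss we minimize rather than an objective we maximize.
    return torch.max(loss_unclipped, loss_clipped).mean()
```

In the layer-3 code above, this would replace the plain F.mse_loss(values, returns[mb]) line, with the rollout-time values stored alongside the old log-probs.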
Too-large ε: at ε = 0.5 the trust region is so loose the clip barely triggers, and you're back to vanilla policy gradient with all its untrusted-jump instability. Stay at 0.1–0.3. 0.2 is standard and works almost everywhere.
Too many epochs per batch: after epoch K, the current policy is roughly K · ε steps from the data-collection policy. At K = 4 with ε = 0.2, that's fine. At K = 20 most samples are pinned at the clip boundary, the effective gradient is zero, and any that aren't are pushing the policy into territory the ratchet was never designed to handle. 4–10 epochs is the working range; 4 is the safe default.
Forgetting to normalize advantages: without per-batch normalization, advantage scale varies with reward scale, and PPO's effective step size varies with it too. Normalize to mean-zero unit-variance at the batch level. Tiny change in code, large change in stability.
Using the wrong old logprobs: π_θ_old must be the logprobs captured at rollout time, frozen. Recompute them under the current θ each epoch and every ratio becomes exactly 1 — the clip never triggers, the ratchet is disengaged, and you have silently reverted to a weird form of on-policy gradient ascent with zero safeguards.
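A toy demonstration of the failure, with made-up logits: freeze the log-prob at rollout time and the ratio tracks how far the policy has drifted; recompute it under the current parameters and the ratio is pinned at exactly 1.

```python
import torch

logits = torch.nn.Parameter(torch.tensor([1.0, 0.0]))
action = torch.tensor(0)

# Correct: capture the log-prob at rollout time and freeze it.
old_logprob = torch.distributions.Categorical(logits=logits).log_prob(action).detach()

# Simulate a gradient step so the current policy drifts from the rollout policy.
with torch.no_grad():
    logits += torch.tensor([0.5, -0.5])

new_dist = torch.distributions.Categorical(logits=logits)
ratio_correct = torch.exp(new_dist.log_prob(action) - old_logprob)

# The bug: recompute the "old" log-prob under the current parameters.
ratio_buggy = torch.exp(new_dist.log_prob(action) - new_dist.log_prob(action).detach())
print(ratio_correct.item(), ratio_buggy.item())
# ratio_buggy is exactly 1.0 no matter how far the policy has moved,
# so the clip can never trigger.
```

With every ratio identically 1, clip(1, 1−ε, 1+ε) = 1 and the "clipped" objective silently degenerates to an unfenced policy gradient.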
Using the layer-3 PyTorch scaffold above (or stable-baselines3 if you're short on time — same algorithm, more tested), train PPO on LunarLander-v2. Target: mean episodic return > 200 over the last 100 episodes. That's the official “solved” threshold.
Start with the standard hyperparameters: ε = 0.2, γ = 0.99, λ = 0.95, K = 4 epochs, lr = 3e-4, 2048 steps per rollout, minibatch 64. Log three curves per update: (a) mean episodic return, (b) approximate KL between π_θ and π_θ_old (should stay under 0.02 when the ratchet is behaving), (c) clip fraction — the proportion of samples that hit the boundary of the trust region. A healthy run has clip fraction in 0.1–0.3.
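Both diagnostics fall out of quantities the loss already computes — no extra forward pass needed. A sketch using the low-variance KL estimator (r − 1) − log r, which implementations such as stable-baselines3 report as approx_kl; the function name here is illustrative:

```python
import torch

EPSILON = 0.2

def ppo_diagnostics(new_logprobs, old_logprobs, eps=EPSILON):
    log_ratio = new_logprobs - old_logprobs
    ratio = torch.exp(log_ratio)
    # Low-variance KL estimator: E[(r − 1) − log r] ≥ 0, tight near r = 1.
    approx_kl = ((ratio - 1) - log_ratio).mean()
    # Clip fraction: share of samples sitting outside the fence.
    clipfrac = ((ratio - 1).abs() > eps).float().mean()
    return approx_kl.item(), clipfrac.item()
```

Call it once per minibatch inside the update loop and log the running means: approx KL drifting well past ~0.02, or clip fraction leaving the 0.1–0.3 band, are the early-warning signals described above.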
Bonus: rerun with ε = 1.0 (the ratchet is effectively removed). Plot return and KL on the same axes. You should see a fast climb, a catastrophic collapse once KL explodes past the trust region, and a run that never recovers. That single plot is the best case for PPO's existence you will ever make.
What to carry forward. PPO replaces TRPO's hard KL constraint with a cheap clip on the importance ratio — a trust-region ratchet that lets the policy turn forward by small steps and refuses to let it lunge. The min() keeps the ratchet one-sided; K-epoch reuse gives you sample efficiency you could never get from vanilla policy gradients; promoting π_θ_old between iterations keeps the fence pitched around the right policy. The algorithm fits on a screen. The implementation details — normalization, value clipping, grad clipping, GAE — account for half its practical performance, so never trust a PPO result without the code. This is the workhorse that gave you Dota-playing bots in 2019 and ChatGPT in 2022.
Next up — MoE Fundamentals. So far every model in this curriculum has activated all of its parameters on every token. Scaling means making that single stack of activations bigger and bigger, and eventually the FLOPs bill catches up with you. Mixture of Experts changes the deal: grow the parameter count without growing the FLOPs per token, by routing each token through only a small subset of specialists. The next section starts with why sparse activation is the next axis of scale — and why the router that picks which experts fire is suddenly the hardest part of the network to train.
- [01] Schulman, Wolski, Dhariwal, Radford, Klimov. "Proximal Policy Optimization Algorithms." arXiv 2017 — the original PPO paper.
- [02] Schulman, Levine, Abbeel, Jordan, Moritz. "Trust Region Policy Optimization." ICML 2015 — the TRPO paper PPO replaced.
- [03] Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry. "Implementation Matters in Deep RL: A Case Study on PPO and TRPO." ICLR 2020 — why code-level details dominate.
- [04] Schulman, Moritz, Levine, Jordan, Abbeel. "High-Dimensional Continuous Control Using Generalized Advantage Estimation." ICLR 2016 — the GAE paper used inside PPO.
- [05] Huang, Dossa, Raffin, Kanervisto, Wang. "The 37 Implementation Details of Proximal Policy Optimization." ICLR Blog Track 2022 — the practical companion to Engstrom 2020.