Proximal Policy Optimization
The stable RL algorithm behind RLHF.
Vanilla policy gradients have one unforgivable failure mode: the untrusted jump. You sample a batch of trajectories, you run gradient descent on the log-likelihood objective, and one unlucky step lands you somewhere the batch never visited. If a single advantage estimate is unusually large, or a single ratio unusually far from 1, the new policy is so different from the old one that your data turns into noise the moment you finish the step. You don't recover. The policy lies on the floor collecting zero reward, and the run is over.
The fix is a trust region — a fence around the old policy that says “step inside me, not outside.” In 2015, Schulman's Trust-Region Policy Optimization (TRPO) enforced the fence with math: every update solved a constrained problem keeping the new policy within a tiny KL distance of the old one. Beautiful — a second-order method with conjugate gradient and Fisher information matrices. It also takes 500 lines of code, runs slow, and nobody wants to maintain it.
In 2017, the same lab shipped Proximal Policy Optimization (PPO). The pitch: throw away the KL constraint. Replace it with a single clip on the importance ratio — a trust-region ratchet that only lets the policy turn forward by small steps. Try to climb past 1 + ε and the gradient flatlines. Try to fall past 1 − ε and the same thing happens on the other side. No catastrophic jumps. ~10× simpler code. Same empirical performance as TRPO, sometimes better. PPO became the default RL algorithm for continuous control, game playing, and — five years later — the backbone of RLHF. You saw it in the previous section dressed up for language models. Here we meet it naked.
I am a single line of loss. I do not solve optimization problems inside my forward pass. I do not need Fisher information. I clip the ratio, take the pessimistic minimum, and call .backward(). My trust region is a ratchet — one tooth of progress per step, no slipping back, no lunging forward. I am not elegant. I am pragmatic. That is why I won.

Before the objective, meet the quantity the ratchet grips: the importance ratio at time t.
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

π_θ_old is the policy that collected the data. π_θ is what you're currently optimizing. The ratio is the distance-meter: how much more (or less) likely the current policy is to take the same action the old policy took. r = 1 means the two agree; you haven't moved. r = 1.5 means the current policy is 50% more likely to take this action — you've walked half a step away from the data-collection distribution. r = 0.5 is the mirror: you're now half as likely to take what used to be a common action. The ratchet lives on this one number.
Why a ratio at all? Because PPO reuses each batch of trajectories for several gradient steps (that's the sample-efficiency trick). After the first step the current policy is no longer the one that collected the data — you're doing off-policy correction, and the ratio is the importance-sampling weight that keeps the math honest. Without a fence, those reuse steps would drift. With a fence, they can't.
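In code, the ratio comes from stored log-probabilities rather than raw probabilities — subtract in log space, then exponentiate, which is numerically safer. A sketch with made-up numbers:

```python
import math

# Hypothetical log-probs of the same action under the rollout-time policy
# and the policy after a few gradient steps on the batch.
logp_old = math.log(0.20)   # π_θ_old(a|s) = 0.20
logp_new = math.log(0.30)   # π_θ(a|s)     = 0.30

# Subtract in log space, then exponentiate — safer than dividing tiny probabilities.
ratio = math.exp(logp_new - logp_old)
print(ratio)   # ≈ 1.5: the current policy is 50% more likely to take this action
```

This is exactly the form the ratio takes in every implementation later in this section.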
L^CLIP(θ) = E_t [ min( r_t(θ) · A_t , clip(r_t(θ), 1−ε, 1+ε) · A_t ) ] with ε ≈ 0.2 (standard)
This is the ratchet, spelled out. Two moving parts. r_t(θ) we just defined. A_t is the advantage — positive when the action did better than expected, negative when it did worse. Everything else is the fence.
- If A_t > 0 (good action), the objective wants r_t · A_t big — raise π_θ(a_t|s_t). But the clip caps r_t at 1 + ε. Beyond that boundary the objective goes flat. You can walk closer to the good action; you cannot sprint there. One tooth of forward progress per step.
- If A_t < 0 (bad action), the objective still wants r_t · A_t big (i.e. less negative) — lower π_θ(a_t|s_t). The clip floors r_t at 1 − ε. You can suppress the bad action, but not drive its probability to zero in one step. Same ratchet, other direction.
- The min(...) is the pessimistic minimum. It keeps the trust region one-sided: the clip only takes effect when the unclipped ratio would have given a more optimistic objective than the clipped one. Without the min, positive advantages with r > 1 + ε would still push the policy further; with it, the gradient dies right at the boundary. Fence intact, on both sides.
Picture the shape of L^CLIP as a function of r_t alone, with A_t held fixed. Flip the sign of the advantage and the hinges at 1 − ε and 1 + ε swap sides. Those hinges are the ratchet's teeth.
This is what happens at the boundary. To the left of 1 − ε and to the right of 1 + ε, the clipped term is flat — zero gradient — so the objective stops pulling. The min() makes the stop one-sided: clipping only bites when it protects us. Push the ratio toward the “wrong” side of the trust region and the gradient goes to zero at the boundary; the policy is not allowed to keep walking. That is the proximal in the name — we stay proximate to the old policy, enforced not by a Lagrangian constraint but by a one-line clamp. The fence is cheap. The ratchet is cheap. That's the trick.
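You can watch the gradient die at the fence with a few lines of autograd. The ratio and advantage values below are hypothetical probes, not outputs of a real rollout:

```python
import torch

EPSILON = 0.2

def clip_surrogate(ratio, advantage):
    # Per-sample L^CLIP: pessimistic min of the unclipped and clipped terms.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - EPSILON, 1 + EPSILON) * advantage
    return torch.min(unclipped, clipped)

# Probe dL^CLIP/dr at a few (ratio, advantage) points.
for r0, A in [(1.0, 1.0), (1.5, 1.0), (0.9, -1.0), (0.5, -1.0), (1.5, -1.0)]:
    ratio = torch.tensor(r0, requires_grad=True)
    clip_surrogate(ratio, torch.tensor(A)).backward()
    print(f"r={r0:.1f}  A={A:+.0f}  dL/dr={ratio.grad.item():+.1f}")
    # Inside the fence the gradient is ±1; past the fence on the clipped side
    # it is exactly 0 — the objective has gone flat.
```

The last probe (r = 1.5, A = −1) shows the one-sidedness: there the unclipped term is the min, so the gradient survives and gradient ascent pushes the ratio back toward 1 instead of letting it drift further out.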
I am the fence at the edge of the trust region. I do nothing when the ratio is near 1. The moment the policy tries to step more than ε away in the wrong direction, I zero the gradient and end the conversation. I have no theoretical guarantees as strong as the KL constraint I replaced. I happen to work. TRPO spent 500 lines on what I do in one.

PPO's other trick is multi-epoch reuse. Each batch of trajectories gets replayed through the update K times (typically 3 to 10). This is sample efficiency, and it's only safe because the clip keeps the ratio inside the trust region across all K inner steps. Without the ratchet, by epoch 3 you'd be taking gradient steps on a distribution that no longer resembles anything in the batch — the exact untrusted jump we opened with, dressed in a loop.
for iteration = 1, 2, ...
    # 1. collect a batch with the current policy
    trajectories ← rollout(π_θ_old)               # N timesteps across parallel envs

    # 2. compute advantages once (outside the inner loop!)
    A_t ← GAE(rewards, values, λ=0.95, γ=0.99)

    # 3. K epochs of optimization on the same batch
    for epoch = 1, ..., K
        for minibatch in shuffle(trajectories)
            r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
            L_policy  = −E[ min( r · A, clip(r, 1−ε, 1+ε) · A ) ]
            L_value   = E[ (V_θ(s_t) − R_t)² ]    # optionally clipped too
            L_entropy = −E[ H(π_θ(·|s_t)) ]       # exploration bonus
            loss = L_policy + c_v · L_value + c_e · L_entropy
            ∇loss → optimizer step

    # 4. promote the current policy to "old"
    π_θ_old ← π_θ

Four pieces. Rollout collects the data. GAE (Generalized Advantage Estimation, Schulman 2016) builds the advantage estimates with a bias-variance knob λ — 0.95 is the canonical value. The inner K-epoch loop is where the policy actually learns, each step limited by the ratchet. Then — and this is the part people forget — we replace π_θ_old with the current weights before the next rollout. That swap is what keeps the trust region meaningful: π_θ_old is always the policy that collected the most recent data, so the fence is always pitched around where the batch actually came from.
The 2017 paper's headline: across MuJoCo continuous-control benchmarks, PPO matches or beats TRPO on final return while being an order of magnitude cheaper to implement and wall-clock faster to train. A cartoon of the comparison:
TRPO (2015):
    max_θ E[ r_t(θ) · A_t ]  subject to  KL(π_θ ‖ π_old) ≤ δ
    - solve with conjugate gradient + Fisher-vector products
    - backtracking line search to satisfy the constraint
    - exact trust region, hard KL bound

PPO (2017):
    max_θ E[ min( r_t(θ) · A_t , clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]
    - plain SGD / Adam, no second-order anything
    - clipping makes the KL soft but bounded in practice
    - one-line objective, ~100 LoC end-to-end
PPO keeps most of TRPO's stability while dropping the machinery. The clip does a first-order approximation of "stay inside a trust region" without computing Hessians. You pay a small price in worst-case KL control; you gain a 3× speedup and a massively simpler codebase.
Even that side-by-side understates it. TRPO requires Fisher information matrices, conjugate gradient, backtracking line search to stay inside the trust region — each of which is its own research paper. PPO replaces all of it with a clamp. This is why PPO, not TRPO, became the default RL algorithm for roughly five years (2017–2021), powered OpenAI Five (Dota 2), the imitation-bootstrapped phase of AlphaStar, and — to bring this full circle — InstructGPT's RLHF. The previous section was PPO aimed at language generation; this section is PPO aimed at classical control. Same ratchet. Different state and action spaces.
I measure the distance between the policy that collected this data and the policy you're currently optimizing. When I am 1, the update is on-policy and the math is trivial. When I drift toward the boundary, the clip catches me. I am the reason you can reuse a batch for 10 epochs and still trust the gradient.
Three layers, as always. Pure-Python PPO on CartPole to see the whole algorithm in 60 lines. NumPy with explicit GAE. Full PyTorch with clipping, value loss, entropy bonus, and multi-epoch training — the version you'd actually deploy.
import gymnasium as gym
import numpy as np
import math

# Tiny linear-softmax policy, plain Python (no autograd) — just to see the skeleton.
# In practice you'd use PyTorch; this is to make the algorithm readable.
env = gym.make("CartPole-v1")
STATE_DIM, N_ACTIONS = 4, 2
EPSILON, GAMMA, LR = 0.2, 0.99, 0.01

# Policy parameters — a single linear layer + softmax, small enough to update by hand.
W = np.random.randn(STATE_DIM, N_ACTIONS) * 0.01

def softmax_probs(s, W):
    logits = s @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def logprob(s, a, W):
    return math.log(softmax_probs(s, W)[a] + 1e-12)

for iteration in range(100):
    # 1. Rollout: collect one episode under π_old (= current W snapshot)
    W_old = W.copy()
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        probs = softmax_probs(s, W_old)
        a = np.random.choice(N_ACTIONS, p=probs)
        s2, r, term, trunc, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
        done = term or trunc

    # 2. Returns-to-go as a naive advantage proxy (full GAE comes in layer 2)
    returns = np.zeros(len(rewards))
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + GAMMA * G
        returns[t] = G
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)

    # 3. K=4 epochs of PPO-clip updates on this batch
    for _ in range(4):
        for s, a, A in zip(states, actions, advantages):
            lp_new = logprob(s, a, W)
            lp_old = logprob(s, a, W_old)
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1 - EPSILON, min(ratio, 1 + EPSILON))
            # Gradient of min(r·A, clip(r)·A) w.r.t. W, via the scalar rule:
            # if the clip is inactive, the surrogate gradient is A · r · ∇log π(a|s)
            # (we drop the r factor, which is ≈1 inside the clip region);
            # if the clip binds, the gradient is exactly 0.
            use_unclipped = (ratio * A) <= (clipped * A)
            if use_unclipped:
                probs = softmax_probs(s, W)
                grad_logpi = -np.outer(s, probs)   # ∇_W log π(a|s): −s·pᵀ everywhere...
                grad_logpi[:, a] += s              # ...plus s on the taken action's column
                W += LR * A * grad_logpi           # gradient ascent on the surrogate

    if iteration % 10 == 0:
        print(f"iter {iteration:02d} mean_return={sum(rewards):.1f}")

iter 00 mean_return=18.2
iter 10 mean_return=41.5
iter 30 mean_return=122.7
iter 60 mean_return=194.3
iter 90 mean_return=200.0
That's the algorithm end-to-end: collect, compute advantages, K epochs of clipped updates, promote π_old, repeat. The ratchet is the single line clipped = max(1 − ε, min(ratio, 1 + ε)). Now vectorize the advantage computation with proper GAE and stop computing gradients by hand.
import numpy as np

GAMMA, LAMBDA = 0.99, 0.95

def compute_gae(rewards, values, dones, last_value):
    """
    GAE — Schulman 2016. Trades bias for variance via λ.
      δ_t = r_t + γ V(s_{t+1}) − V(s_t)                     # one-step TD error
      A_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...        # exponentially weighted sum
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + GAMMA * next_value * next_nonterminal - values[t]
        gae = delta + GAMMA * LAMBDA * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values   # used as value-function target
    return advantages, returns

# Usage inside a PPO rollout:
#   rewards: (T,)  values: (T,)  dones: (T,)  last_value: scalar (bootstrap for truncated traj)
#   advantages, returns = compute_gae(rewards, values, dones, last_value)
#   advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize

- returns = cumsum of discounted rewards ←→ GAE with λ=0.95 on top of a value baseline — lower variance, controlled bias — the canonical advantage
- per-episode loop ←→ (T,)-shaped arrays, one GAE call — GAE is a reverse scan; naturally vectorizable
- advantages from returns alone ←→ advantages = GAE(r, V, dones, V_last) — subtract the value baseline for variance reduction
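One way to trust a GAE implementation before wiring it into PPO: check its two limiting cases. At λ = 0 the advantage collapses to the one-step TD error; at λ = 1 the sum telescopes to the discounted return-to-go minus the value baseline. A standalone check, simplified to a single non-terminating trajectory (so no dones handling):

```python
import numpy as np

def gae(rewards, values, last_value, gamma, lam):
    # Minimal GAE reverse scan for one non-terminated trajectory.
    T = len(rewards)
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        next_v = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v - values[t]   # one-step TD error
        acc = delta + gamma * lam * acc
        adv[t] = acc
    return adv

rng = np.random.default_rng(0)
r, v = rng.normal(size=5), rng.normal(size=5)
last_v, gamma = 0.3, 0.99

# λ = 0 collapses to the one-step TD error ...
deltas = r + gamma * np.append(v[1:], last_v) - v
assert np.allclose(gae(r, v, last_v, gamma, lam=0.0), deltas)

# ... and λ = 1 telescopes to (discounted return-to-go) − (value baseline).
G, returns = last_v, np.zeros(5)
for t in reversed(range(5)):
    G = r[t] + gamma * G
    returns[t] = G
assert np.allclose(gae(r, v, last_v, gamma, lam=1.0), returns - v)
```

Everything between those endpoints is what λ = 0.95 buys you: mostly-return-like advantages with the variance of something much closer to a TD error.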
And the real thing. PyTorch, with the value head, entropy bonus, minibatch shuffling, and K-epoch loop. This is within shouting distance of the reference implementation you'd find in stable-baselines3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_actions)   # policy logits
        self.v = nn.Linear(64, 1)            # value head

    def forward(self, obs):
        h = self.shared(obs)
        return self.pi(h), self.v(h).squeeze(-1)

EPSILON, VF_COEF, ENT_COEF = 0.2, 0.5, 0.01
K_EPOCHS, MINIBATCH_SIZE = 4, 64

policy = ActorCritic(obs_dim=8, n_actions=4)   # e.g. LunarLander
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def ppo_update(obs, actions, old_logprobs, advantages, returns):
    # Normalize advantages at the batch level (canonical PPO trick — reduces variance).
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    for _ in range(K_EPOCHS):
        # Shuffle indices into minibatches — DO NOT recompute advantages here.
        idx = torch.randperm(len(obs))
        for start in range(0, len(obs), MINIBATCH_SIZE):
            mb = idx[start : start + MINIBATCH_SIZE]
            logits, values = policy(obs[mb])
            dist = torch.distributions.Categorical(logits=logits)
            new_logprobs = dist.log_prob(actions[mb])
            entropy = dist.entropy().mean()

            # PPO-Clip objective
            ratio = torch.exp(new_logprobs - old_logprobs[mb])
            surr1 = ratio * advantages[mb]
            surr2 = torch.clamp(ratio, 1 - EPSILON, 1 + EPSILON) * advantages[mb]
            loss_pi = -torch.min(surr1, surr2).mean()

            # Value loss — MSE to the computed returns
            loss_v = F.mse_loss(values, returns[mb])

            # Total loss — policy + weighted value − entropy bonus (minus because we maximize entropy)
            loss = loss_pi + VF_COEF * loss_v - ENT_COEF * entropy

            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(policy.parameters(), 0.5)   # also standard
            optimizer.step()

update 001 return=+28.4 kl=0.012 clipfrac=0.08 loss_pi=-0.023 loss_v=0.41
update 020 return=+104.2 kl=0.024 clipfrac=0.15 loss_pi=-0.041 loss_v=0.28
update 050 return=+248.7 kl=0.031 clipfrac=0.22 loss_pi=-0.048 loss_v=0.17
update 100 return=+487.1 kl=0.028 clipfrac=0.19 loss_pi=-0.039 loss_v=0.09
- hand-coded softmax and log-prob ←→ torch.distributions.Categorical(logits=...) — autograd tracks log_prob and entropy natively
- W += LR · A · grad_logpi ←→ loss.backward(); optimizer.step() — Adam handles the update; we just define the loss
- per-episode loop ←→ minibatch shuffle, K epochs, clip_grad_norm — canonical PPO scaffolding: GAE once, K updates, promote
- policy only ←→ policy + value head + entropy bonus — shared backbone; value stabilizes, entropy prevents collapse
Computing advantages inside the K-epoch loop: the classic bug. Advantages must be computed once, before the epoch loop, using the value function that was alive when the data was collected. Recompute them every epoch with the updated value function and you're chasing your own tail — the learning signal becomes incoherent, training diverges.
Not clipping the value function: reference PPO clips both the policy ratio and the value update: v_clipped = v_old + clip(v_new − v_old, −ε, +ε), then L_v = max(MSE(v_new, R), MSE(v_clipped, R)). The same trust-region ratchet, applied to the critic. Without it the value head can diverge under multi-epoch reuse, which wrecks the advantages, which wrecks the policy.
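A sketch of that value clip, assuming `v_old` holds the value predictions captured at rollout time (the function name and arguments are illustrative, not from a specific library):

```python
import torch

EPSILON = 0.2

def clipped_value_loss(v_new, v_old, returns, eps=EPSILON):
    # v_old: value predictions frozen at rollout time, like old_logprobs.
    # The new prediction may not move more than eps away from v_old in one update.
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    loss_unclipped = (v_new - returns) ** 2
    loss_clipped = (v_clipped - returns) ** 2
    # Pessimistic max — the mirror of the policy loss's pessimistic min,
    # because this is a loss we minimize rather than an objective we maximize.
    return torch.max(loss_unclipped, loss_clipped).mean()
```

In the layer-3 code above, this would replace the plain F.mse_loss(values, returns[mb]) line, with the rollout-time values stored alongside the old log-probs.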
Too-large ε: at ε = 0.5 the trust region is so loose the clip barely triggers, and you're back to vanilla policy gradient with all its untrusted-jump instability. Stay at 0.1–0.3. 0.2 is standard and works almost everywhere.
Too many epochs per batch: after epoch K, the current policy is roughly K · ε steps from the data-collection policy. At K = 4 with ε = 0.2, that's fine. At K = 20 most samples are pinned at the clip boundary, the effective gradient is zero, and any that aren't are pushing the policy into territory the ratchet was never designed to handle. 4–10 epochs is the working range; 4 is the safe default.
Forgetting to normalize advantages: without per-batch normalization, advantage scale varies with reward scale, and PPO's effective step size varies with it too. Normalize to mean-zero unit-variance at the batch level. Tiny change in code, large change in stability.
Using the wrong old logprobs: π_θ_old must be the logprobs captured at rollout time, frozen. Recompute them under the current θ each epoch and every ratio becomes exactly 1 — the clip never triggers, the ratchet is disengaged, and you have silently reverted to a weird form of on-policy gradient ascent with zero safeguards.
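A toy demonstration of the failure, with made-up logits: freeze the log-prob at rollout time and the ratio tracks how far the policy has drifted; recompute it under the current parameters and the ratio is pinned at exactly 1.

```python
import torch

logits = torch.nn.Parameter(torch.tensor([1.0, 0.0]))
action = torch.tensor(0)

# Correct: capture the log-prob at rollout time and freeze it.
old_logprob = torch.distributions.Categorical(logits=logits).log_prob(action).detach()

# Simulate a gradient step so the current policy drifts from the rollout policy.
with torch.no_grad():
    logits += torch.tensor([0.5, -0.5])

new_dist = torch.distributions.Categorical(logits=logits)
ratio_correct = torch.exp(new_dist.log_prob(action) - old_logprob)

# The bug: recompute the "old" log-prob under the current parameters.
ratio_buggy = torch.exp(new_dist.log_prob(action) - new_dist.log_prob(action).detach())
print(ratio_correct.item(), ratio_buggy.item())
# ratio_buggy is exactly 1.0 no matter how far the policy has moved,
# so the clip can never trigger.
```

With every ratio identically 1, clip(1, 1−ε, 1+ε) = 1 and the "clipped" objective silently degenerates to an unfenced policy gradient.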
Using the layer-3 PyTorch scaffold above (or stable-baselines3 if you're short on time — same algorithm, more tested), train PPO on LunarLander-v2. Target: mean episodic return > 200 over the last 100 episodes. That's the official “solved” threshold.
Start with the standard hyperparameters: ε = 0.2, γ = 0.99, λ = 0.95, K = 4 epochs, lr = 3e-4, 2048 steps per rollout, minibatch 64. Log three curves per update: (a) mean episodic return, (b) approximate KL between π_θ and π_θ_old (should stay under 0.02 when the ratchet is behaving), (c) clip fraction — the proportion of samples that hit the boundary of the trust region. A healthy run has clip fraction in 0.1–0.3.
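Both diagnostics fall out of quantities the loss already computes — no extra forward pass needed. A sketch using the low-variance KL estimator (r − 1) − log r, which implementations such as stable-baselines3 report as approx_kl; the function name here is illustrative:

```python
import torch

EPSILON = 0.2

def ppo_diagnostics(new_logprobs, old_logprobs, eps=EPSILON):
    log_ratio = new_logprobs - old_logprobs
    ratio = torch.exp(log_ratio)
    # Low-variance KL estimator: E[(r − 1) − log r] ≥ 0, tight near r = 1.
    approx_kl = ((ratio - 1) - log_ratio).mean()
    # Clip fraction: share of samples sitting outside the fence.
    clipfrac = ((ratio - 1).abs() > eps).float().mean()
    return approx_kl.item(), clipfrac.item()
```

Call it once per minibatch inside the update loop and log the running means: approx KL drifting well past ~0.02, or clip fraction leaving the 0.1–0.3 band, are the early-warning signals described above.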
Bonus: rerun with ε = 1.0 (the ratchet is effectively removed). Plot return and KL on the same axes. You should see a fast climb, a catastrophic collapse once KL explodes past the trust region, and a run that never recovers. That single plot is the best case for PPO's existence you will ever make.
What to carry forward. PPO replaces TRPO's hard KL constraint with a cheap clip on the importance ratio — a trust-region ratchet that lets the policy turn forward by small steps and refuses to let it lunge. The min() keeps the ratchet one-sided; K-epoch reuse gives you sample efficiency you could never get from vanilla policy gradients; promoting π_θ_old between iterations keeps the fence pitched around the right policy. The algorithm fits on a screen. The implementation details — normalization, value clipping, grad clipping, GAE — account for half its practical performance, so never trust a PPO result without the code. This is the workhorse that gave you Dota-playing bots in 2019 and ChatGPT in 2022.
Next up — MoE Fundamentals. So far every model in this curriculum has activated all of its parameters on every token. Scaling means making that single stack of activations bigger and bigger, and eventually the FLOPs bill catches up with you. Mixture of Experts changes the deal: grow the parameter count without growing the FLOPs per token, by routing each token through only a small subset of specialists. The next section starts with why sparse activation is the next axis of scale — and why the router that picks which experts fire is suddenly the hardest part of the network to train.
- [01] Schulman, Wolski, Dhariwal, Radford, Klimov. "Proximal Policy Optimization Algorithms." arXiv 2017 — the original PPO paper.
- [02] Schulman, Levine, Abbeel, Jordan, Moritz. "Trust Region Policy Optimization." ICML 2015 — the TRPO paper PPO replaced.
- [03] Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry. "Implementation Matters in Deep RL: A Case Study on PPO and TRPO." ICLR 2020 — why code-level details dominate.
- [04] Schulman, Moritz, Levine, Jordan, Abbeel. "High-Dimensional Continuous Control Using Generalized Advantage Estimation." ICLR 2016 — the GAE paper used inside PPO.
- [05] Huang, Dossa, Raffin, Kanervisto, Wang. "The 37 Implementation Details of Proximal Policy Optimization." ICLR Blog Track 2022 — the practical companion to Engstrom 2020.