PPO for RLHF
Policy optimization against a learned reward model.
Picture a helium balloon. You release it indoors and it rises, gently, toward whatever the ceiling says is “up.” Now attach a string. The balloon still rises, but only as far as the string lets it. You can lengthen the string. You can shorten it. You cannot remove it — because the moment you do, the balloon drifts off into a heating vent and you never see it again.
That is the whole lesson. The balloon is a language model. Up is whatever smells good to the reward model. The string is the KL divergence back to the original SFT checkpoint — the tether that keeps the policy from floating into gibberish-land. Every hyperparameter knob in RLHF is ultimately a knob on that string.
The algorithm that does the lifting is PPO — Proximal Policy Optimization, a classic continuous-control RL method from 2017 that got adapted — with a few load-bearing hacks — into the default engine of RLHF. Every major chat model you used between roughly 2022 and 2024 was PPO-finetuned against a reward model. This lesson is about what the adaptation cost, why the string matters, and what happens when you cut it.
First, translate text generation into RL vocabulary. Same balloon, just with labels taped to it.
- Policy π_θ: the language model. Given a state, it outputs a distribution over next tokens. This is the balloon.
- State s_t: the prompt plus every token generated so far. It grows as you decode.
- Action a_t: the next token. The action space is the full vocabulary — 50k, 100k, 200k entries — so this is a high-dimensional discrete control problem.
- Reward r_t: the RM score at the end of generation, plus a per-token KL penalty pulling the balloon back toward the frozen reference on the other end of the string.
- Episode: one generation, from the prompt to EOS or a length cap.
The whole reward only materialises when the episode ends. The RM reads the finished completion and hands back a single scalar. No per-token feedback. That is a nasty credit-assignment problem — which token earned the praise? — and the value head (coming up) is what papers over it.
Here is the per-token reward used in practice, written the way you will see it in every RLHF paper. Read it as two pieces glued together: a string and a destination.
r_t = −β · log( π_θ(a_t | s_t) / π_ref(a_t | s_t) ) + RM(prompt, completion) · 1[t = T]

The first term is the per-token KL penalty; the second is the sparse terminal reward. The right-hand term is the destination — the reward model's score, handed out exactly once, at the last token. That is what the balloon is rising toward. The left-hand term is the string. At every token, the model pays a tax proportional to how far it has drifted from what the reference (the frozen SFT checkpoint) would have said in the same spot. Drift further, pay more. The string pulls taut.
The length of the string is β. It typically sits in [0.01, 0.1]. Small β means a long string — the balloon is free to roam and chase reward. Large β means a short string — it stays glued to the SFT baseline and barely updates at all. You are going to slide that knob yourself in a minute.
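To put numbers on the tax, here is a minimal sketch (toy probabilities, not from any real model) of the per-token penalty −β·log(π_θ/π_ref) at the two ends of that range:

```python
import math

def kl_tax(beta, p_theta, p_ref):
    """Per-token reward contribution -beta * log(pi_theta / pi_ref).
    Negative (a tax) whenever the policy up-weights a token past the reference."""
    return -beta * math.log(p_theta / p_ref)

# Toy numbers: the policy puts 20% mass on a token where the reference put 10%.
long_string  = kl_tax(0.01, 0.20, 0.10)   # barely a tug
short_string = kl_tax(0.10, 0.20, 0.10)   # ten times the pull
print(round(long_string, 4), round(short_string, 4))   # -0.0069 -0.0693
```

Summed over a few hundred tokens of consistent drift, the short-string tax can swamp a terminal RM score of a few points — which is exactly why large β pins the policy to the floor.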
The RLHF loop end-to-end. Prompt goes in. Policy generates a completion. The reward model scores the completion. Compute advantages per token (RM score minus value baseline, minus KL). PPO update on the policy and value head. Loop.
Watch the loop indicator. Something oddly recursive is happening: the RM produces scalar rewards; PPO treats them as ground truth; the balloon rises toward whatever the RM is highest on. Whether the balloon actually got better depends entirely on whether the RM is a good proxy for human preference. If the RM is wrong in a systematic way, PPO finds that wrongness and floats directly into it. We call that reward hacking, and it is the whole reason the tether exists.
Concretely, what does a balloon with no string look like? The model learns to write prose that is maximally RM-flattering and minimally readable: paragraph-long restatements of the question, piled-up buzzwords the RM has a weakness for, confident-sounding filler that no human would ever produce. Reward goes to the moon. Quality falls off a cliff. The balloon has floated into the ceiling vent.
I am the string. Without me the balloon floats off into reward-hacking nonsense — buzzword salads, dodged questions, outputs the reward model happens to like but no human does. I keep you close to the SFT checkpoint, where the text still sounds like text. Tune me carefully: too long and your model turns into a slot machine, too short and it never leaves the floor.
The PPO update itself. You take a batch of completions, compute each token's advantage A_t (how much better it did than expected), and then take a gradient step on this clipped objective.
L^CLIP(θ) = E_t [ min( r_t(θ) · A_t , clip(r_t(θ), 1−ε, 1+ε) · A_t ) ] with r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) and ε ≈ 0.2
Three things happening. r_t(θ) is the probability ratio: how much more (or less) likely the new policy makes the action compared to the policy that collected the data. A_t is the advantage — positive means “this action was better than baseline, do more of it”; negative means the opposite. And the clip is the “proximal” in Proximal Policy Optimization: it forbids ratios outside [1−ε, 1+ε], so no single update can yank the balloon more than ~20% away from where it started.
The clip is why PPO is stable. Pure policy gradient has a known failure mode: one huge advantage on one unusual trajectory can take a catastrophic step and shred the policy. Clipping puts a ceiling on how much any one token can move you. Crude. It works. (If you want the full derivation of why the ratio form is natural and why clipping beats a hard KL constraint, the proximal-policy-optimization lesson in the RL chapter is where it lives — this lesson is the RLHF-specific application.)
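The asymmetry is worth seeing with concrete numbers. A minimal standalone sketch of the per-token objective (helper name is mine, not from any library):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A) — the per-token PPO-Clip objective."""
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * advantage, clipped * advantage)

# Good action whose ratio has already been pushed 50% up: credit capped at 1+eps.
print(ppo_clip_term(1.5, +1.0))   # 1.2
# Bad action at the same ratio: the penalty is felt in full, no cap.
print(ppo_clip_term(1.5, -1.0))   # -1.5
# Shrinking a good action is also felt in full.
print(ppo_clip_term(0.5, +1.0))   # 0.5
```

The min() makes the objective pessimistic: clipping only ever limits how much credit an update can claim, never how much blame.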
Slide the β coefficient below — you are literally adjusting how long the string is. Watch two curves: the reward, which tells you how high the balloon has risen, and the KL-to-SFT, which tells you how far it has drifted from the SFT reference. There is no free lunch: you buy reward with KL, and if you buy too much, the text becomes unreadable.
At β = 0: no string. Reward rockets up, KL explodes, and a human reading the outputs says “this is gibberish with high-RM-score keywords.” At large β: string so short the balloon cannot get off the floor — reward barely moves, the policy cannot drift anywhere without paying a prohibitive KL tax, and you have effectively done no RL at all. The sweet spot is roughly where the reward curve is still rising but KL has flattened into a steady band. That is the configuration papers like InstructGPT report.
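You can see the shape of that trade-off in a deliberately cartoonish model: suppose drifting a distance d from SFT buys proxy reward √d (diminishing returns) and costs KL tax β·d. Everything here is invented for illustration; only the qualitative knob behavior carries over.

```python
import math

def best_drift(beta, grid=None):
    """Grid-search the drift d that maximises sqrt(d) - beta*d (toy objective)."""
    grid = grid or [i * 0.5 for i in range(1, 20001)]   # d in (0, 10000]
    return max(grid, key=lambda d: math.sqrt(d) - beta * d)

for beta in (0.0, 0.01, 0.05, 0.1):
    print(f"beta={beta}: optimal drift = {best_drift(beta)}")
# beta=0.0 pins at the grid edge (no string: the balloon takes everything you give it);
# 0.01 -> 2500.0, 0.05 -> 100.0, 0.1 -> 25.0
```

In this toy the optimum is d* = 1/(4β²), so halving β quadruples the drift the policy will settle at: string length is a strongly nonlinear knob.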
I'm a small scalar regression head bolted onto the policy. My one job is to predict the expected return from every state. Subtract my prediction from the actual return and you get the advantage — the signed “surprise” that tells PPO which actions to reinforce. I'm trained jointly with the policy on a simple MSE loss against the observed returns. I add almost nothing to the parameter count myself; the real memory tax of RLHF comes from the frozen reference and reward models you keep loaded beside me.
Three layers, same progression used everywhere else in this series. A pure-Python PPO update on a toy 2-action bandit so you can see the clip in isolation. A PyTorch skeleton of the full RLHF loop with the four models wired up. And the real-world version — HuggingFace TRL's PPOTrainer running on an actual LM. Same algorithm, three rungs of abstraction.
import math

# Toy 2-action "environment": action 0 has true reward 1, action 1 has reward 0.
# Our "policy" is a single logit; the probability of action 0 is sigmoid(logit).
# Start at logit=0 => 50/50. We collected data under logit=0 (the "old policy").
EPSILON = 0.2   # PPO clip range
LR = 0.17       # toy step size, chosen so the ratio approaches the clip gradually

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_action0(logit):
    return sigmoid(logit)

# Data: one trajectory, we took action 0 and got reward 1.
# Advantage is (reward - baseline); baseline = 0 for this toy. So A = +1.
action, advantage = 0, 1.0
logit_old = 0.0                    # policy that collected the data
pi_old = prob_action0(logit_old)   # = 0.5

logit = logit_old
for step in range(5):
    pi = prob_action0(logit)
    ratio = pi / pi_old            # r_t(θ) = π_θ(a) / π_old(a)
    clipped = max(1 - EPSILON, min(ratio, 1 + EPSILON))
    # PPO-Clip objective (take the MIN to be pessimistic about the advantage)
    loss_term = min(ratio * advantage, clipped * advantage)
    print(f"step {step:02d} advantage={advantage:+.2f} "
          f"ratio={ratio:.3f} clipped={clipped:.3f} loss={-loss_term:.4f}")
    # "Gradient ascent" on the logit (positive advantage on action 0 => raise it).
    # Once the ratio passes 1+ε the clipped objective's gradient is zero, so we
    # stop moving — exactly as PPO intends.
    if ratio < 1 + EPSILON:
        logit += LR * advantage    # toy update rule

step 00 advantage=+1.00 ratio=1.000 clipped=1.000 loss=-1.0000
step 01 advantage=+1.00 ratio=1.085 clipped=1.085 loss=-1.0848
step 02 advantage=+1.00 ratio=1.168 clipped=1.168 loss=-1.1684
step 03 advantage=+1.00 ratio=1.250 clipped=1.200 loss=-1.2000   # hit the clip
step 04 advantage=+1.00 ratio=1.250 clipped=1.200 loss=-1.2000   # pinned at 1+ε
That is PPO-Clip stripped of everything else: one logit, one reward, one ratio, one clip. Now scale it up. A real RLHF loop holds four models, samples trajectories, computes a value baseline, subtracts the KL (the string), and does a batched policy update.
import torch
import torch.nn.functional as F

# The four models (pretend these are all loaded LMs).
policy       = load_lm(sft_checkpoint, trainable=True)    # π_θ
ref_model    = load_lm(sft_checkpoint, trainable=False)   # π_ref (frozen SFT copy)
reward_model = load_rm(rm_checkpoint, trainable=False)    # RM (frozen)
value_head   = torch.nn.Linear(policy.d_model, 1)         # small scalar head

optimizer = torch.optim.AdamW(
    list(policy.parameters()) + list(value_head.parameters()), lr=1e-6
)
BETA, EPSILON, PPO_EPOCHS = 0.05, 0.2, 4

for prompts in prompt_loader:                                      # 1. sample prompts
    with torch.no_grad():
        completions, old_logprobs = policy.generate(prompts, return_logprobs=True)
        ref_logprobs = ref_model.logprobs(prompts, completions)
        rm_scores = reward_model.score(prompts, completions)       # [B]
        values = value_head(policy.hidden(prompts, completions)).squeeze(-1)  # [B, T]

    # 2. per-token rewards: RM at the end, KL everywhere
    kl_per_tok = old_logprobs - ref_logprobs   # log π_θ_old − log π_ref
    rewards = -BETA * kl_per_tok               # shape [B, T]
    rewards[:, -1] += rm_scores                # add terminal RM score

    # 3. compute advantages (GAE in real code, simplified here)
    returns = torch.cumsum(rewards.flip(-1), -1).flip(-1)   # naïve return-to-go
    advantages = returns - values.detach()

    # 4. PPO update — multiple epochs on the same batch
    for _ in range(PPO_EPOCHS):
        new_logprobs = policy.logprobs(prompts, completions)   # fresh under π_θ
        ratio = (new_logprobs - old_logprobs).exp()
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - EPSILON, 1 + EPSILON) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()    # PPO-Clip

        new_values = value_head(policy.hidden(prompts, completions)).squeeze(-1)
        value_loss = F.mse_loss(new_values, returns)           # baseline fitting

        loss = policy_loss + 0.5 * value_loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()

From the toy to the full loop, piece by piece:

- one scalar logit ←→ full LM with per-token logprobs — the "policy" is now a transformer over a vocabulary
- reward = 1.0 (toy) ←→ rm_scores + per-token −β·KL — RM only at the terminal token, KL at every token: string + destination
- advantage = reward ←→ advantages = returns − value_head(states) — subtract a learned baseline to reduce variance
- one ratio, one clip ←→ per-token ratio, clipped, min() of both — same PPO-Clip objective, batched over (B, T)
Nobody writes this loop from scratch in practice. You use HuggingFace's trl library, which wraps the whole thing. Here is roughly what calling it looks like — under 50 lines against a real LM.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("sft-checkpoint")
policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-checkpoint")  # (1)+(2)
ref    = AutoModelForCausalLMWithValueHead.from_pretrained("sft-checkpoint")  # (3) frozen
rm     = AutoModelForSequenceClassification.from_pretrained("reward-model")   # (4) frozen

config = PPOConfig(
    learning_rate=1e-6,
    batch_size=64,
    mini_batch_size=8,
    ppo_epochs=4,
    init_kl_coef=0.05,   # β — the KL coefficient from the math
    target_kl=6.0,       # target KL; the coefficient auto-adjusts if it drifts
    cliprange=0.2,       # ε — the PPO clip
)
trainer = PPOTrainer(config, policy, ref, tokenizer)

dataset = load_dataset("instruction-prompts", split="train")
for step, batch in enumerate(dataset.iter(batch_size=config.batch_size)):
    query_tensors = [tokenizer(p, return_tensors="pt").input_ids for p in batch["prompt"]]
    response_tensors = trainer.generate(query_tensors, max_new_tokens=128)
    texts = [tokenizer.decode(r[0]) for r in response_tensors]
    rewards = [rm(**tokenizer(t, return_tensors="pt")).logits[0] for t in texts]
    stats = trainer.step(query_tensors, response_tensors, rewards)   # PPO update
    print(f"step {step:03d} reward={sum(rewards)/len(rewards):+.2f} "
          f"kl={stats['objective/kl']:.3f}")

step 000 reward=+0.12 kl=0.015 policy_loss=-0.041 value_loss=0.28
step 020 reward=+0.58 kl=0.082 policy_loss=-0.067 value_loss=0.21
step 060 reward=+1.24 kl=0.190 policy_loss=-0.055 value_loss=0.17
step 100 reward=+1.71 kl=0.240 policy_loss=-0.048 value_loss=0.15
- manual 4-model setup + loop ←→ PPOTrainer(policy, ref, tokenizer) — trl hides the model-juggling; you hand it the four pieces
- BETA, EPSILON, PPO_EPOCHS ←→ PPOConfig(init_kl_coef, cliprange, ppo_epochs) — same hyperparameters, different names; TRL uses adaptive β (the string auto-tightens)
- manual advantage + value loss ←→ trainer.step(queries, responses, rewards) — GAE, clipping, and value fitting all live inside .step()
Time to stare directly at the failure mode the string exists to prevent. Reward hacking is what happens when the balloon floats toward the reward model's idea of up and the reward model's idea of up is subtly wrong. Concretely, the policy learns:
- Repetition and buzzwords — if the RM was trained on preference data where long, confident-sounding answers scored well, the policy will float directly into paragraph-long keyword soup.
- Sycophancy — if the RM learned that human raters prefer answers that agree with the premise of the question, the policy starts agreeing with everything, including premises that are false.
- Hedge spam — if the RM rewards caveats, the balloon discovers it can stack caveats until the answer contains no information at all.
- Exploit-tokens — bizarre sequences of punctuation or whitespace that the RM was never trained on but happens to score highly, purely through a quirk of its learned score surface.
The tether is what stops all of this. Every one of those failure modes requires drifting far from the SFT distribution — and every token of drift costs you KL, which costs you reward. The longer you make the string, the further the balloon can float into RM-land. The whole craft of RLHF is finding the string length at which the balloon rises high enough to be useful but not so high that it vanishes into reward-hacking territory. There is no closed-form answer. It is a knob you tune.
Using the base model as ref instead of the SFT model: the string is tied to the wrong post. KL is computed against a chatbot-illiterate base, so the balloon is tethered to pre-training instead of instruction-following. Always use the SFT checkpoint as π_ref.
Sign error on the KL penalty: if you add +β·KL to the reward instead of −β·KL, you are rewarding the policy for drifting from the reference. The string becomes a slingshot. Within a few hundred steps the model speaks a private language. Easy to hit, impossible to miss once you do.
Token-level vs sequence-level reward: the RM gives you one scalar per completion. If you accidentally broadcast it to every token position as a dense reward, you have inflated your advantage estimates by a factor of T. The canonical setup is RM at the terminal token only; KL distributed per token.
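The inflation is easy to demonstrate with plain numbers (toy values, no framework):

```python
T = 8                   # completion length
rm_score = 2.0          # the single scalar the RM returns
kl_tax = [-0.01] * T    # illustrative per-token KL penalties

correct = kl_tax.copy()
correct[-1] += rm_score                   # canonical: RM at the terminal token only
buggy = [k + rm_score for k in kl_tax]    # bug: scalar broadcast to every token

print(round(sum(correct), 2))   # 1.92  — one RM score enters the returns
print(round(sum(buggy), 2))     # 15.92 — T copies of it: advantages blow up by ~T
```

The returns computed from the buggy tensor carry (T − 1) extra copies of the RM score, so every advantage estimate downstream is inflated by roughly the completion length.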
Updating the reference model by accident: if π_ref shares parameters with π_θ (e.g. you forgot to clone, or forgot requires_grad=False), the string is tied to the balloon. KL is always zero, the tether is disabled, and the balloon floats off. Reward will climb, KL will stay suspiciously flat. Check by sampling a few logprobs from π_ref before and after a step — they must be identical.
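You can see the failure shape without any framework: below, a "model" is just a dict of logits, and the bug is aliasing the reference instead of copying it. A deliberately minimal stand-in, not real training code.

```python
import copy
import math

def logprob(model, token):
    """Log-probability of `token` under a softmax over the dict's logits."""
    z = sum(math.exp(v) for v in model.values())
    return model[token] - math.log(z)

policy = {"good": 0.0, "bad": 0.0}
ref_frozen = copy.deepcopy(policy)   # a real frozen copy
ref_aliased = policy                 # the bug: "forgot to clone"

policy["good"] += 1.0                # one "PPO update" on the policy

kl_ok  = logprob(policy, "good") - logprob(ref_frozen, "good")   # > 0: string taut
kl_bug = logprob(policy, "good") - logprob(ref_aliased, "good")  # exactly 0: string cut
print(round(kl_ok, 3), kl_bug)
```

The real-code version of the same check: sample a few logprobs from π_ref before and after an optimizer step; if any of them moved, your reference shares parameters with the policy.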
Too many PPO epochs per batch: the clipped objective is only valid for ratios close to 1. Run 20 epochs on the same batch and most tokens will be pinned at the clip, gradients will be zero, and any that aren't will be pushing the policy further from the data-collection policy than the clip was designed to handle. Canonical setting is 4 epochs. More than that is asking for catastrophic collapse.
Using HuggingFace TRL, run PPOTrainer against a small model (TinyLlama, Pythia-160m, or similar) for 100 steps on an instruction-tuning dataset. Use your SFT checkpoint as both the starting policy and the reference. Use a reward model you trained in the previous lesson.
Log two curves: (a) mean reward per batch, and (b) mean KL to reference per batch. Plot them on the same x-axis (step). You should see reward rise; KL rise more slowly; and, if init_kl_coef is tuned right, KL stabilise while reward keeps climbing. That flat KL band is the string holding.
Bonus: do the same run with init_kl_coef=0.0. Cut the string. Plot it on the same axes. Reward goes to the moon. Sample a completion at step 100 and read it. That is the balloon in the ceiling vent — reward hacking in its purest form, and the single-paragraph argument for why every RLHF pipeline ever shipped has a tether.
What to carry forward. PPO-for-RLHF is a balloon and a string. The balloon is the policy, floating toward whatever smells good to the reward model. The string is the KL penalty back to the frozen SFT checkpoint — long enough to let the balloon rise, short enough to pull it back before it drifts into gibberish-land. The clip is what keeps any single update from yanking the balloon too hard. The four-model tax is what you pay for having both. The payoff, despite the cost, is the chatbot era.
Next up — Proximal Policy Optimization. We jumped straight to the RLHF application here, because that is where PPO earned its fame. But PPO itself is a proper RL algorithm with a clean derivation that predates language models by five years. The proximal-policy-optimization lesson in the reinforcement-learning chapter walks you through the ratcheting trust-region argument from first principles — why the ratio, why the clip, why this particular shape and not another. Go there when you want the full machinery; come back here when you want to remember which knob is the string.
- [01] Schulman, Wolski, Dhariwal, Radford, Klimov · "Proximal Policy Optimization Algorithms" · arXiv 2017 — the original PPO paper
- [02] Ouyang et al. · "Training language models to follow instructions with human feedback" · NeurIPS 2022 — InstructGPT, the canonical RLHF recipe
- [03] Ziegler et al. · "Fine-Tuning Language Models from Human Preferences" · arXiv 2019 — the paper that introduced RLHF for LMs
- [04] HuggingFace TRL · library for RLHF / PPO / DPO on transformers
- [05] Stiennon et al. · "Learning to summarize from human feedback" · NeurIPS 2020 — RLHF applied to summarization; many of the later tricks originate here