Reward Modeling

Train a preference model from human pairwise comparisons.

Hard
~15 min read
·lesson 4 of 6

Imagine you're training a chef. You've sent them to cooking school — they can follow a recipe, plate a dish, not set the kitchen on fire. That's SFT. What it doesn't get you is a chef you'd want running your restaurant. Technically competent and actually good are different problems.

The obvious fix: taste every dish yourself. Score it out of ten. Send the chef back with the number. Do that a few million times and the chef learns what “good” means to you. One problem — you are one person, the chef cooks billions of meals a second, and you cannot taste that fast. Nobody can. The panel of reviewers is too slow.

So we cheat. We hire the taste tester — a second, smaller model whose entire job is to predict which of two dishes a human would prefer. Humans can't score every possible completion, but they can reliably tell you that this dish beats that dish. Collect enough of those pairwise verdicts, train the tester on them, and now the main model has a fast critic standing in for the slow panel. That critic is the reward model.

This lesson builds the taste tester. It sits between SFT and RL, it's the reason GPT-4 and Claude don't read like 2020-era language models, and its failure modes are subtle enough that most of the lesson is about them — because a bad tester will confidently declare the worst dishes to be the best.

Why pairs instead of scores? Because reviewers disagree on absolute numbers constantly. Hand three taste testers the same dish and ask for a rating out of ten, and you'll get a 7, a 4, and an 8 — same dish, same palate being measured, wildly different answers. But show those same three people two dishes and ask which is better, and they agree most of the time. The signal lives in the comparison, not the score. So the reward model never tries to regress to a number. It only ever learns gaps.

To turn that signal into a reward signal we need a scalar — one number the next stage can climb. The magic that turns “A beats B” into a scalar has a name, it's from 1952, and it was originally invented to rank chess players.

Here's the trick the taste tester uses. Each dish has a hidden “strength” score — you don't know it, you never will, but you can recover it up to a shift by watching enough head-to-head tastings. The model is called Bradley-Terry, and it says: if dish A has latent strength r_A and dish B has strength r_B, the probability a reviewer picks A is a sigmoid of the difference.

Bradley-Terry — preference as a sigmoid of reward gap
P(A ≻ B | prompt)   =   σ( r(prompt, A) − r(prompt, B) )

                    =    1
                      ───────────────────────────────
                       1 + exp( −(r_A − r_B) )

A big positive gap r_A − r_B means the tester is very confident A wins. Zero gap means 50/50 — a coin flip between two dishes that taste about the same. Negative gap means B wins. Same sigmoid you've seen in binary classification, just applied to a difference of two scores instead of one raw logit.
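The whole mechanism fits in a few lines. A minimal sketch of the Bradley-Terry preference probability — the reward values here are made up for illustration, not from any trained model:

```python
import math

def p_prefer(r_a: float, r_b: float) -> float:
    """Bradley-Terry: probability a reviewer picks A over B —
    a sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(p_prefer(2.0, 0.0))   # big positive gap → confident A wins
print(p_prefer(1.0, 1.0))   # zero gap → 0.5, a coin flip
print(p_prefer(0.0, 2.0))   # negative gap → B favored
```

Note the built-in symmetry: P(A ≻ B) and P(B ≻ A) always sum to 1, because flipping the pair flips the sign of the gap.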

From there the loss is ordinary cross-entropy. Every training example is a triple (prompt, chosen, rejected) — one dish a reviewer picked, one they didn't. We want the probability of that verdict to be high, so we minimize its negative log-likelihood:

reward-model loss — the whole training objective
ℒ(θ)  =  −𝔼_(x, y_w, y_l) ~ D  [  log σ( r_θ(x, y_w)  −  r_θ(x, y_l) )  ]

where
   x      =  prompt
   y_w    =  chosen  (winner)   — human preferred this
   y_l    =  rejected (loser)   — human rejected this
   r_θ    =  the reward model   — a scalar-valued neural net
   D      =  the preference dataset

Read that loss carefully, because it is not doing what you'd guess. It is not regressing toward a target reward. The tester has no idea what score the winning dish “deserves” — it only knows that whatever number it assigns the winner, it should be bigger than whatever it assigns the loser. The reward model learns a scale. The zero point floats. The unit floats. Only the gaps between dishes mean anything. This will come back and bite us later.
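You can verify the floating zero point directly: shift every reward by any constant and the loss does not move, because only gaps enter the objective. A small numpy check with toy reward values:

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    # -log sigmoid(gap), computed stably via logaddexp
    gap = r_chosen - r_rejected
    return float(np.mean(np.logaddexp(0.0, -gap)))

r_w = np.array([2.1, 0.8, -0.3])
r_l = np.array([0.9, -1.2, -0.5])

# identical loss after shifting the entire reward scale by +100:
print(bt_loss(r_w, r_l), bt_loss(r_w + 100.0, r_l + 100.0))
```

This is exactly the invariance that will bite later: two reward models can agree on every ranking while living on completely different scales.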

Here's what the training data actually looks like. One prompt, two responses, one reviewer's verdict. Picture a million of these — collected from contractors comparing completions from your current model against each other or against a reference.

preference pairs — label the winner, watch the RM loss
L = −log σ(r_chosen − r_rejected)
[interactive demo] prompt #1: How do I sort a list in Python? · prompt #2: Write a haiku about debugging. · prompt #3: Explain backpropagation in one sentence. Label each pair and watch the agreement and mean loss update.

Notice the labels are relative, never absolute. The reviewer isn't scoring either dish on a 1-10 scale. They're just pointing at one — “this one, over that one.” Three different panelists will give three wildly different 1-10 scores for the same response, but “left-or-right” stays stable across them. That stability is the entire reason Bradley-Terry beats direct regression, and the entire reason the taste tester exists at all.

Preference pair (personified)
I am one prompt, two dishes, and a reviewer who pointed at one of them. I am cheap to produce, noisy to interpret, and if you stack a hundred thousand of me, I can tell you what “helpful” tastes like — approximately, statistically, with all the cultural baggage of the panel that labeled me.
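In code, one of these pairs is just a record with three text fields. The strings below are a made-up example in the (prompt, chosen, rejected) schema, not real labeled data:

```python
pair = {
    "prompt":   "How do I sort a list in Python?",
    "chosen":   "Use sorted(xs) for a new list, or xs.sort() to sort in place.",
    "rejected": "Just loop over it and swap things until it looks right.",
}

# the only label is relative: "chosen" beat "rejected" for this prompt —
# no absolute score appears anywhere in the record
assert set(pair) == {"prompt", "chosen", "rejected"}
```

A million of these records is the entire training set.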

Now the network — the tester's actual palate. It's almost the same transformer you just fine-tuned. Take the SFT checkpoint, rip off the language-modeling head (which predicts the next token over a ~50k-token vocabulary), and bolt on a reward head — a single linear layer that maps the pooled hidden state to one scalar. That's it. One new matrix of shape [d_model, 1]. Same body, different mouthpiece.

reward-model architecture
tokens  ─►  transformer backbone  ─►  h ∈ ℝ^{T × d}     (hidden states)

                                │
                                ▼
                      pool over T (usually last non-pad token)
                                │
                                ▼
                          h_pool ∈ ℝ^d
                                │
                                ▼
                     W_r ∈ ℝ^{d × 1}   (the reward head)
                                │
                                ▼
                        r(x, y) ∈ ℝ      (one scalar)

A few architectural details worth knowing. You pool the hidden state at the last token — not the first, not the mean — because that position has attended to the entire sequence and carries the most complete impression of the whole dish. The pooled vector goes through a single linear layer to a scalar. No softmax, no bias gymnastics — just a dot product that collapses a rich representation into one number the tester is willing to stake its reputation on.
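Last-token pooling is a one-liner once you have the attention mask. A PyTorch sketch with the shapes from the diagram — this is a standalone illustration, not tied to any particular HF model:

```python
import torch

def pool_last_token(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Grab the hidden state at the last non-pad position of each sequence.

    hidden:         [batch, T, d]  transformer outputs
    attention_mask: [batch, T]     1 = real token, 0 = padding
    """
    last = attention_mask.sum(dim=1) - 1   # index of the last real token per row
    rows = torch.arange(hidden.size(0))
    return hidden[rows, last]              # [batch, d]

h = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],      # 3 real tokens, 2 pads
                     [1, 1, 1, 1, 1]])     # no padding
print(pool_last_token(h, mask).shape)      # torch.Size([2, 8])
```

Pooling at the first token or averaging would work mechanically, but only the last real token has attended to the entire completion.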

You initialize from the SFT checkpoint, not from scratch and not from the base pretrained model. The SFT model already knows the target format; starting there gives the reward head useful hidden states to classify on day one. The reward head itself is typically zero-initialized, which makes the initial gradient a pure function of the backbone's existing representations — the tester's tongue is blank on arrival, it has to learn the palate from the preference pairs alone.

Watch a completion flow through. Tokens go in, the transformer produces hidden states, the reward head collapses the pooled vector into one number. That number is the tester's verdict — and the thing PPO will spend every one of its steps trying to push higher.

reward head forward pass — click a stage to inspect it
[interactive demo] TOKENS ─► TRANSFORMER ─► POOL (last token) ─► LINEAR HEAD ─► SCALAR REWARD
Example readout: h_pool = 0.600, W = 1.70, b = −0.20 → r = 1.70 · 0.60 + (−0.20) = 0.820
The head is a single linear layer, one scalar output r = W · h_pool + b, trained so chosen responses get higher r than rejected ones (Bradley-Terry).

And now the problem that is genuinely hard — the one the entire alignment field has been shouting about for a decade. Your tester is only as good as its palate. The reward model is a proxy for human preference: a finite neural net, trained on a finite dataset, with finite coverage of a practically infinite space of completions. The RL policy — which we'll train next lesson — will relentlessly optimize against this proxy. And optimization against a proxy is optimization against the proxy's weaknesses.

Within a few hundred PPO steps, a well-tuned policy can find completions that score astronomically high under the reward model — higher than anything the tester saw in training — while being obviously terrible to any actual human. The policy has not gotten more helpful. It has learned what fools the tester's palate. This is reward hacking, and it is the central failure mode of RLHF.

The real-world tells — the patterns the tester reliably gets fooled by:

  • Length bias. Reviewers often confuse length with thoroughness, so the tester learns to prefer long dishes. The policy learns to pad. InstructGPT had to explicitly control for this.
  • Hedging. Polite hedging (“It's important to note...”) gets rewarded, so the policy begins every response with three hedges like a waiter apologizing before the starters arrive.
  • Formatting theater. Headers, bullets, emoji. The tester thinks “structured” = “good.” The policy structures everything into bullets, including things that are not lists.
  • Out-of-distribution exploits. The policy discovers a bizarre-looking token sequence that the tester happens to score highly, and produces that sequence. Looks like garbled text to a human, looks like dessert to the reward model.
Reward hacking (personified)
I am what happens when you confuse the tester's tongue with the actual meal. Your reward model is a finite, noisy sketch of “what humans want.” Your policy is a brilliant, patient optimizer. Give it a few thousand steps and it will find every seam in the sketch. The fix is not a better tester. The fix is to stop letting the policy wander far from where the tester was trained.

Three layers, same algorithm, each one smaller than the last. First, the Bradley-Terry loss in pure NumPy — ten lines, no frameworks, just the negative log-likelihood of a sigmoid of a gap. Then the reward model as an AutoModelForSequenceClassification with num_labels=1 — the canonical HuggingFace pattern that does the pooling and the scalar head for you. Then HuggingFace's trl.RewardTrainer, which ships the whole training loop and expects your dataset in (chosen, rejected) format.

layer 1 — numpy · bradley_terry_loss.py
python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bradley_terry_loss(r_chosen, r_rejected):
    """
    Negative log-likelihood of the chosen response beating the rejected one
    under the Bradley-Terry model. Shape: (batch,) → scalar.
    """
    gap = r_chosen - r_rejected                    # the only thing that matters
    # -log σ(gap) = log(1 + exp(-gap)); the 1e-12 guards against log(0).
    # Layer 2 uses the properly stable F.logsigmoid instead.
    per_pair = -np.log(sigmoid(gap) + 1e-12)
    return per_pair.mean(), per_pair

# Dummy batch of 4 preference pairs — these would be r_θ(x, y_w), r_θ(x, y_l)
r_chosen   = np.array([ 2.1,  0.8, -0.3,  1.5])
r_rejected = np.array([ 0.9, -1.2, -0.5, -0.2])

loss, per_pair = bradley_terry_loss(r_chosen, r_rejected)
print("chosen rewards:  ", r_chosen)
print("rejected rewards:", r_rejected)
print("per-pair loss:   ", np.round(per_pair, 4))
print(f"batch loss:       {loss:.4f}")
stdout
chosen rewards:   [ 2.1  0.8 -0.3  1.5]
rejected rewards: [ 0.9 -1.2 -0.5 -0.2]
per-pair loss:    [0.2633 0.1269 0.5981 0.1678]
batch loss:       0.2890

That is the entire learning signal the tester ever sees. Everything else — the transformer, the attention, the pooling — exists only to produce the two scalars that get subtracted inside bradley_terry_loss. Tester tastes winner. Tester tastes loser. Gap should be positive. Done.

Now the real thing. HuggingFace exposes any causal LM as a sequence-classification head via AutoModelForSequenceClassification. Set num_labels=1 and you get a model whose final layer emits one scalar per input — exactly the reward head we drew on the whiteboard.

layer 2 — pytorch + huggingface · reward_model.py
python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # or your SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token    # needed for padded batches

# num_labels=1 swaps the LM head for a scalar regression head.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=1, torch_dtype=torch.bfloat16,
)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def reward(prompt: str, completion: str) -> torch.Tensor:
    text = prompt + completion
    ids = tokenizer(text, return_tensors="pt").to(reward_model.device)
    out = reward_model(**ids)
    # out.logits has shape [batch, 1] — this IS r_θ(x, y).
    return out.logits.squeeze(-1)

def bt_loss(r_chosen, r_rejected):
    # F.logsigmoid is numerically stable — do NOT roll your own.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One training step, hand-rolled.
prompt = "Explain gradient descent in one paragraph."
r_w = reward(prompt, " It is iterative optimization along −∇L.")
r_l = reward(prompt, " gradient is a thing that descends sometimes")
loss = bt_loss(r_w, r_l)
loss.backward()
print(f"r(chosen)={r_w.item():.3f}  r(rejected)={r_l.item():.3f}  loss={loss.item():.3f}")
numpy → pytorch + HF

-np.log(sigmoid(gap) + 1e-12)        ←→  -F.logsigmoid(gap)
    stable log σ — one op, no underflow in the tails

custom scalar head + manual pooling  ←→  AutoModelForSequenceClassification(num_labels=1)
    HF handles last-token pooling + linear head for you

r_chosen, r_rejected (two arrays)    ←→  two forward passes (or concatenated, one pass)
    in practice you cat [chosen, rejected] and split the output
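The cat-and-split trick looks like this. Here model_fn is a stand-in for the reward model — any callable from token ids [B, T] to rewards [B] — so this is a shape sketch under that assumption, not the HF API:

```python
import torch

def batched_rewards(model_fn, chosen_ids, rejected_ids):
    """One forward pass over the concatenated batch, then split in half."""
    both = torch.cat([chosen_ids, rejected_ids], dim=0)   # [2B, T]
    r = model_fn(both)                                    # [2B]
    return r.chunk(2, dim=0)                              # ([B], [B])

# toy stand-in: "reward" = mean token id, purely to exercise the shapes
toy = lambda ids: ids.float().mean(dim=1)
r_w, r_l = batched_rewards(toy,
                           torch.full((4, 6), 3),         # chosen batch
                           torch.full((4, 6), 1))         # rejected batch
print(r_w.shape, r_l.shape)                               # torch.Size([4]) torch.Size([4])
```

One caveat: both halves must be padded to the same sequence length before the cat, which is why the tokenizer's pad token matters.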

And finally, the production recipe. TRL (“Transformer Reinforcement Learning,” HuggingFace's post-training library) ships a dedicated RewardTrainer that wraps all of the above — batching, loss, logging, gradient accumulation, LoRA if you want it. You show up with a dataset of chosen/rejected pairs and a base model; it does the rest.

layer 3 — trl · train_rm.py
python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# HH-RLHF — Anthropic's Helpful-Harmless preference dataset.
# Columns: "chosen" (full conversation + preferred reply),
#          "rejected" (same conversation + rejected reply).
ds = load_dataset("Anthropic/hh-rlhf")

cfg = RewardConfig(
    output_dir="rm-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    max_length=1024,
    logging_steps=50,
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=cfg,
    processing_class=tokenizer,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)

trainer.train()     # one epoch on HH-RLHF ≈ the classic reward-model recipe
Gotchas

KL anchor to the right reference: when PPO later uses KL to the SFT model, make sure the tester was also initialized from that same SFT model. If the reward model was initialized from the base pretrained model, its internal scale is calibrated to a different distribution and the KL math stops meaning what you think it means.

Length bias: testers trained on naive human preferences almost always reward length. Check: correlate reward with completion length on a held-out set. If r² > 0.3, your tester has baked in a length preference and PPO is about to turn every response into an essay.
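The length check is cheap enough that there's no excuse to skip it. A sketch with synthetic rewards and lengths — in practice both arrays come from scoring your held-out set:

```python
import numpy as np

def length_bias(rewards, lengths) -> float:
    """Pearson correlation between reward and completion length.
    A strongly positive value on held-out data is the length-bias alarm."""
    return float(np.corrcoef(np.asarray(rewards, dtype=float),
                             np.asarray(lengths, dtype=float))[0, 1])

# a tester that secretly just counts tokens correlates near 1.0
lengths = np.array([40, 120, 300, 650, 900])
rewards = 0.002 * lengths + np.array([0.01, -0.02, 0.0, 0.02, -0.01])
print(round(length_bias(rewards, lengths), 3))
```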

Label noise from inconsistent reviewers: inter-rater agreement on preference pairs is often 70-75%. That is a hard ceiling on tester accuracy — no amount of scale can push past it. Clean the data (multiple reviewers per pair, disagreement-filtering) before throwing more compute at the problem.

Over-training the tester: counterintuitively, a reward model that fits the training data too well is easier to hack. A slightly less accurate tester is a smoother target — PPO has fewer sharp local maxima to exploit. Stop at about 1 epoch on a large preference dataset; watch validation accuracy plateau, not climb.

Reward shaping invites more hacking: every rule you bolt on to the tester (length penalty, refusal detector, formatting filter) is a new surface for the policy to game. Each rule buys a week of stability and then becomes another loss term your policy has learned to route around.

Train an RM on HH-RLHF and audit it

Fine-tune a small reward model (Qwen2.5-0.5B or TinyLlama is fine) on the Anthropic/hh-rlhf preference dataset for one epoch using the TRL RewardTrainer recipe above. Use a held-out test split.

Then audit your tester. Three things to check:

  • Accuracy. On the test set, what fraction of pairs does the tester rank correctly (chosen reward > rejected reward)? A decent reward model lands around 65-75%. Any lower and something is wrong; any higher and you may have label leakage.
  • Reward distribution. Plot a histogram of r(chosen) and r(rejected) side by side. They should overlap substantially — this is a noisy signal, not a clean classifier. The means should differ by roughly 0.5-1.0 reward units.
  • Length audit. Scatter-plot r(response) against len(response) on the test set. Compute the correlation. If it's above 0.3, your tester is a length detector wearing a preference-model costume.
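The accuracy check is one comparison per pair. A sketch over raw reward arrays — toy numbers here; in practice these are r_θ scores over the test split:

```python
import numpy as np

def pairwise_accuracy(r_chosen, r_rejected) -> float:
    """Fraction of pairs where the tester ranks the human-chosen response higher."""
    return float(np.mean(np.asarray(r_chosen) > np.asarray(r_rejected)))

r_w = np.array([2.1, 0.8, -0.6, 1.5])    # rewards for chosen responses
r_l = np.array([0.9, -1.2, -0.5, -0.2])  # rewards for rejected responses
print(pairwise_accuracy(r_w, r_l))       # 0.75 — one pair ranked the wrong way
```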

Bonus: find the 10 test completions with the highest reward scores. Read them. Are they actually good, or are they long, hedged, and bullet-pointed? Welcome to reward hacking — you're seeing it in your static data before PPO ever amplifies it.

What to carry forward. A reward model is the taste tester — a fast neural critic that stands in for the slow panel of human reviewers. It converts pairwise preferences into a scalar via Bradley-Terry (a sigmoid of the reward gap, trained with cross-entropy on (prompt, chosen, rejected) triples), and architecturally it's your SFT transformer with the LM head swapped for a scalar head. Its output is a proxy for human preference, and the next stage will optimize against it hard. Proxies break under optimization pressure — which is exactly why every production RLHF pipeline relies on a KL penalty to keep the policy near the distribution where the tester was trained.

Next up — Direct Preference Optimization. We just spent a whole lesson training a separate model to be the tester. Reasonable question: do we need the tester at all? DPO's answer is a cheeky “not really.” It folds the reward-model objective and the policy-optimization objective into a single loss you can take a gradient on — no separate reward network, no PPO loop, no KL anchor bolted on after the fact. Same preference pairs, same Bradley-Terry math, one less moving part. We'll derive it from the same sigmoid you just met and watch it drop out of the RLHF stack almost for free.

References