Actor-Critic
Combine policy learning with a value baseline.
Picture an actor on stage, performing a new play for the first time. The house lights are down, the audience is silent, and the actor has no idea whether the scene they're in the middle of is landing or dying. They commit to the line, the blocking, the pause. Only at the very end of the night — curtain down, applause or polite coughing — do they find out if the performance worked.
That is REINFORCE. The actor improvises an entire episode, reads the total reward at the end, and tries to retro-engineer which scenes were good and which were embarrassing. One episode you stumble into a standing ovation, the next you fall off a cliff, and the policy gradient swings wildly to match. It's the high-variance problem, and every performer knows it: you can't adjust mid-performance if the only feedback is at opening night.
Textbook fix: subtract a baseline from the return. The gradient stays unbiased, the variance drops. Fine. But what baseline? The best one, the one the math actually wants, is the state-value function V(s): the expected return from state s under the current policy. Subtract that from each episode's return and the update only moves when the action was better or worse than average from the state the actor happened to be in. Which is the whole point of grading a scene.
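The variance claim is easy to check numerically. Below is a hypothetical two-armed bandit (the probabilities, rewards, and sample count are all invented for illustration): subtracting the expected return as a baseline leaves the gradient estimate's mean untouched while its variance collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.5])            # current softmax policy over two actions
R = np.array([10.0, 11.0])          # deterministic reward per action (invented)
V = p @ R                           # the ideal baseline: expected return, 10.5

a = rng.choice(2, size=100_000, p=p)     # sampled actions
score = (a == 0).astype(float) - p[0]    # ∇ log π(a) wrt action-0 logit: 1[a=0] − p0

g_raw      = score * R[a]                # REINFORCE estimator, no baseline
g_baseline = score * (R[a] - V)          # same estimator with V subtracted

print(g_raw.mean(), g_baseline.mean())   # nearly identical: still unbiased
print(g_raw.var(),  g_baseline.var())    # ≈27.5 vs 0.0 (rewards are deterministic)
```

With deterministic rewards the baselined estimator is literally constant per sample; in a real environment the variance drops rather than vanishes, but the mean is preserved either way.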
You don't know V(s). So you hire someone to figure it out while the actor is still on stage. That's actor-critic: a playwright and a theater critic sharing one rehearsal room. Train a second network — the critic, V_φ — whose entire job is to evaluate the state the actor is standing in, and use its review as the baseline for the actor, π_θ. Actor performs the scene. Critic grades it in real time. Actor adjusts on the fly. Critic calibrates against what actually happens next. They rehearse together. That pairing is so useful that “actor-critic” stopped being an optional add-on and became the default shape of every modern policy-gradient algorithm — A2C, A3C, PPO, SAC, TRPO. Same bones underneath.
Actor (policy):   θ ← θ + α_θ · ∇log π_θ(a|s) · A(s, a)
Critic (value):   φ ← φ + α_φ · ∇V_φ(s) · (r + γ · V_φ(s') − V_φ(s))
                                          ╰───── δ, the TD error ─────╯
A(s, a) ≈ δ   (advantage ≈ TD error in the 1-step case)

Read those two lines like a script with stage directions. Both updates are driven by the same signal — r + γV_φ(s') − V_φ(s), the TD error. The critic reads it as a prediction error (“I said this scene was worth V(s); the real performance says it was worth r + γV(s'); here's my miscalibration”). The actor reads the same number as an advantage (“this line beat the critic's expectation by this much — do more of that”). One scalar, two readings, two gradient steps on two networks. That's the whole trick.
δ is the "surprise": reward plus discounted bootstrap, minus the old estimate. The same number drives BOTH updates.
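Concrete numbers help. These are invented for illustration: say the critic values the current state at 5, the next state at 6, and the observed reward is 1.

```python
gamma = 0.99
v_s, v_s2, r = 5.0, 6.0, 1.0       # critic's two forecasts and the observed reward

delta = r + gamma * v_s2 - v_s     # 1 + 5.94 − 5 = 1.94
# critic's reading: "I undershot s by 1.94" → pull V(s) toward r + γV(s')
# actor's reading:  "that action beat the forecast by 1.94" → raise π(a|s)
print(delta)
```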
The pipeline. State of the stage goes into both heads. Actor emits π(a|s), a distribution over next actions, samples one, and delivers the line to the environment. The environment returns (r, s') — the audience's instantaneous reaction and the shape of the next scene. Critic looks at V(s) and V(s'); the advantage A = r + γV(s') − V(s) is just “how much better was the scene than my forecast.” Then the same A flows into two places: scaled by ∇log π(a|s) it updates the actor, and as a plain scalar target it updates the critic. One forward pass, one environment step, two gradient steps. Repeat until opening night.
I perform the scene. I don't know whether the beat landed until the Critic in the back of the house tells me. My gradient is whatever the Critic hands me, multiplied by the log-probability of the line I just delivered. If the review comes back better than expected, I lean into that choice. If it comes back worse, I lean out. I don't have opinions about the set. I only have opinions about my actions, conditioned on whichever stage I find myself on.
Unpack the critic's review. What we really want is A(s, a) = Q(s, a) − V(s) — how much better was this particular action than the average action from the same scene. We can't compute Q(s, a) directly without replaying the whole performance, but we can estimate it with a one-step bootstrap: take the immediate reaction plus the discounted value the critic assigns to the next stage. Plug in and simplify.
A(s, a) = Q(s, a) − V(s)
        ≈ (r + γ · V_φ(s')) − V_φ(s)          Q estimated by 1-step TD
        = r + γ · V_φ(s') − V_φ(s)  =:  δ, the TD error

This is the review in mathematical form: that scene was better than average by δ. Low-variance (one step of the real performance; most of the signal comes from the critic's forecast) but biased — V_φ is only an approximation, and any miscalibration in it contaminates the grade. At the other extreme, the pure Monte-Carlo advantage waits for the curtain and uses the full episode return instead of r + γV(s') — unbiased, but noisy. The obvious question: can we interpolate between “grade the scene right now” and “wait for the reviews in the morning paper”?
Yes. That's Generalized Advantage Estimation. Schulman et al. (2015) wrote down a clean geometric sum over multi-step TD errors with a decay parameter λ.
δ_t = r_t + γ · V_φ(s_{t+1}) − V_φ(s_t)              single-step TD error

                 ∞
A_t^GAE(γ, λ) =  Σ  (γλ)^k · δ_{t+k}
                k=0
              = δ_t + γλ · δ_{t+1} + (γλ)² · δ_{t+2} + …

λ = 0    →  A_t = δ_t                pure TD: low variance, biased
λ = 1    →  A_t = G_t − V(s_t)       pure Monte Carlo: unbiased, high variance
λ = 0.95 →  the sweet spot in most PPO / A2C code

λ is the dial between “trust the critic” and “trust the audience.” At λ = 0 the actor takes the critic's one-step grade at face value after every scene (biased but smooth). At λ = 1 the actor ignores mid-performance reviews and reads the morning paper (unbiased but noisy). In practice λ ≈ 0.95 is what everybody ships — most of the variance reduction, most of the time, with bias kept tolerable.
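A quick self-contained check of the algebra on a toy 4-step rollout (rewards, values, and the bootstrap are all invented): the backward recursion equals the explicit geometric sum, and the λ = 0 and λ = 1 limits collapse to pure TD and pure Monte Carlo.

```python
import numpy as np

GAMMA, LAM = 0.99, 0.95
r = np.array([1.0, 0.0, 2.0, 1.0])      # toy rewards (invented)
v = np.array([0.5, 0.7, 0.4, 0.9])      # toy critic values V(s_0..s_3)
last_v = 0.3                            # bootstrap value V(s_4)

V = np.append(v, last_v)
delta = r + GAMMA * V[1:] - V[:-1]      # one-step TD errors δ_t

def gae(lam):
    """Backward recursion: A_t = δ_t + γλ · A_{t+1}."""
    adv, g = np.zeros(4), 0.0
    for t in reversed(range(4)):
        g = delta[t] + GAMMA * lam * g
        adv[t] = g
    return adv

# the recursion equals the explicit geometric sum Σ (γλ)^k · δ_{t+k}
explicit = np.array([sum((GAMMA * LAM) ** k * delta[t + k] for k in range(4 - t))
                     for t in range(4)])
assert np.allclose(gae(LAM), explicit)

# λ = 0 → pure TD error; λ = 1 → Monte-Carlo advantage G_t − V(s_t)
assert np.allclose(gae(0.0), delta)
G = np.array([sum(GAMMA ** k * r[t + k] for k in range(4 - t))
              + GAMMA ** (4 - t) * last_v for t in range(4)])
assert np.allclose(gae(1.0), G - v)
```

The λ = 1 identity is just the TD errors telescoping: summing γ^k · δ_{t+k} cancels every intermediate value estimate and leaves the discounted return minus V(s_t).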
Two learning curves on CartPole. Vanilla REINFORCE on the left — the actor performing blind, total reward climbing but with spikes and reversals every handful of episodes, every scene graded only at the end. A2C on the right — same environment, same policy architecture, but now there's a critic in the room delivering real-time reviews. Smoother. Faster. The gap between the two curves is variance reduction from a competent critic, full stop. No new policy-gradient theorem, no fancier optimizer. Just: hire the critic, listen to the review, move on.
I don't perform. I don't pick lines. I just sit in the back of the house and put a number on the stage. My loss is boring — MSE between my guess and a bootstrapped target. But every advantage the Actor sees flows through me, so if my reviews are sharp, the Actor's gradients are clean. If I'm miscalibrated, the Actor rehearses lies. I'm the quieter of the two, but I set the signal-to-noise ratio for the whole production.
This is the part that feels paradoxical the first time you see it: both networks are learning at the same time. The actor is adjusting its performance based on reviews from a critic who is still figuring out how to review. The critic is calibrating its reviews against a performance that keeps changing. Neither one ever converges to a fixed target — they're both moving, and each one's movement reshapes the other's gradient. It should spiral into nonsense, and in some implementations (wrong learning rates, no detach, terminal-state bugs) it does. In practice, when you wire it carefully, the two processes co-evolve: the actor's performances give the critic cleaner targets, the critic's sharpening reviews give the actor cleaner gradients, and the whole system settles into a working rehearsal room. That dance — two networks training each other on the same trajectory — is the signature of every actor-critic method.
Three implementations of the same rehearsal on CartPole. Pure Python first — two tiny linear heads, actor and critic, trained with manual gradients so you can see every term. Then NumPy with GAE. Then a full PyTorch A2C that would pass for a minimal library implementation.
import random, math
import gymnasium as gym
env = gym.make("CartPole-v1")
obs_dim, n_act, gamma = 4, 2, 0.99
# actor: logits over 2 actions; critic: scalar V(s). both linear.
W_pi = [[0.0]*obs_dim for _ in range(n_act)]
W_v = [0.0]*obs_dim
lr_pi, lr_v = 1e-2, 5e-3
def softmax(z):
m = max(z); e = [math.exp(v - m) for v in z]; s = sum(e)
return [x / s for x in e]
def policy(s):
logits = [sum(W_pi[a][i]*s[i] for i in range(obs_dim)) for a in range(n_act)]
return softmax(logits)
def value(s):
return sum(W_v[i]*s[i] for i in range(obs_dim))
for ep in range(200):
s, _ = env.reset()
done, G = False, 0.0
while not done:
p = policy(s)
a = 0 if random.random() < p[0] else 1
s2, r, term, trunc, _ = env.step(a); done = term or trunc
G += r
# one-step advantage: A = r + γV(s') − V(s) (zero the bootstrap if terminal)
v_s, v_s2 = value(s), 0.0 if done else value(s2)
adv = r + gamma * v_s2 - v_s
# critic: φ ← φ + α · adv · ∇V(s) (∇V(s) = s, since V is linear)
for i in range(obs_dim):
W_v[i] += lr_v * adv * s[i]
# actor: θ ← θ + α · adv · ∇log π(a|s) (softmax grad: (1_a − p) · s)
for act in range(n_act):
grad = ((1.0 if act == a else 0.0) - p[act])
for i in range(obs_dim):
W_pi[act][i] += lr_pi * adv * grad * s[i]
s = s2
if (ep + 1) % 50 == 0:
        print(f"ep {ep+1:3d} | return {G:6.1f}")

ep  50 | return   42.3
ep 100 | return   87.6
ep 150 | return  162.4
ep 200 | return  198.1
Even at this size it learns. The actor is two lines of softmax; the critic is a dot product. They share a state and a scene-by-scene review and that's enough. Now NumPy, with the upgrade to GAE — collect a rollout, compute the λ-weighted advantage, batch the update. This is the shape of every A2C/PPO training loop you'll ever read.
import numpy as np
import gymnasium as gym
env = gym.make("CartPole-v1")
OBS, ACT, GAMMA, LAM = 4, 2, 0.99, 0.95
W_pi = np.zeros((ACT, OBS))
W_v = np.zeros(OBS)
def softmax(z):
z = z - z.max(axis=-1, keepdims=True)
e = np.exp(z); return e / e.sum(axis=-1, keepdims=True)
def compute_gae(rewards, values, dones, last_val):
"""δ_t = r_t + γV_{t+1}(1-d) − V_t ; A_t = δ_t + γλ(1-d)·A_{t+1}"""
T = len(rewards)
adv = np.zeros(T); gae = 0.0
values = np.append(values, last_val) # pad with bootstrap
for t in reversed(range(T)):
nonterm = 1.0 - dones[t]
delta = rewards[t] + GAMMA * values[t+1] * nonterm - values[t]
gae = delta + GAMMA * LAM * nonterm * gae
adv[t] = gae
returns = adv + values[:-1] # target for critic
return adv, returns
# one update cycle after collecting an N-step rollout
def update(states, actions, advs, returns, lr_pi=1e-2, lr_v=5e-3):
global W_pi, W_v
# critic: MSE against bootstrapped returns
preds = states @ W_v
W_v += lr_v * ((returns - preds)[:, None] * states).mean(0)
# actor: policy gradient with GAE advantage (normalize — the field-standard trick)
advs = (advs - advs.mean()) / (advs.std() + 1e-8)
probs = softmax(states @ W_pi.T) # (N, ACT)
onehot = np.eye(ACT)[actions]
grad_logp = onehot - probs # ∇log π for softmax
    W_pi += lr_pi * (advs[:, None, None] * grad_logp[:, :, None] * states[:, None, :]).mean(0)

Two things to notice. compute_gae runs backward through the rollout — each A_t depends on A_{t+1}, so you sweep right-to-left and accumulate. It's the critic reading the performance in reverse, from the last scene outward, so each scene's review can borrow from the future. And we normalize advantages before the actor update: (A − mean) / std. This isn't in the math — it's an empirical stabilizer that the entire field ships. Don't fight it.
import torch, torch.nn as nn, torch.nn.functional as F
import gymnasium as gym
class ActorCritic(nn.Module):
def __init__(self, obs_dim, n_act, hidden=64):
super().__init__()
self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
nn.Linear(hidden, hidden), nn.Tanh())
self.actor = nn.Linear(hidden, n_act) # logits
self.critic = nn.Linear(hidden, 1) # V(s)
def forward(self, s):
h = self.shared(s)
return self.actor(h), self.critic(h).squeeze(-1)
env = gym.make("CartPole-v1")
net = ActorCritic(env.observation_space.shape[0], env.action_space.n)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)
GAMMA, LAM, N_STEPS = 0.99, 0.95, 2048
def rollout():
obs, _ = env.reset()
S, A, R, D, LP, V = [], [], [], [], [], []
for _ in range(N_STEPS):
s_t = torch.as_tensor(obs, dtype=torch.float32)
logits, v = net(s_t)
dist = torch.distributions.Categorical(logits=logits)
a = dist.sample()
nxt, r, term, trunc, _ = env.step(a.item()); done = term or trunc
S.append(s_t); A.append(a); R.append(r); D.append(float(done))
LP.append(dist.log_prob(a)); V.append(v) # both tracked
obs = env.reset()[0] if done else nxt
_, last_v = net(torch.as_tensor(obs, dtype=torch.float32))
return S, A, R, D, LP, V, last_v
def compute_gae(R, D, V, last_v):
adv, gae = [0.0]*len(R), 0.0
V = V + [last_v]
for t in reversed(range(len(R))):
nt = 1.0 - D[t]
        # detach: the critic's values serve as targets here, not functions to differentiate
        delta = R[t] + GAMMA * V[t+1].detach() * nt - V[t].detach()
        gae = delta + GAMMA * LAM * nt * gae
        adv[t] = gae
    return torch.stack(adv).float()
for it in range(500):
S, A, R, D, LP, V, last_v = rollout()
adv = compute_gae(R, D, V, last_v)
returns = adv + torch.stack(V).detach() # critic target
adv = (adv - adv.mean()) / (adv.std() + 1e-8)
log_probs = torch.stack(LP)
values = torch.stack(V)
actor_loss = -(log_probs * adv).mean() # gradient ascent → −
critic_loss = F.mse_loss(values, returns)
    neg_entropy = -torch.stack([
        torch.distributions.Categorical(logits=net(s)[0]).entropy() for s in S
    ]).mean()  # negative mean entropy: minimizing it rewards exploration
    loss = actor_loss + 0.5 * critic_loss + 0.01 * neg_entropy
    opt.zero_grad(); loss.backward(); opt.step()

From the hand-rolled versions to PyTorch, piece by piece:

- manual log-softmax gradient (1_a − p) · s ←→ dist.log_prob(a) plus loss.backward() — autograd handles the policy gradient; you only write the loss
- hand-rolled V = w·s ←→ self.critic = nn.Linear(hidden, 1) — the critic is just a second head on a shared trunk: one network, two outputs
- adv = r + γV(s') − V(s) ←→ compute_gae(R, D, V, last_v) — one-step TD swapped for the λ-weighted multi-step estimate; same advantage role
- separate lr_pi and lr_v ←→ one weighted loss (actor + 0.5 · critic + 0.01 · entropy term) — a single optimizer over the sum: the standard A2C recipe
Using the TD target for the actor: the actor's gradient multiplies ∇log π by the advantage, not by the raw target r + γV(s'). If you pass in the target, every action at a good stage looks good — the policy gradient becomes “do more of whatever you did in scenes with high value,” which is noise. Subtract V(s). Always.
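A two-line numeric illustration (state values and reward invented): in a high-value state, the raw TD target stays positive even for a bad action, while the advantage correctly goes negative.

```python
gamma = 0.99
v_s, v_s2 = 10.0, 10.0             # a high-value state; next state equally good
r = -0.5                           # but this particular action went badly

target = r + gamma * v_s2          # ≈ 9.4: positive, so the raw target says "do more"
adv = target - v_s                 # ≈ −0.6: negative, the advantage says "do less"
print(target, adv)
```

Multiply ∇log π by target and the actor reinforces every action taken in good states, including the bad ones; subtracting V(s) restores the sign.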
Forgetting to detach the critic for the actor: when computing A = r + γV(s') − V(s) and feeding A into the actor loss, you must .detach() it (or its components). Otherwise the actor loss's backward pass flows into the critic too, and you're training the critic to make the advantage small — the opposite of what you want. From the actor's perspective the critic is a review, not a learnable quantity.
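A minimal sketch of the failure, using a throwaway linear critic and a fabricated log-probability (all numbers illustrative): backprop through an un-detached advantage deposits gradient in the critic's weights.

```python
import torch

torch.manual_seed(0)
critic = torch.nn.Linear(4, 1)                    # stand-in critic V_φ
s, s2 = torch.randn(4), torch.randn(4)
r, gamma = 1.0, 0.99
log_pi = torch.tensor(-0.7, requires_grad=True)   # pretend log π(a|s)

adv = r + gamma * critic(s2) - critic(s)          # still attached to the critic graph

# Wrong: the actor loss backprops into the critic's weights.
(-log_pi * adv).sum().backward()
leaked = critic.weight.grad.abs().sum().item()
print(leaked)                                     # nonzero: the critic got "trained"

critic.zero_grad(set_to_none=True)
# Right: detach the advantage; only log_pi receives gradient.
(-log_pi * adv.detach()).sum().backward()
print(critic.weight.grad)                         # None: critic untouched
```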
Wrong sign: gradient ascent on log π(a|s) · A is gradient descent on −log π(a|s) · A. In PyTorch you minimize losses, so the actor loss has a minus. Flip the sign accidentally and your actor rehearses how to lose.
Bootstrapping past a terminal state: at the end of an episode, V(s') is conceptually zero — the curtain came down, there is no next scene. If you forget the (1 − done) mask in r + γ(1−d)V(s') − V(s), you'll feed in the value of the new episode's opening scene as if it belonged to the previous performance. Invisible bug, catastrophic in practice.
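Invented numbers make the damage visible: suppose the critic thinks the dying state is worth 48 and values the fresh reset state at 50. Forget the mask and the sign of the final update flips.

```python
gamma = 0.99
r, v_s = 1.0, 48.0          # final step of the episode; critic valued s at 48
v_reset = 50.0              # critic's value for the NEW episode's first state

wrong = r + gamma * v_reset - v_s   # ≈ +2.5: "great scene!", bootstrapped across the curtain
right = r + gamma * 0.0 - v_s       # −47.0: the honest end-of-episode signal
print(wrong, right)
```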
Start from a working REINFORCE implementation on CartPole-v1 (you have one from the last lesson). Add a second head to your network for V(s) — the critic's seat in the house — and a value-loss term to the optimizer. Use a one-step TD advantage: A = r + γV(s')(1−d) − V(s), detached from the critic graph when it feeds into the actor loss.
Plot REINFORCE vs A2C on the same axes, averaged over 5 seeds. You should see A2C solve CartPole (reward ≥ 195) in roughly half the episodes, with visibly lower variance between seeds — the actor getting real-time reviews instead of waiting for the curtain.
Then upgrade to GAE with λ = 0.95: collect a 2048-step rollout, compute advantages backward through the trajectory, normalize them, and apply one big batched update. Compare three curves now — REINFORCE, A2C (1-step), A2C + GAE. The spacing between them is the whole story of variance reduction.
Bonus: set λ = 0 and λ = 1 explicitly and re-run. You should see λ = 0 learn fastest but hit a lower ceiling (critic bias), λ = 1 match Monte-Carlo noise, and λ = 0.95 comfortably win.
What to carry forward. Every serious policy-gradient algorithm after REINFORCE has a critic. The critic's job is to be a good reviewer; the actor's job is to move in the direction the critic points. The advantage — r + γV(s') − V(s) — is the shared scalar they negotiate over, and GAE gives you a knob (λ) to trade bias against variance in how aggressively you trust the review. Normalize your advantages. Detach your critic outputs when they feed the actor. Mask terminal bootstraps. Those three lines of discipline separate a working production from a silently broken one.
Next up — Proximal Policy Optimization. A2C has one remaining fragility: a single bad batch can blow up the policy, because there's nothing stopping the actor from making a huge step away from the performance the critic was grading. The critic's reviews become stale instantly; the actor wanders into scenes nobody has evaluated; the training spirals. PPO fixes that with a clipped importance ratio that lets you safely take multiple gradient steps on the same batch without drifting too far from the old policy. It's the algorithm behind ChatGPT's RLHF, the default baseline in robotics, and — structurally — nothing more than A2C plus a min(ratio · A, clip(ratio) · A). You already have the bones. After PPO, the curriculum leaves learning behind and turns to inference and serving: once you have a trained model, how do you ship it cheaply and fast? Opens with Quantization Basics — the first of the (almost) free lunches.
- [01] Konda & Tsitsiklis, “Actor-Critic Algorithms” · NeurIPS 2000 — the original formalization
- [02] Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning” · ICML 2016 — the A3C paper
- [03] Schulman, Moritz, Levine, Jordan & Abbeel, “High-Dimensional Continuous Control Using Generalized Advantage Estimation” · ICLR 2016 — GAE
- [04] Sutton & Barto, “Reinforcement Learning: An Introduction” · Chapter 13 — Policy Gradient Methods