Learn from reward signals — the algorithms behind AlphaGo and RLHF
States, actions, rewards, transitions — the RL contract.
Learn a value function from experience, one update at a time.
Optimize the policy directly via gradient ascent on expected reward.
The simplest policy-gradient algorithm, and its variance problem.
Combine policy learning with a value baseline.
The stable RL algorithm most commonly used in RLHF.