Reinforcement Learning

Learn from reward signals — the algorithms behind AlphaGo and RLHF

6 lessons · 2 medium · 4 hard

Lessons (in order)

States, actions, rewards, transitions — the RL contract.

Learn a value function from experience, one update at a time.

Optimize the policy directly via gradient ascent on expected reward.

The cleanest policy-gradient algorithm — and its variance problem.

Combine policy learning with a value baseline.

The stable RL algorithm behind RLHF.