From a base model to an aligned, instruction-following assistant
Turn a base model into an instruction-follower.
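The core of instruction tuning is the data format: prompt and response are concatenated, and the loss is masked so only response tokens are trained on. A minimal sketch, with made-up token ids standing in for a real tokenizer:

```python
# Supervised fine-tuning (SFT) data preparation sketch. The token ids
# below are illustrative placeholders, not output of a real tokenizer.

def build_sft_example(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response; mask the prompt with ignore_index
    so the loss is computed only on the assistant's response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

input_ids, labels = build_sft_example([101, 7, 8], [9, 10, 102])
```

The `-100` label value follows the common convention of cross-entropy losses that skip that index; the exact sentinel depends on the training framework.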
Low-rank adapters (LoRA) — fine-tune roughly 0.1% of the parameters.
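The idea in one equation: freeze the base weight W and learn a rank-r update, y = Wx + (alpha/r)·B(Ax). A numpy sketch with toy shapes (at this toy size the trainable fraction is around 11%; for a billion-parameter model with the same small r it drops to a fraction of a percent):

```python
import numpy as np

# LoRA sketch: W is frozen; only the low-rank factors A and B train.
d_out, d_in, r, alpha = 64, 64, 4, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                    # trainable, zero-init so the
                                            # adapter starts as a no-op

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)
trainable = A.size + B.size                 # 512 adapter parameters
frac = trainable / (W.size + trainable)     # trainable fraction
```

Zero-initializing B is the standard trick: the adapted model is exactly the base model at step 0, so training starts from known-good behavior.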
LoRA on 4-bit weights — fine-tune a 70B-parameter model on a single GPU.
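The memory win comes from storing the frozen base weights in 4 bits and keeping only the small LoRA factors in full precision. A sketch using simple symmetric absmax int4 quantization as a stand-in for the real NF4 scheme (an assumption, not the actual kernel):

```python
import numpy as np

# QLoRA sketch: 4-bit frozen base weight, dequantized on the fly,
# plus a full-precision trainable LoRA correction.
def quantize_4bit(W):
    scale = np.abs(W).max() / 7.0           # symmetric int4 range: -7..7
    q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16)).astype(np.float32)
q, scale = quantize_4bit(W)                 # stored: int4 codes + one scale
W_hat = dequantize(q, scale)                # reconstructed weight

r, alpha = 2, 4
A = rng.standard_normal((r, 16)).astype(np.float32) * 0.01
B = np.zeros((16, r), dtype=np.float32)

def qlora_forward(x):
    # frozen quantized base + trainable LoRA update
    return dequantize(q, scale) @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(16).astype(np.float32)
y = qlora_forward(x)
```

Gradients flow only into A and B; the quantization error in W_hat is bounded by half a quantization step per weight, and the LoRA update can partially compensate for it during training.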
Train a preference model from human pairwise comparisons.
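The standard objective here is the Bradley-Terry pairwise loss: the reward model scores both responses, and training minimizes -log sigmoid(r_chosen - r_rejected). A minimal sketch of the loss itself:

```python
import numpy as np

# Bradley-Terry pairwise preference loss: push the chosen response's
# reward score above the rejected one's.
def preference_loss(r_chosen, r_rejected):
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```

At zero margin the loss is log 2 (the model is indifferent), and it falls monotonically as the chosen response is scored further above the rejected one.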
Policy optimization against a learned reward model.
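Assuming this refers to PPO-style RLHF, the heart of the update is the clipped surrogate objective: the ratio of new to old policy probabilities is clipped so a single step cannot move the policy far from the policy that generated the samples. A per-token sketch:

```python
import numpy as np

# PPO clipped surrogate for a single token/action.
# eps is the clip range; the min makes the objective pessimistic, so
# clipping never rewards moving further than eps from the old policy.
def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

In full RLHF pipelines this objective is typically combined with a KL penalty against the frozen SFT model to keep generations on-distribution; that term is omitted in this sketch.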
RLHF without a separate reward model — the elegant alternative.
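Assuming this line describes Direct Preference Optimization (DPO): the policy's log-probability margin over a frozen reference model acts as an implicit reward, so preference pairs train the policy directly, no separate reward model or RL loop. A sketch of the loss:

```python
import numpy as np

# DPO loss sketch: beta scales the implicit reward
# (policy log-prob minus frozen reference log-prob) on each response.
def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference (or lowering the rejected one's) drives the loss down.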