From a base model to an aligned, instruction-following assistant
Turn a base model into an instruction-follower.
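The core of instruction tuning is the data format: prompt and response are concatenated, and the loss is masked so only response tokens are trained on. A minimal sketch, with made-up token ids standing in for a real tokenizer:

```python
# Supervised fine-tuning (SFT) data preparation sketch. The token ids
# below are illustrative placeholders, not output of a real tokenizer.

def build_sft_example(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response; mask the prompt with ignore_index
    so the loss is computed only on the assistant's response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

input_ids, labels = build_sft_example([101, 7, 8], [9, 10, 102])
```

The `-100` label value follows the common convention of cross-entropy losses that skip that index; the exact sentinel depends on the training framework.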
Low-rank adapters (LoRA) — fine-tune roughly 0.1% of the parameters.
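The idea in one equation: freeze the base weight W and learn a rank-r update, y = Wx + (alpha/r)·B(Ax). A numpy sketch with toy shapes (at this toy size the trainable fraction is around 11%; for a billion-parameter model with the same small r it drops to a fraction of a percent):

```python
import numpy as np

# LoRA sketch: W is frozen; only the low-rank factors A and B train.
d_out, d_in, r, alpha = 64, 64, 4, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                    # trainable, zero-init so the
                                            # adapter starts as a no-op

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)
trainable = A.size + B.size                 # 512 adapter parameters
frac = trainable / (W.size + trainable)     # trainable fraction
```

Zero-initializing B is the standard trick: the adapted model is exactly the base model at step 0, so training starts from known-good behavior.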
LoRA on 4-bit weights — fine-tune a 70B-parameter model on a single GPU.
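The memory win comes from storing the frozen base weights in 4 bits and keeping only the small LoRA factors in full precision. A sketch using simple symmetric absmax int4 quantization as a stand-in for the real NF4 scheme (an assumption, not the actual kernel):

```python
import numpy as np

# QLoRA sketch: 4-bit frozen base weight, dequantized on the fly,
# plus a full-precision trainable LoRA correction.
def quantize_4bit(W):
    scale = np.abs(W).max() / 7.0           # symmetric int4 range: -7..7
    q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16)).astype(np.float32)
q, scale = quantize_4bit(W)                 # stored: int4 codes + one scale
W_hat = dequantize(q, scale)                # reconstructed weight

r, alpha = 2, 4
A = rng.standard_normal((r, 16)).astype(np.float32) * 0.01
B = np.zeros((16, r), dtype=np.float32)

def qlora_forward(x):
    # frozen quantized base + trainable LoRA update
    return dequantize(q, scale) @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(16).astype(np.float32)
y = qlora_forward(x)
```

Gradients flow only into A and B; the quantization error in W_hat is bounded by half a quantization step per weight, and the LoRA update can partially compensate for it during training.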
Train a preference model from human pairwise comparisons.
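The standard objective here is the Bradley-Terry pairwise loss: the reward model scores both responses, and training minimizes -log sigmoid(r_chosen - r_rejected). A minimal sketch of the loss itself:

```python
import numpy as np

# Bradley-Terry pairwise preference loss: push the chosen response's
# reward score above the rejected one's.
def preference_loss(r_chosen, r_rejected):
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```

At zero margin the loss is log 2 (the model is indifferent), and it falls monotonically as the chosen response is scored further above the rejected one.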
Policy optimization against a learned reward model.
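Assuming this refers to PPO-style RLHF, the heart of the update is the clipped surrogate objective: the ratio of new to old policy probabilities is clipped so a single step cannot move the policy far from the policy that generated the samples. A per-token sketch:

```python
import numpy as np

# PPO clipped surrogate for a single token/action.
# eps is the clip range; the min makes the objective pessimistic, so
# clipping never rewards moving further than eps from the old policy.
def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

In full RLHF pipelines this objective is typically combined with a KL penalty against the frozen SFT model to keep generations on-distribution; that term is omitted in this sketch.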
RLHF without a separate reward model — the elegant alternative.
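Assuming this line describes Direct Preference Optimization (DPO): the policy's log-probability margin over a frozen reference model acts as an implicit reward, so preference pairs train the policy directly, no separate reward model or RL loop. A sketch of the loss:

```python
import numpy as np

# DPO loss sketch: beta scales the implicit reward
# (policy log-prob minus frozen reference log-prob) on each response.
def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference (or lowering the rejected one's) drives the loss down.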