LoRA
Low-rank adapters — fine-tune 0.1% of the parameters.
Picture the base model as a 700-page textbook. Every page is packed with equations the pretrained weights have memorised — grammar, world facts, the shape of code, the rhythm of dialogue. You want to fine-tune it on your own data. Fine — except fine-tuning in the textbook analogy means reprinting the entire book. New ink for every page, even the ones you had no quarrel with. For a 70B model that's roughly 700GB of weights plus optimizer state — fp16 parameters plus two fp32 AdamW moments for every weight. A single H100 has 80GB. A consumer RTX 4090 has 24GB. The math does not care what you want. Reprinting the book, on your hardware, is not happening.
There is a saner move. Leave the textbook alone. Paste a handful of tiny sticky notes in the margins that override a handful of the equations. At inference, the reader reads the book plus the stickies — and the stickies win wherever they sit. You've changed the effective contents of the book without touching a single printed page. That is the entire pitch behind parameter-efficient fine-tuning, and the cleanest version of it — the one that now runs inside basically every open model on HuggingFace — is LoRA, Low-Rank Adaptation, from Hu et al. in 2021.
One sentence: instead of updating a weight matrix W directly, learn a small low-rank correction that sits in the margin next to it. The book stays frozen. Ninety-nine-plus percent of the parameters never move. And, with the quantisation trick the next lesson adds, you can fine-tune a 70B model on a single GPU.
This lesson derives the sticky, walks the parameter arithmetic, shows why picking which pages to annotate matters as much as how big each note is, and builds a LoRALinear module from scratch in three layers — pure NumPy, a hand-rolled PyTorch Module, then the one-liner you'd actually use via the peft library.
I reprint the book. Every page, every equation, fresh ink. For a 70B model that means holding 70B parameters in fp16 plus 70B first-moment and 70B second-moment estimators in fp32 for AdamW — roughly 700GB just to take one step of gradient descent. If you can't afford the printing press, you can't afford me. I am not sorry.
The observation that makes the sticky-note trick work is empirical. When you fine-tune a pretrained language model, the correction you'd apply to each weight matrix — call it ΔW — turns out to be very close to low-rank. The textbook has already learned most of what it needs during pretraining; fine-tuning is a small, structured override on top. Aghajanyan et al. (2020) called this the intrinsic dimensionality of fine-tuning and showed you could adapt a BERT model by tuning as few as 200 scalar parameters along the right direction. That's not a margin note. That's a Post-it.
So here's the LoRA decomposition — the sticky, opened up. A weight matrix W is d × d. A full-rank correction would also be d × d, d² trainable scalars per matrix, a fresh page of ink per matrix. Don't do that. Write the sticky as two skinny strips — one tall, one wide — whose product is the full-size correction:
W_effective = W₀ + B · A
              ──   ─────
     frozen book   sticky note (two strips)

where A ∈ ℝ^(r × d)  ← one thin horizontal strip
      B ∈ ℝ^(d × r)  ← one thin vertical strip
      r ≪ d, typically r ∈ {4, 8, 16, 32, 64}

W₀ is the original pretrained weight — the printed page. You freeze it; you never touch it. The only things that move during training are the two strips A and B. Their product B · A is a d × d matrix just like ΔW would have been — same shape as the page it overrides — but by construction it has rank at most r. You've swapped a full-page rewrite for a margin note whose information content is deliberately bottlenecked. The savings are enormous.
Now slide the rank and watch the two strips grow. At r = d the sticky is the same size as the page — no savings; you've just reinvented full fine-tuning with extra steps. At r = 1 the sticky collapses to a single outer product — one column times one row — and the parameter count is 2d. Real LoRA sits in the middle: small enough to be cheap, wide enough to carry the override.
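The slider arithmetic is easy to reproduce. A minimal sketch in plain Python, with d fixed at 4096 to match the Llama-2 7B hidden size used later in the lesson:

```python
# Parameter count of one LoRA "sticky" for a single d x d weight matrix,
# swept across rank r. d = 4096 matches Llama-2 7B's hidden size.
d = 4096
full = d * d  # reprinting the page: a full-rank d x d correction

for r in [1, 4, 8, 16, 64, d]:
    sticky = 2 * r * d  # strips A (r x d) and B (d x r)
    print(f"r={r:>4}: {sticky:>10,} params, {full / sticky:g}x fewer than full")
# note: at r = d the "savings" ratio drops to 0.5, i.e. the sticky now costs
# twice the page. The useful regime is r far below d.
```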
I am the nudge you'd have learned anyway. Your task is easy relative to what the textbook already knows — a correction, not a rewrite. I can fit that correction into a few million parameters instead of a few billion. At inference the reader sees book plus sticky — same shape, same behavior. You just paid 128× less to get here.
Let's do the arithmetic on a real page. A Llama-2 7B has a hidden size of d = 4096. One attention weight matrix in that model has:
reprint the page:  d × d = 4096 × 4096 = 16,777,216 params
sticky (r = 16):   2 · r · d = 2 · 16 · 4096 = 131,072 params
ratio:             16,777,216 / 131,072 = 128× fewer

One hundred and twenty-eight times fewer trainable scalars per sticky. A 7B model has hundreds of weight matrices across its attention and MLP layers, and LoRA typically targets a subset of them. In practice you end up training something like 0.1% – 1% of the original parameter count. The optimizer state — the thing that actually kills you during full fine-tuning — shrinks by the same factor. Suddenly fine-tuning a 7B model fits on a consumer GPU.
Here's where it gets more interesting. You don't have to stick a note on every page — you choose which pages to annotate. Attention has four projections per layer (q_proj, k_proj, v_proj, o_proj) and the MLP has three more (gate, up, down in a SwiGLU block). Each target you mark costs parameters and adds override capacity, and the tradeoff is worth making deliberately.
The original LoRA paper only annotated q_proj and v_proj — query and value — and got nearly full fine-tuning quality. That convention stuck for years. More recent work (QLoRA, and the HuggingFace PEFT defaults) sticks a note on every linear layer in the transformer, MLP included. It costs more parameters but reliably picks up a couple of points on harder benchmarks. Rule of thumb: start with attention-only at r = 16; if the task is underfit, extend the stickies to the MLP before cranking rank higher.
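To make the tradeoff concrete, here is a hedged sketch of the parameter cost of each choice for a Llama-2-7B-shaped model (hidden size 4096, SwiGLU intermediate size 11008, 32 layers). The shapes table and the lora_params helper are illustrative, not a peft API:

```python
# Parameter cost of each target-module choice, Llama-2-7B-shaped:
# hidden 4096, SwiGLU intermediate 11008, 32 layers, r = 16.
hidden, inter, layers, r = 4096, 11008, 32, 16

shapes = {  # (out_features, in_features) per targeted matrix
    "q_proj": (hidden, hidden), "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden), "o_proj": (hidden, hidden),
    "gate_proj": (inter, hidden), "up_proj": (inter, hidden),
    "down_proj": (hidden, inter),
}

def lora_params(targets):
    # each (out, in) matrix gets strips B (out x r) and A (r x in)
    return layers * sum(r * (o + i) for o, i in (shapes[t] for t in targets))

print(f"q,v only (paper):   {lora_params(['q_proj', 'v_proj']):>11,}")
print(f"all attention:      {lora_params(['q_proj', 'k_proj', 'v_proj', 'o_proj']):>11,}")
print(f"every linear layer: {lora_params(list(shapes)):>11,}")
# q,v only: 8,388,608 · all attention: 16,777,216 · every linear: 39,976,960
```

Even the "every linear layer" setting is about 0.6% of a 7B base — the QLoRA-style default is still cheap in absolute terms.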
I'm the width of the sticky. Set me to 4 and you get a narrow, frugal override — fast, tiny notes, fine for style transfer or simple instruction-following. Crank me to 64 and I approach full-reprint quality at a few percent of the cost. Most people leave me at 8 or 16 and never look back. Doubling me doubles the trainable strip area, but the quality curve flattens fast. More rank is not more better.
Three layers, one module. Start with the forward pass in pure NumPy so you can see every matrix multiply that reads the book and applies the sticky; then wrap it in a PyTorch nn.Module that replaces nn.Linear one-for-one; then swap the whole thing for peft, which sticks the notes for you.
import numpy as np
np.random.seed(0)
d, r, batch = 128, 8, 4
alpha = 16 # LoRA scaling hyperparameter
# The frozen book — pretrained weight we never reprint
W0 = np.random.randn(d, d) * 0.02
# The sticky note, written as two skinny strips
# A is Gaussian, B is zeros so the sticky starts at EXACTLY zero override
A = np.random.randn(r, d) * 0.01 # (r, d) — horizontal strip
B = np.zeros((d, r)) # (d, r) — vertical strip, zero-init
x = np.random.randn(batch, d) # inputs
# Plain book output: y₀ = x W₀ᵀ
# Book + sticky: y = x W₀ᵀ + (α/r) · x Aᵀ Bᵀ
y0 = x @ W0.T
sticky = (alpha / r) * (x @ A.T) @ B.T # the low-rank override
y = y0 + sticky
print("W0 output shape: ", y0.shape)
print("LoRA output shape: ", y.shape)
print(f"trainable params: A={A.shape}={A.size}, B={B.shape}={B.size} → {A.size + B.size} (vs full {d*d})")
print(f"scale α/r = {alpha/r} — the knob that controls how loud the sticky is")

Output:

W0 output shape:   (4, 128)
LoRA output shape: (4, 128)
trainable params: A=(8, 128)=1024, B=(128, 8)=1024 → 2048 (vs full 16384)
scale α/r = 2.0 — the knob that controls how loud the sticky is
Three things to notice. A is Gaussian, B is zero. That init is on purpose — at step zero, B · A = 0, so the sticky is blank and the annotated model is bit-identical to the base book. You start from exactly the pretrained behavior and move away from it. Initialise both as Gaussian instead and the sticky arrives pre-scribbled with random noise; the first forward pass corrupts the book's output and training spends hundreds of steps clawing back what you broke.
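You can check the blank-sticky claim directly. This rerun of the setup above (fresh RNG, same shapes) asserts that zero-init B leaves the output identical to the base, and that a Gaussian B does not:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, batch, alpha = 128, 8, 4, 16
W0 = rng.standard_normal((d, d)) * 0.02
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))                       # blank sticky
x = rng.standard_normal((batch, d))

y0 = x @ W0.T
y = y0 + (alpha / r) * (x @ A.T) @ B.T
assert np.array_equal(y, y0)               # bit-identical to the base book

# swap in a Gaussian B and the very first forward pass drifts off the book
B_bad = rng.standard_normal((d, r)) * 0.01
y_bad = y0 + (alpha / r) * (x @ A.T) @ B_bad.T
print("max drift, zero-init B:", np.abs(y - y0).max())        # 0.0
print("max drift, Gaussian B: ", float(np.abs(y_bad - y0).max()))
```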
The scale factor α/r. This is the LoRA paper's α hyperparameter divided by rank. It keeps the effective volume of the sticky roughly constant as you change r, so you don't have to re-tune learning rate when you sweep rank. Most configs use α = 2r, which makes the scale a clean 2.
We compute x @ A.T @ B.T, never form B @ A directly. Forming the d × d product would defeat the whole point — you'd materialise the full-size correction in memory and have accidentally reprinted the page. Left-to-right through the low-rank bottleneck keeps memory at O(batch · r). The strip stays a strip.
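A minimal demonstration of why the multiplication order matters. Both orderings give the same answer; only one of them ever holds a d × d intermediate:

```python
import numpy as np

d, r, batch = 4096, 16, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((r, d)).astype(np.float32)
B = rng.standard_normal((d, r)).astype(np.float32)
x = rng.standard_normal((batch, d)).astype(np.float32)

# left-to-right: the activation squeezes through the rank-r bottleneck
bottleneck = x @ A.T          # (batch, r): only 128 floats live here
out = bottleneck @ B.T        # (batch, d)

# forming B @ A first materialises the full-size page: d*d floats
full = x @ (B @ A).T
print("bottleneck elements:", bottleneck.size)   # 8 * 16 = 128
print("B @ A elements:     ", d * d)             # 16,777,216
print("same result:", np.allclose(out, full, rtol=1e-3, atol=1e-3))
```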
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear: frozen book + trainable sticky."""
    def __init__(self, in_features, out_features, r=8, alpha=16, bias=True):
        super().__init__()
        self.r, self.alpha, self.scale = r, alpha, alpha / r
        # The book — frozen page, never reprinted
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        for p in self.linear.parameters():
            p.requires_grad = False  # FREEZE — this is the point
        # The sticky — two trainable strips
        self.A = nn.Parameter(torch.empty(r, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero-init B
        nn.init.kaiming_uniform_(self.A, a=5**0.5)  # same uniform init nn.Linear uses

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T) @ self.B.T

# ---- use it ----
layer = LoRALinear(4096, 4096, r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print("LoRALinear(d_in=4096, d_out=4096, r=16)")
print(f"  trainable: {trainable} / {total} ({100*trainable/total:.2f}%)")

# one fake training step
x = torch.randn(8, 4096)
target = torch.randn(8, 4096)
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-3)
loss0 = ((layer(x) - target)**2).mean(); loss0.backward(); opt.step()
with torch.no_grad():
    loss1 = ((layer(x) - target)**2).mean()
print(f"loss before step: {loss0.item():.4f} | after: {loss1.item():.4f}")

Output:

LoRALinear(d_in=4096, d_out=4096, r=16)
  trainable: 131072 / 16912384 (0.78%)
loss before step: 1.2418 | after: 1.2006
W0 = np.random.randn(d, d) * 0.02  ←→  self.linear = nn.Linear(...); freeze() — the frozen book — requires_grad=False on every page
A = np.random.randn(r, d), B = np.zeros  ←→  nn.Parameter + kaiming_uniform / zeros — same two strips, now tracked by autograd
y = x @ W0.T + (α/r) * x @ A.T @ B.T  ←→  self.linear(x) + self.scale * (x @ A.T) @ B.T — book + sticky — parens are load-bearing (rank bottleneck)
In real life you don't hand-write the sticky holder. HuggingFace's peft library walks the model graph, pattern-matches module names, and swaps in LoRA wrappers at runtime — automated sticky placement. You declare the config, call get_peft_model, train normally.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank — width of each sticky
    lora_alpha=32,        # α = 2r keeps scale = 2
    lora_dropout=0.05,    # small dropout on the sticky path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",          # don't also train biases
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
# From here, train with any Trainer / custom loop. At the end,
# model.save_pretrained("adapter/") writes ONLY the A and B strips —
# the book stays on disk once, shared across any number of stickies.
LoRALinear(q_proj); LoRALinear(v_proj); ...  ←→  LoraConfig(target_modules=[...]) — peft walks the graph and sticks a note on each matched page
for p in base.parameters(): p.requires_grad = False  ←→  get_peft_model(model, config) — freezing the book + placing the stickies in one call
torch.save({"A": ..., "B": ...})  ←→  model.save_pretrained("adapter/") — the stickies ship as a ~20MB file — the book stays home
r = 0 or r = d: both are pathological. r = 0 means no sticky at all — you're just reading the book. r = d means the sticky is the size of the page — no savings, you've just added two strips whose product equals a full reprint; use full fine-tune instead.
Initialising both A and B as Gaussian: one of the most common bugs in hand-rolled implementations. If B is not zero, the sticky arrives pre-scribbled with random noise, the first forward pass corrupts the book's output, the loss spikes, and training is unstable for hundreds of steps. Zero-init B means a blank sticky — you start exactly at the pretrained model.
Shipping without merging: keeping W₀ and B · A separate at inference means two matmuls per layer — the reader opens the book and then reads the sticky. Roughly 1.2–1.5× latency for no reason. Always merge_and_unload() before production-serving — unless you genuinely need per-request sticky swapping.
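The merge itself is one in-place add per page. A minimal sketch with toy dimensions and illustrative names, verifying that the merged page reproduces book-plus-sticky:

```python
import torch
import torch.nn as nn

d, r, scale = 256, 8, 2.0
linear = nn.Linear(d, d, bias=False)   # the book (one page)
A = torch.randn(r, d) * 0.01           # the two strips
B = torch.randn(d, r) * 0.01

x = torch.randn(4, d)
y_unmerged = linear(x) + scale * (x @ A.T) @ B.T   # two matmuls per call

with torch.no_grad():
    linear.weight += scale * (B @ A)   # press the sticky into the page, once

y_merged = linear(x)                   # back to one matmul per call
print("match:", torch.allclose(y_unmerged, y_merged, atol=1e-4))
```

peft's merge_and_unload() does this across every wrapped layer and then strips the wrapper modules away.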
Same rank across every page: the default, but not always optimal. Lower layers (early in the network) often need less override than upper layers for downstream tasks; recent work uses wider r stickies in the top third of the book. If you're bottlenecked on quality and have already tried larger uniform rank, try non-uniform.
Grab a small instruction dataset (tatsu-lab/alpaca, 52k examples, is the canonical toy benchmark). Load Llama-2-7B in bf16 with peft and fine-tune with LoRA at r ∈ {4, 8, 16, 32}, keeping everything else fixed (α = 2r, learning rate 2e-4, 1 epoch, stickies on q and v only).
Log the final eval loss for each. Plot loss vs r. You should see a clear elbow somewhere between 8 and 16 — past the elbow, doubling the sticky width barely moves loss. That elbow is the intrinsic dimensionality of your task showing up in your own training run.
Bonus: at your best r, extend target_modules to include the MLP (gate_proj, up_proj, down_proj) and rerun. How much does sticking every page buy you over attention-only, and is the extra parameter cost worth it?
What to carry forward. Fine-tuning updates are empirically low-rank, so we don't have to reprint the book. LoRA freezes the base, learns two skinny strips A and B whose product is the margin note, and captures 95–100% of full fine-tuning quality at well under 1% of the parameter cost. The two knobs you pick are the sticky's width (rank) and which pages get annotated (target modules). Initialise A Gaussian and B zero so the sticky starts blank, use scale α/r, press the sticky into the page before production unless you need hot-swap.
Next up — QLoRA. LoRA shrank the optimizer state and the gradient storage — the sticky is tiny. But the book itself — W₀, the frozen base — is still sitting in memory in fp16, and for a 70B model that's 140GB of printed ink. QLoRA (Dettmers et al., 2023) does one more thing: quantise the frozen book down to 4 bits per weight — compress the page without losing what it says. The base now fits in ~35GB, the LoRA stickies train on top, and you can fine-tune Llama-70B on a single 48GB GPU. It is, unreasonably, the state of the art for accessible fine-tuning.
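The memory arithmetic behind that teaser, as a back-of-envelope sketch (frozen weights only; activations, gradients, and quantisation constants are ignored):

```python
params = 70e9  # a Llama-70B-scale frozen book
for fmt, bytes_per_weight in [("fp32", 4), ("fp16/bf16", 2), ("4-bit nf4", 0.5)]:
    print(f"{fmt:>10}: {params * bytes_per_weight / 1e9:5.0f} GB")
# fp32: 280 GB · fp16/bf16: 140 GB · 4-bit nf4: 35 GB
```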
- [01] Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen — "LoRA: Low-Rank Adaptation of Large Language Models" · arXiv 2021 — the original LoRA paper
- [02] Aghajanyan, Zettlemoyer, Gupta — "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" · ACL 2021 — why fine-tuning updates are low-rank
- [03] Dettmers, Pagnoni, Holtzman, Zettlemoyer — "QLoRA: Efficient Finetuning of Quantized LLMs" · NeurIPS 2023
- [04] HuggingFace PEFT · library — LoraConfig, get_peft_model, merge_and_unload
- [05] Sheng, Cao, Li, Zhu, Zheng, Gonzalez, Stoica — "S-LoRA: Serving Thousands of Concurrent LoRA Adapters" · MLSys 2024