PyTorch Basics
Tensors, autograd, modules — the mental model.
You've already done the hard version. You wrote a multi-layer perceptron in pure NumPy, derived the backward pass by hand, then stacked layers until the chain rule through an arbitrarily deep net turned into a bookkeeping nightmare. Every new layer meant a new Jacobian to trace by hand. The final 40% of the code was index arithmetic with no mathematical content whatsoever.
Now meet the scribe.
PyTorch ships a scribe that sits next to you while you compute. Every operation you perform on a tensor during the forward pass — a matmul, an add, a squared error — the scribe quietly writes down in a ledger. When you call .backward(), the scribe flips the ledger over and walks it in reverse, handing you every gradient you need. You never write the backward pass again. You write the forward pass like you're computing something ordinary, and the scribe watches.
That's the library. Three moving parts sit around the scribe: the tensor (special ink that leaves a faint record), autograd (the scribe itself — the ledger and the reverse walk), and the module system (pre-fabricated chunks of operations the scribe logs as one entry). Then an optimizer — the person who reads the gradients and turns the dials. Understand those four and you understand 95% of what you'll ever need from any deep-learning framework.
Start with the ink. A tensor is an n-dimensional array — a generalization of a scalar (0-D), vector (1-D), and matrix (2-D) to arbitrarily many dimensions. Every piece of data in a PyTorch program — input images, model weights, loss values, gradients — is a tensor. The API is almost identical to NumPy, with four additions that matter:
- device — a tensor can live on CPU or GPU. Move it with .to("cuda"). This is where the scribe sits; keep all your tensors on the same device or the scribe has to get up and walk.
- dtype — integer, float32, float16, bfloat16, int8 for quantization. Picking the right dtype can halve your memory and double your throughput.
- requires_grad — flip this to True and the scribe starts recording every op you apply. Flip it off and the ink is plain.
- grad_fn — the entry in the ledger. An opaque handle pointing back to the op that produced this tensor, used during the reverse walk.
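A minimal sketch inspecting all four attributes on a live tensor — note that a leaf tensor (one you created directly) has no grad_fn, while the result of an op does:

```python
import torch

# A leaf tensor: created directly, with recording enabled
w = torch.ones(3, 2, requires_grad=True)
print(w.device)         # cpu (unless you moved it)
print(w.dtype)          # torch.float32 by default
print(w.requires_grad)  # True
print(w.grad_fn)        # None — leaves have no producing op

# The product of an op: it carries a grad_fn, the scribe's ledger entry
y = w * 2
print(y.grad_fn)        # e.g. <MulBackward0 object at 0x...>
```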
Reading shapes first is the single most valuable debugging skill in this business. Step through the matmul below and notice how the output shape falls out of the input shapes before any arithmetic happens.
c = a @ b   # shapes: (3, 2) @ (2, 3) → (3, 3)

a (3, 2):
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |

b (2, 3):
| 10 | 20 | 30 |
| 40 | 50 | 60 |

c (3, 3):
|  90 | 120 | 150 |
| 190 | 260 | 330 |
| 290 | 400 | 510 |
I am an n-dimensional array with an attitude. I know which device I live on (CPU by default, CUDA if you ask), what dtype I am (float32 by default, half or bfloat if memory's tight), and whether I am a leaf or the product of an op. If you flip requires_grad=True on me, I will also silently record a trail of every op you apply — so that later, when you call backward(), I can retrace my steps.
That trail is the ledger — in the documentation it's called the computation graph. Every time you apply an op to a tensor with requires_grad=True, PyTorch allocates a new tensor for the result and attaches a grad_fn — the scribe's entry for that op, carrying the exact local derivative. Stacking ops stacks grad_fns. One forward pass, one ledger, page by page.
Then you call loss.backward(). The scribe walks the ledger in reverse — chain rule, applied to every parameter in the graph, in linear time. This is the same multi-layer backprop algorithm you wrote by hand last section, with two differences: the scribe did the bookkeeping, and the scribe doesn't get tired.
Pick an expression below. Step through the forward pass — each node in the graph fills in with its numeric value as the scribe takes dictation. Then step through the backward pass — each node gets a pink ∂L bubble as the scribe hands back the gradient. Your job is to write the forward expression. The scribe writes the backward.
x = torch.tensor(2., requires_grad=True)
y = torch.tensor(3., requires_grad=True)
z = torch.tensor(4., requires_grad=True)
L = (x + y) * z
L.backward()
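Running that expression and reading the gradients off the leaves — the values follow directly from the chain rule (∂L/∂x = z, ∂L/∂y = z, ∂L/∂z = x + y):

```python
import torch

x = torch.tensor(2., requires_grad=True)
y = torch.tensor(3., requires_grad=True)
z = torch.tensor(4., requires_grad=True)
L = (x + y) * z   # L = 5 * 4 = 20
L.backward()      # the scribe walks the ledger in reverse

print(x.grad)  # tensor(4.) — ∂L/∂x = z
print(y.grad)  # tensor(4.) — ∂L/∂y = z
print(z.grad)  # tensor(5.) — ∂L/∂z = x + y
```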
.backward() can only be called once. By default, autograd frees the computation graph after one backward pass to save memory; calling backward a second time raises a RuntimeError. If you need to differentiate through the same graph twice, pass retain_graph=True — but 99% of the time you actually want a fresh forward pass.
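A small sketch of the retain_graph behavior — note that a second backward pass accumulates into .grad rather than overwriting it:

```python
import torch

x = torch.tensor(3., requires_grad=True)
L = x * x                      # dL/dx = 2x = 6

L.backward(retain_graph=True)  # keep the ledger alive for a second walk
L.backward()                   # without retain_graph this would raise RuntimeError

print(x.grad)                  # tensor(12.) — gradients accumulate: 6 + 6
```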
Inplace ops break autograd. x.add_(1) (with the underscore) modifies the tensor in place. If that tensor was part of a computation graph, autograd may refuse to compute gradients or silently give wrong ones. Prefer x = x + 1 unless you know why you want inplace.
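In the simplest case — an in-place op on a leaf that requires grad — autograd refuses outright, which is the loud failure mode. A sketch:

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
try:
    x.add_(1)   # in-place op on a leaf that requires grad
except RuntimeError as e:
    print("autograd refused:", e)

x = x + 1       # out-of-place: a new tensor, the graph stays valid
```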
.detach() breaks the graph. Use it deliberately when you want to stop gradient flow: moving averages, reward baselines, fixed targets. The common bug is calling .detach() where you didn't mean to — suddenly your upstream weights are not getting gradients.
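A sketch of the effect: detach a value and autograd treats it as a constant, so the gradient only flows through the still-attached path.

```python
import torch

x = torch.tensor(2., requires_grad=True)
y = (x * 3).detach()  # y = 6.0, but the link back to x is severed
z = y * x             # only this multiplication is on the ledger
z.backward()

print(x.grad)         # tensor(6.) — just y; without detach, d(3x·x)/dx = 12
```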
Wrap eval-mode code in with torch.no_grad(): to skip graph construction entirely — roughly 2× faster inference and you won't accidentally backprop through your test set.
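Inside the no_grad block the scribe puts the pen down entirely — outputs come back with no grad_fn, as this sketch shows:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
x = torch.randn(8, 4)

with torch.no_grad():      # no graph is built inside this block
    out = model(x)

print(out.requires_grad)   # False
print(out.grad_fn)         # None — nothing for the scribe to record
```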
The scribe will log every little op you write — but real networks have thousands of ops, and you don't want to re-assemble a linear layer out of a matmul and an add every time. Enter nn.Module: a pre-fabricated chunk of operations the scribe logs as one entry. A container that owns tensors, knows which ones are learnable parameters, and defines a forward method. You compose them by nesting. A Linear is a Module. A Sequential of Linears is a Module. A custom class you wrote with five sub-modules is also a Module. Every one offers .parameters(), .to(device), .state_dict(), and .train()/.eval() — the standard API.
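A minimal custom Module, as a sketch — the class and attribute names here are my own, not from the text. Assigning sub-modules as attributes in __init__ is what registers their parameters:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)   # sub-modules assigned as attributes
        self.fc2 = nn.Linear(8, 2)   # are registered automatically

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
# .parameters() sees everything the sub-modules own:
print(sum(p.numel() for p in net.parameters()))  # 4*8+8 + 8*2+2 = 58
```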
The optimizer is the third character. The scribe hands you gradients; the optimizer reads them and adjusts the weights. A single torch.optim.SGD(model.parameters(), lr=0.1) is the packaged version of the update rule from gradient descent. Adam is the same loop with a little more bookkeeping. Either way, three roles: scribe records, scribe walks backward, optimizer steps.
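That optimizer.step() really is just the update rule — a sketch checking one SGD step by hand:

```python
import torch

w = torch.tensor(5., requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = w ** 2       # dloss/dw = 2w = 10
loss.backward()     # fills w.grad
opt.step()          # w ← w − 0.1 · 10

print(w.item())     # 4.0 — exactly the hand-computed update
```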
Put them together and every PyTorch training script in existence reduces to this five-line dance:
optimizer.zero_grad()      # clear the .grad buffer from last step
yhat = model(x)            # forward: build the graph, get predictions
loss = criterion(yhat, y)  # scalar loss — graph extends to this node
loss.backward()            # chain-rule walk — fills every .grad
optimizer.step()           # θ ← θ − α · ∇L for every parameter
Click through it line by line below. Watch each piece of tensor state change as the line executes. This is the same loop you'll use to train GPT-5. What varies across projects is the model, the data, the loss, the optimizer — the loop itself is invariant.
A complete training script, start to finish. One-variable linear regression trained by backprop. The number of lines of boilerplate is genuinely minimal — and every line you write is a line the scribe is watching.
import torch
import torch.nn as nn
# 1. Data — two vectors with a known linear relationship
torch.manual_seed(0)
x = torch.linspace(-2, 2, 100).unsqueeze(-1) # (100, 1)
y = 2 * x + 0.05 * torch.randn_like(x) # y ≈ 2x + noise
# 2. Model — one neuron, one weight, one bias
model = nn.Linear(in_features=1, out_features=1)
# 3. Loss + optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# 4. The five-line loop, repeated.
for step in range(250):
    optimizer.zero_grad()
    yhat = model(x)
    loss = criterion(yhat, y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step:3d}: loss = {loss.item():.4f}")

print(f"learned slope: {model.weight.item():.4f} "
      f"intercept: {model.bias.item():.4f}")

step   0: loss = 8.1326
step 100: loss = 0.0123
step 200: loss = 0.0019
learned slope: 1.9912   intercept: 0.0451
- nn.Linear(1, 1) ←→ weight + bias, both requires_grad=True — the Module handles parameter registration automatically
- model.parameters() ←→ a generator over all learnable tensors — what the optimizer iterates over
- loss.backward() ←→ fills weight.grad and bias.grad — autograd walks the graph, same recurrence as last section
- optimizer.step() ←→ weight -= lr * weight.grad; bias -= lr * bias.grad — SGD in four lines; Adam is eight; the loop code is the same
Two short additions turn the above into a production script. Move the scribe to a GPU:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
y = y.to(device)
model = model.to(device)
# The training loop is unchanged — tensors already know which device they live on.
for step in range(250):
    optimizer.zero_grad()
    yhat = model(x)
    loss = criterion(yhat, y)
    loss.backward()
    optimizer.step()

Every tensor moves to the GPU; the loop is byte-for-byte identical. The scribe doesn't care where the ink is — it just cares that everything stays together. Mix a CPU tensor with a CUDA one and you'll get a polite runtime error telling you to pick a side.
Mixed precision — use fp16 for speed, fp32 for stability:
from torch.amp import autocast, GradScaler
scaler = GradScaler()
for step in range(250):
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.float16):
        yhat = model(x)              # forward in fp16
        loss = criterion(yhat, y)
    scaler.scale(loss).backward()    # scale the loss up so fp16 grads don't underflow
    scaler.step(optimizer)           # unscale + step
    scaler.update()                  # adjust the scale factor

Subclass nn.Module to implement a two-layer MLP (16 hidden units, ReLU) without using nn.Sequential. Register the two linear layers in __init__, implement forward, then train it on y = sin(x) sampled on [-π, π]. Use the five-line loop. Plot the fit after 1000 steps.
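One possible skeleton for that exercise — a sketch, not the only answer; the class and attribute names are my own, and the plotting step is left to you:

```python
import math
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden)   # registered automatically
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

torch.manual_seed(0)
x = torch.linspace(-math.pi, math.pi, 200).unsqueeze(-1)
y = torch.sin(x)

model = MLP()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1000):          # the same five-line dance
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")  # well below the ~0.5 MSE of predicting zero
```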
Bonus: swap SGD for Adam (torch.optim.Adam, default lr 1e-3) and watch it converge roughly 3× faster. That's the single most popular optimizer upgrade in practice.
What to carry forward. The scribe is the whole point of PyTorch. Tensors are the ink it records. Autograd is the ledger and the reverse walk. nn.Module is a pre-fabricated entry. The optimizer is the person who reads the gradients and turns the dials. Every training script is a five-line loop: zero, forward, loss, backward, step. Debug shape-first, value-second. Use .to(device) to move to GPU and autocast for free fp16 speedups.
Next up — Layer Normalization. There is one tool the scribe doesn't give you for free: a way to keep activation distributions stable as they flow through a deep net. Without it, activations drift, gradients explode, and early layers quietly stop learning. Layer normalization is the fix baked into every modern transformer. You'll derive it, visualize the distribution it enforces, and see why the scribe needed help.