ResNet & Skip Connections
Residuals, identity paths, and why depth stopped hurting.
Here is a sentence that sounds like it ought to be free: a deeper network is at least as good as a shallower one. Take your working 20-layer conv net. Paste 14 more layers on top. In the absolute worst case, those extra layers learn the identity function and you reproduce the 20-layer result exactly. Best case, the new layers do something useful. Either way, deeper ≥ shallower. This is just arithmetic.
In 2015, He, Zhang, Ren, and Sun ran the experiment and the arithmetic lost. Deeper nets were worse. And not worse on the test set — worse on the training set, which rules out overfitting as the culprit. The 56-layer net could not even fit the data that its 20-layer twin fit easily. The deeper network had strictly more capacity and strictly worse results. The model was cooked, and no amount of extra epochs unbaked it.
Their fix is one line of arithmetic — y = F(x) + x — and it is one of the most consequential one-liners in the history of deep learning. Every transformer you've heard of uses it. Every vision model past 2016 uses it. Diffusion, U-Nets, Llama, Claude, the thing that autocompletes your code — all of them are built on this addition. The rest of this lesson is why that line works and why plain networks struggled to train much past a couple dozen layers before it existed.
Let's be precise about what He saw. Two plain convolutional nets on CIFAR-10, same design otherwise — a standard stack of convs and pooling — one 20 layers deep, the other 56. Train both, plot training error. The 56-layer net sat above the 20-layer one for the entire run. ImageNet reproduced it: a 34-layer plain net trained worse than an 18-layer plain net. The deeper net contains the shallower net as a sub-solution (set the extra layers to identity and you're done), but the optimizer can't find that sub-solution. It just wanders off to somewhere worse.
This is the degradation problem, and it is not the same as vanishing gradients — though the two are cousins. You can turn on ReLU, He init, batch norm, every trick from the last fifteen lessons, and the 56-layer plain net still loses to its 20-layer sibling. Something about stacking many nonlinear layers makes the loss surface genuinely hostile. Identity is in there somewhere, but SGD can't walk to it from where it starts.
Their insight was a reframe. Don't ask the layer to learn the output H(x) from scratch. Ask it to learn the difference between the input and the desired output — the residual F(x) = H(x) − x — and then add x back at the end. If the right answer at this layer happens to be "leave it alone", then F just has to learn to output roughly zero. Pushing weights toward zero is the easiest thing an optimizer ever does. The identity solution is now the lazy default, not the needle hidden in a haystack.
Plain block: y = F(x ; W)
Residual block: y = F(x ; W) + x
└── learned ──┘ └ identity shortcut ┘
If the optimal y is exactly x, the plain net must learn F to be an
identity function — hard. The residual net just needs F → 0 — easy.

Why does this help optimization specifically? The cleanest argument is in the backward pass. Write the network as a stack of residual blocks: block ℓ receives x_ℓ and outputs x_(ℓ+1) = x_ℓ + F(x_ℓ ; W_ℓ). Now do multi-layer backprop across two adjacent blocks. The chain rule does something strange and wonderful.
Plain stack: ∂L/∂x_ℓ = ∂L/∂x_(ℓ+1) · ∂F/∂x_ℓ
│
└── can be tiny → vanishing
Residual stack: x_(ℓ+1) = x_ℓ + F(x_ℓ)
∂L/∂x_ℓ = ∂L/∂x_(ℓ+1) · ( 1 + ∂F/∂x_ℓ )
│
└── the "+1" is the escape hatch
Unroll across L blocks:
∂L/∂x_0 = ∂L/∂x_L · Π_{ℓ=0}^{L-1} ( 1 + ∂F/∂x_ℓ )
Even if every ∂F/∂x_ℓ is near zero, the 1s keep the product alive.

Stare at the second line. In a plain stack, every layer contributes one local Jacobian, and the gradient at layer 0 is the product of all of them. If the average Jacobian magnitude is even a little less than 1, that product decays exponentially in depth — a hundred layers of 0.9 each is a factor of 0.9¹⁰⁰ ≈ 3 × 10⁻⁵. The gradient showing up at the first layer is a rounding error. The layer trains at roughly the speed of continental drift.
In the residual stack, every layer contributes (1 + ∂F/∂x). Even when ∂F/∂x collapses to zero, the 1 stays put. The gradient has a guaranteed lane — the express elevator — and it takes the elevator down to the lobby no matter how many useless floors sit on either side. This is exactly why ResNet-152 trains and plain-152 doesn't: the shortcut is not a metaphor, it's a term in the derivative that refuses to vanish.
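The unrolled product is easy to simulate. Here is a quick sketch — the dimension, depth, and 0.05 scale are illustrative choices of mine, not numbers from the paper — compounding 100 layers' worth of random local Jacobians both ways:

```python
import numpy as np

# Compound 100 layers' local Jacobians. Each J_l is a small random matrix —
# the regime where plain-stack gradients starve.
rng = np.random.default_rng(0)
d, depth = 8, 100

plain = np.eye(d)      # will become Π J_l
residual = np.eye(d)   # will become Π (I + J_l)
for _ in range(depth):
    J = rng.normal(0, 0.05, size=(d, d))   # local Jacobian of F at layer l
    plain = J @ plain
    residual = (np.eye(d) + J) @ residual

print(f"plain    |dx_L/dx_0| = {np.linalg.norm(plain):.3e}")    # vanishingly small
print(f"residual |dx_L/dx_0| = {np.linalg.norm(residual):.3e}")  # stays alive
```

The plain product underflows toward zero; the residual product stays at a usable magnitude, because every factor carries the identity along with it.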
Here is a single residual block with the shortcut drawn in. Toggle the skip on and off and watch the gradient flow. With the skip, the gradient arriving at the block's input is whatever came from above, plus whatever squeezed through F. Without it, you have only the F path — and if F multiplies the signal by something small (as poorly-initialized stacks love to do), the gradient is dead on arrival. The elevator is either running or it isn't.
I am the wire that goes around. I do nothing to the forward signal — I just pass x straight through — but on the backward pass I am the gradient's escape hatch. However badly the residual function mangles its gradient, I guarantee the full upstream signal still reaches the bottom of the block. Stack a hundred of me and the gradient at layer 0 still has a clean line to the loss. I am the reason depth works.

The original paper has one plot that, once you've seen it, you never forget. Final accuracy versus depth, two curves, same axes. The plain curve rises for a while, peaks somewhere around 18 to 20 layers, then turns and dives — each additional layer makes the net measurably worse. The residual curve just keeps climbing. Same data, same optimizer, same compute budget. The only difference is three symbols, + x, sprinkled through the architecture.
Drag the depth slider. Up to about 18 layers the two curves are practically overlapping — plain nets do fine at modest depth, no elevator needed. Push past 20 and the plain curve starts bleeding accuracy; by 50 layers it's a mess. The residual curve goes the other way, still improving through 50, 101, 152. On ImageNet, an ensemble of these residual nets reached 3.57% top-5 error — below a widely-cited human estimate of about 5%. Deep networks finally did the thing the math had always said they should be able to do.
I am the humble correction. I don't have to learn the whole mapping — the identity path carries the bulk of the signal. I just have to learn the small delta that makes the output better than the input. If the layer should do nothing, my weights go to zero and I get out of the way. If the layer should do something, I learn exactly what to add. I am what makes 152 layers trainable.
There's a practical wrinkle. The shortcut y = F(x) + x only works if F(x) and x have the same shape. Inside a CNN, stages change channel count and spatial resolution — so mid-network the elevator occasionally needs to let a different-sized box onto the next floor. Two options:
- Identity shortcut. When shapes match, do nothing — literally x + F(x). Zero parameters, zero FLOPs, just an add. This is the default inside a stage, and it's the cheapest trick in the building.
- Projection shortcut. When channel or spatial dims change (at stage boundaries), the shortcut becomes a 1×1 convolution that reshapes x to match F(x)'s output shape. This adds a handful of parameters and lets you down-sample cleanly.
Both show up in every ResNet you'll ever read: identity wherever possible (cheap, well-behaved), projection at the stage boundaries (unavoidable). The He paper also tried parameter-free alternatives like zero-padding the extra channels, and found projections won by a small but consistent margin.
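A minimal sketch of the two options side by side — `make_shortcut` is a helper name invented here, not torchvision API:

```python
import torch
import torch.nn as nn

def make_shortcut(in_ch, out_ch, stride):
    """Identity when shapes match; 1x1 projection at a stage boundary."""
    if stride == 1 and in_ch == out_ch:
        return nn.Identity()   # zero parameters, zero FLOPs
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False)

x = torch.randn(1, 64, 32, 32)

same = make_shortcut(64, 64, stride=1)    # inside a stage
proj = make_shortcut(64, 128, stride=2)   # stage boundary: 64→128, spatial /2

print(same(x).shape)  # torch.Size([1, 64, 32, 32]) — x passes through untouched
print(proj(x).shape)  # torch.Size([1, 128, 16, 16]) — reshaped to match F(x)
print(sum(p.numel() for p in proj.parameters()))  # 64·128 = 8192 params
```

The projection is the "handful of parameters" from the list above: one 1×1 kernel per input–output channel pair, nothing more.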
The ResNet paper shipped five off-the-shelf depths. Each is a stack of residual blocks split into four stages, where every stage halves the spatial resolution and doubles the channel count. You pick the depth you can afford and start training.
            block   stage1  stage2  stage3  stage4  total  params
ResNet-18   [2x]       2       2       2       2     18    11.7M
ResNet-34   [2x]       3       4       6       3     34    21.8M
ResNet-50   [3x]       3       4       6       3     50    25.6M  (bottleneck)
ResNet-101  [3x]       3       4      23       3    101    44.5M  (bottleneck)
ResNet-152  [3x]       3       8      36       3    152    60.2M  (bottleneck)

[2x] = basic block (3×3 conv → 3×3 conv, two layers)
[3x] = bottleneck (1×1 reduce → 3×3 → 1×1 expand, three layers)
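The total-depth column is pure arithmetic you can check yourself — one stem conv, two or three convs per block, one final fc layer:

```python
# Sanity-check the layer counts: depth = stem conv + (blocks × convs/block) + fc.
configs = {
    "ResNet-18":  ([2, 2, 2, 2], 2),   # basic block: 2 convs each
    "ResNet-34":  ([3, 4, 6, 3], 2),
    "ResNet-50":  ([3, 4, 6, 3], 3),   # bottleneck: 3 convs each
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}
for name, (stages, convs_per_block) in configs.items():
    depth = 1 + sum(stages) * convs_per_block + 1
    print(f"{name}: {depth}")  # matches the number in the name
```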
The bottleneck block is a parameter-efficiency move. Squeeze the channels down with a 1×1 conv, do the expensive 3×3 conv in the shrunken space, then 1×1 back up. Three layers instead of two, but cheaper than two full-width 3×3 convs. It's how ResNet-50, despite carrying sixteen more layers, costs only about 4M more parameters than ResNet-34.
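As a sketch — expansion factor 4 as in the paper, but this is my minimal version, not the torchvision implementation — a bottleneck block in the same post-activation style looks like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1×1 reduce → 3×3 in the narrow space → 1×1 expand, plus the shortcut."""
    expansion = 4

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)     # squeeze
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride,
                               padding=1, bias=False)            # cheap 3×3
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)    # expand
        self.bn3 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.shortcut(x))

block = Bottleneck(256, 64)              # ResNet-50 stage-1 shape: 256 → 64 → 256
x = torch.randn(2, 256, 56, 56)
print(block(x).shape)                    # torch.Size([2, 256, 56, 56])
```

The 3×3 conv runs on 64 channels instead of 256 — that 4× squeeze is where the savings come from.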
One more detail, and it turns out to matter a lot. The original 2015 block does conv → BN → ReLU → conv → BN → (+x) → ReLU — the final ReLU sits after the addition. A year later He et al. published a follow-up showing that moving BN and ReLU to the start of each branch — BN → ReLU → conv → BN → ReLU → conv → (+x) — gave a cleaner gradient path. The identity branch is now pure identity, with no ReLU lurking to clip negative gradients on the way back. With that one change they trained a 1001-layer network to convergence.
Post-activation: x ──┬──→ conv → BN → ReLU → conv → BN ─┐
│ ⊕ ── ReLU → x'
└────────── identity ───────────────┘
Pre-activation: x ──┬──→ BN → ReLU → conv → BN → ReLU → conv ─┐
│ ⊕ ── x'
└──────────── pure identity ────────────────┘
The pre-activation variant's identity branch has nothing between
x and the addition — gradient flow is perfectly clean. It's what
every modern transformer ("Pre-LN") descended from.

Three implementations. NumPy makes the gradient math concrete — you can see the + 1 term show up as np.eye(8), literally, in ten lines. PyTorch writes a residual block as a normal nn.Module with a two-line forward pass. Torchvision hands you the full ImageNet ResNet-18 pretrained, which is how you actually use a ResNet in 2026 — you download it.
import numpy as np
# A toy residual block: F(x) = W2 · ReLU(W1 · x).
# Compare the gradient w.r.t. the input for a plain stack vs a residual stack.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, size=(8, 8)) # tiny init — mimics deep-net pathology
W2 = rng.normal(0, 0.01, size=(8, 8))
x = rng.normal(0, 1.0, size=(8,))
# Forward
z = W1 @ x
a = np.maximum(0, z) # ReLU
F = W2 @ a # the residual function
# Upstream gradient (what the layer above hands us). Say it's 1 everywhere.
grad_out = np.ones(8)
# ---- Plain: y = F(x) ----
# ∂y/∂x = W2 · diag(ReLU'(z)) · W1
relu_mask = (z > 0).astype(float)
jac_F = W2 @ np.diag(relu_mask) @ W1 # (8, 8)
grad_x_plain = jac_F.T @ grad_out
# ---- Residual: y = F(x) + x ----
# ∂y/∂x = I + jac_F
jac_res = np.eye(8) + jac_F
grad_x_res = jac_res.T @ grad_out
print(f"∂L/∂x (plain) = {np.linalg.norm(grad_x_plain):.6f}")
print(f"∂L/∂x (residual) = {np.linalg.norm(grad_x_res):.6f}")
print(f"ratio = {np.linalg.norm(grad_x_res) / np.linalg.norm(grad_x_plain):.1f}×")
# The "1" from the identity path dominates. This is the entire point of ResNet
# expressed in ten lines of numpy.

∂L/∂x (plain)    = 0.000046
∂L/∂x (residual) = 1.000046
ratio            = 21817.5×
∂L/∂x_ℓ = ∂L/∂x_(ℓ+1) · ∂F/∂x_ℓ  ←→  jac_F.T @ grad_out — plain stack: gradient rides only the F path
∂L/∂x_ℓ = ∂L/∂x_(ℓ+1) · (1 + ∂F/∂x_ℓ)  ←→  (np.eye(8) + jac_F).T @ grad_out — residual stack: the "+1" becomes np.eye, literally
tiny init ⇒ ∂F/∂x near zero  ←→  ratio ≈ 1e4× in favour of residual — the deeper you go, the more this ratio compounds
The ratio printed above is not a metaphor. The plain gradient is starving; the residual gradient is roughly 1, because the np.eye(8) is carrying it. That number is exactly what the math promised, and it's the whole reason deep nets train. Now scale the idea up to something you'd actually write in a model file.
import torch
import torch.nn as nn
import torch.nn.functional as F
class BasicBlock(nn.Module):
"""The ResNet basic block — two 3×3 convs, one skip connection."""
expansion = 1
def __init__(self, in_ch, out_ch, stride=1):
super().__init__()
self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(out_ch)
self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_ch)
# Projection shortcut only when shapes change
if stride != 1 or in_ch != out_ch:
self.shortcut = nn.Sequential(
nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
nn.BatchNorm2d(out_ch),
)
else:
self.shortcut = nn.Identity() # zero-cost skip
def forward(self, x):
out = F.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
out = out + self.shortcut(x) # ← the add. this is the whole trick.
return F.relu(out) # post-activation variant
# Sanity check: shapes line up, gradients flow.
block = BasicBlock(64, 128, stride=2)
x = torch.randn(4, 64, 32, 32, requires_grad=True)
y = block(x)
print("in ", x.shape, "→ out", y.shape)
y.sum().backward()
print("grad on input:", x.grad.norm().item())  # non-zero, finite

out = F(x) ; out = out + x  ←→  out = out + self.shortcut(x) — shortcut is either Identity() or a 1×1 conv, same interface
jac_F explicitly computed  ←→  autograd walks the block at .backward() — autograd handles the (1 + ∂F/∂x) for free
BatchNorm nowhere to be found  ←→  nn.BatchNorm2d after every conv — real ResNets need BN to actually train; a later lesson
The only line that matters in that whole block is out = out + self.shortcut(x). Delete it and the network stops being trainable past a dozen layers. Keep it and you can go to 152, 1001, whatever you want. Every layer in every ResNet you'll ever touch comes down to that single add.
import torch
from torchvision import models, transforms
from PIL import Image
# One function call. The model, the weights, fifty years of research.
weights = models.ResNet18_Weights.IMAGENET1K_V1
model = models.resnet18(weights=weights).eval()
print(f"ResNet-18 loaded — {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M params")
# Inference on one image
preprocess = weights.transforms() # the exact preprocessing used at training
img = Image.open("golden_retriever.jpg").convert("RGB")
x = preprocess(img).unsqueeze(0) # (1, 3, 224, 224)
with torch.no_grad():
logits = model(x)
probs = torch.softmax(logits, dim=-1).squeeze()
top5 = torch.topk(probs, 5)
labels = weights.meta["categories"]
preds = [(labels[i], p.item()) for p, i in zip(top5.values, top5.indices)]
print("top predictions:", preds[:3])

ResNet-18 loaded — 11.7M params
top predictions: [('golden retriever', 0.871), ('Labrador', 0.094), ('tennis ball', 0.013)]

you write BasicBlock by hand  ←→  models.resnet18() stacks 8 of them for you — conv1 + 4 stages × [2, 2, 2, 2] basic blocks + fc = 18 layers
random init — have to train from scratch  ←→  weights=ResNet18_Weights.IMAGENET1K_V1 — pretrained on 1.28M ImageNet images, a free starting point
think about learning-rate schedules  ←→  weights.transforms() — even the preprocessing matches — in 2026 you almost never train ResNet from scratch
Shape mismatch at the shortcut. If your residual branch changes channel count (e.g. 64 → 128) or downsamples the spatial dim (stride=2), the identity path no longer matches. x + F(x) will throw a shape error. Fix: a 1×1 conv projection shortcut on x (stride matching F's, output channels matching F's). Every stage boundary in every ResNet needs one.
BN placement matters a lot. Post-activation (2015): conv → BN → ReLU → conv → BN → (+x) → ReLU. Pre-activation (2016): BN → ReLU → conv → BN → ReLU → conv → (+x). For very deep nets (100+ layers) pre-activation trains markedly better. For 18–50 layers the difference is small. Know which you're writing and don't mix them.
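For concreteness, here is a hedged sketch of a pre-activation basic block — my minimal version, not the paper's reference code. Note that the add happens last, with nothing after it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """2016-style pre-activation block: BN → ReLU → conv, twice, then + x.
    The shortcut path stays a pure identity — no ReLU touches it."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + self.shortcut(x)   # no activation after the add

blk = PreActBlock(64, 64)
x = torch.randn(2, 64, 16, 16)
print(blk(x).shape)                     # torch.Size([2, 64, 16, 16])
```

Compare with the BasicBlock earlier: same two convs, same shortcut logic, but the activations moved inside the branch and the final ReLU is gone.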
The add goes before the final ReLU. A common bug is ReLU(F(x)) + x, which defeats the point — now the shortcut gets clipped whenever x goes negative. The canonical block is ReLU(F(x) + x). The ReLU is on the sum, not on one branch.
Don't forget bias=False in convs before BN. BatchNorm has its own shift parameter. A bias on the conv is redundant at best and a minor training slowdown at worst. All the nn.Conv2ds in the reference ResNet set bias=False.
Build two small networks in PyTorch for CIFAR-10. Net A: an 8-layer plain conv net — eight 3×3 conv → BN → ReLU blocks stacked, with a stride-2 downsample every two blocks, ending in a global-average-pool and a linear head. Net B: the same thing but with residual skips — group the blocks into 3 stages of 2 basic blocks each (your own ResNet-8).
Train both for 30 epochs with SGD, lr = 0.1, momentum 0.9, weight decay 5e-4, cosine schedule. Plot training loss and test accuracy side by side. At 8 layers the gap is modest — a few points, maybe. Now repeat at 20 layers. The plain net's training loss will be noticeably worse than at 8 layers; the residual net's will keep improving. You've just reproduced the central experiment of the ResNet paper on a laptop.
Bonus: swap the basic block for the bottleneck variant and try 50 layers. Watch the parameter count drop relative to a 20-layer basic-block net.
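If you want a starting point, here is one possible skeleton — `Block` and `make_net` are names invented for this sketch, and the training loop is left to you. The single `residual` flag switches between Net A and Net B:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """conv → BN → ReLU → conv → BN, with an optional skip connection."""
    def __init__(self, in_ch, out_ch, stride, residual):
        super().__init__()
        self.residual = residual
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if residual and (stride != 1 or in_ch != out_ch):
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.residual:
            out = out + self.shortcut(x)   # the only difference between A and B
        return F.relu(out)

def make_net(residual, widths=(16, 32, 64), blocks_per_stage=2, num_classes=10):
    layers, in_ch = [], 3
    for i, w in enumerate(widths):
        for j in range(blocks_per_stage):
            stride = 2 if (i > 0 and j == 0) else 1   # downsample at stage entry
            layers.append(Block(in_ch, w, stride, residual))
            in_ch = w
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(in_ch, num_classes))

net_a = make_net(residual=False)   # plain
net_b = make_net(residual=True)    # residual
x = torch.randn(2, 3, 32, 32)      # CIFAR-10 shape
print(net_a(x).shape, net_b(x).shape)   # both torch.Size([2, 10])
```

Deepen both by raising `blocks_per_stage` and watch the gap open up, exactly as the lesson describes.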
What to carry forward. Deep plain nets degrade past ~20 layers not because capacity runs out, but because SGD can't find the identity-plus-tweak solution from a cold start. Residual blocks — y = F(x) + x — reframe the problem: the elevator shaft makes identity the lazy default, and F only has to learn the small correction on top. The gradient inherits a permanent bypass route — (1 + ∂F/∂x) at every layer — and so the signal survives arbitrary depth. Projection shortcuts patch up the shape mismatches at stage boundaries; pre-activation blocks keep the elevator wall clean all the way down; and this exact pattern is what every transformer, U-Net, and diffusion model has been standing on top of ever since.
Next up — Recurrent Neural Networks. Look back at everything you've built so far: linear layers, conv nets, ResNets. They all eat the input in one bite. The whole image, all 224×224 pixels of it, hits layer one at once. But what about input that arrives one step at a time — a sentence arriving word by word, a melody arriving note by note, a trajectory arriving tick by tick? You'd need a network with a sense of before and after, a little bit of memory that persists from one step to the next. That's the whole next section. We start with the simplest possible version — a single hidden state that remembers — and discover, in the process, exactly the vanishing-gradient problem that skip connections just solved, but this time running through time instead of depth.
- [01] He, Zhang, Ren, Sun · CVPR 2016 — the original ResNet paper
- [02] He, Zhang, Ren, Sun · ECCV 2016 — the pre-activation follow-up
- [03] Zhang, Lipton, Li, Smola · d2l.ai
- [04] PyTorch team · the one you'll actually use