Build a CNN from Scratch
A LeNet-style conv net, one layer at a time in NumPy.
You have the ingredients. A convolution — a learned filter sliding across an image, firing on local patterns. A pool — a subsampler that halves the resolution and hands the next layer a bigger view of the world. A ReLU wedged between them so the whole thing isn't secretly linear. By themselves, each of those is a parlor trick. Stack them, and you get a machine that reads images.
That stack has a shape, and the shape is the whole point. Picture a cake, baked upside down. The top layer — the one touching the input image — is wide and shallow: 28×28 pixels of spatial area, three channels if it's color, one if it's grayscale. Each layer down, the cake gets narrower spatially and deeper in channels. 14×14×32. 7×7×64. 4×4×128. The last slice is a dense little cube of features, and that cube gets flattened and handed to a classifier that decides what it was looking at. Every conv trades pixels for features. Every pool trades “where exactly” for “what kind.”
The target of this lesson is LeNet-5. Yann LeCun, 1998. The paper that drew the convnet the way every whiteboard still draws it — conv, pool, conv, pool, flatten, dense, dense, done. Five trainable layers, about sixty thousand parameters, and it hit 99%+ on MNIST back when the competition was bolting handcrafted features onto SVMs. By the end of this page you will have built it, watched its shapes tumble down the stack, peeked at what its neurons actually see, and trained a PyTorch version to LeCun's 1998 accuracy in a few minutes of training.
One benchmark to frame the stakes. The 2-layer MLP from the last section tops out around 97.5% on MNIST with roughly 100K parameters. LeNet does better with fewer. That gap is the whole argument for convolution, and it's about to land.
I was the first convnet to really work. Handwritten digits on bank cheques, running on a machine slower than the phone in your pocket, and I beat the humans. Everything you see today — ResNet, EfficientNet, ConvNeXt — is me with more layers, better normalization, and a GPU.
Here is the cake, sliced from top to bottom. Read it as a descent: the input is a wide spatial map with almost no channels; the output is a narrow map with many channels, eventually a plain vector, and then a prediction. Every arrow is a layer; every box is a tensor with a shape; every shape tells you how much spatial resolution has collapsed into how many channels so far.
input image (1, 28, 28) — one grayscale channel
│
▼ Conv2d(in=1, out=6, k=5) + ReLU
(6, 24, 24) — 6 feature maps, edges & strokes
│
▼ MaxPool2d(k=2, s=2)
(6, 12, 12) — half resolution, same channels
│
▼ Conv2d(in=6, out=16, k=5) + ReLU
(16, 8, 8) — 16 feature maps, compound patterns
│
▼ MaxPool2d(k=2, s=2)
(16, 4, 4) — receptive field now 16×16 of input
│
▼ flatten
(256,) — 16 · 4 · 4 = 256 numbers per image
│
▼ Linear(256 → 120) + ReLU
(120,)
│
▼ Linear(120 → 84) + ReLU
( 84,)
│
▼ Linear( 84 → 10) (logits for classes 0..9)
( 10,)
│
▼ softmax → p(class | image)

Build the cake yourself. Drag conv and pool blocks into the stack, pick kernel sizes, and watch the output shape and parameter count update per layer. (The widget defaults to a 3×32×32 color input, so its numbers differ from the 28×28 grayscale walkthrough above.) The floor of the widget is the total-param and total-FLOP counter — a quick way to feel which slices of the cake are expensive and which are nearly free.
| # | layer | output shape | params | FLOPs |
|---|---|---|---|---|
| 0 | input | (3, 32, 32) | 0 | 0 |
| 1 | conv 5×5 | (6, 28, 28) | 456 | 705.6K |
| 2 | pool 2×2 | (6, 14, 14) | 0 | 0 |
| 3 | conv 5×5 | (16, 10, 10) | 2.4K | 480.0K |
| 4 | pool 2×2 | (16, 5, 5) | 0 | 0 |
| 5 | fc 120 | (120,) | 48.1K | 96.0K |
| 6 | fc 84 | (84,) | 10.2K | 20.2K |
| 7 | fc 10 | (10,) | 850 | 1.7K |
Two things to notice while you drag. First, convolution layers are shockingly cheap in parameters. A Conv2d(6, 16, k=5) has 6·16·5·5 + 16 = 2,416 parameters — total. A single Linear(256, 120) has 256·120 + 120 = 30,840, more than twelve times as many. LeNet's two conv layers combined use about 2.6K params; its three FC layers use about 58K. The convnet's weights are almost all hiding in the FC head. The feature extractor — the part everyone talks about — is the skinny one.
Second, FLOPs tell the opposite story. A single conv layer is dirt cheap in parameters but expensive in multiply-adds, because each kernel is applied at every spatial position. Parameters and compute are decoupled for convolutions in a way they simply are not for fully connected layers. This is why convnets scale: you can grow the feature-extraction capacity without blowing up the parameter budget. It's also why your GPU fan turns on during training even though the model is “only” 60K parameters.
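To put numbers on that decoupling, here is a quick sketch using the standard counting formulas (conv_cost and linear_cost are names made up for this example, not library calls), comparing LeNet's second conv layer against its first FC layer at MNIST shapes:

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameter count and multiply-add FLOPs for one conv layer."""
    params = c_out * (c_in * k * k + 1)               # +1 bias per filter
    flops = 2 * c_out * c_in * k * k * h_out * w_out  # kernel applied at every position
    return params, flops

def linear_cost(n_in, n_out):
    """Same counts for a fully connected layer."""
    params = n_out * (n_in + 1)
    flops = 2 * n_in * n_out                          # each weight used exactly once
    return params, flops

print(conv_cost(6, 16, 5, 8, 8))   # (2416, 307200)  - few params, many FLOPs
print(linear_cost(256, 120))       # (30840, 61440)  - many params, few FLOPs
```

The conv layer has roughly a thirteenth of the FC layer's parameters but five times its FLOPs, because every one of its 2,416 weights is reused at all 64 output positions.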
Before we code this, burn the shape arithmetic into your hands. This is the one formula you will use more than any other in the next four lessons. For a 2D convolution with input spatial size H, kernel k, padding p, and stride s, the output size along that axis is:
H_out = ⌊(H_in + 2p − k) / s⌋ + 1
W_out = ⌊(W_in + 2p − k) / s⌋ + 1
params = C_out · (C_in · k · k + 1)    (the "+1" is the bias per filter)
FLOPs ≈ 2 · C_out · C_in · k · k · H_out · W_out
Plug in LeNet's first conv. H = 28, k = 5, p = 0, s = 1:
Conv1: H = (28 + 0 − 5)/1 + 1 = 24 → (6, 24, 24)
Pool1: H = 24 / 2 = 12 → (6, 12, 12)
Conv2: H = (12 + 0 − 5)/1 + 1 = 8 → (16, 8, 8)
Pool2: H = 8 / 2 = 4 → (16, 4, 4)
Flatten: 16·4·4 = 256
FC1: 256 → 120
FC2: 120 → 84
FC3: 84 → 10
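The floor-divide arithmetic is easy to script. A minimal sketch, with conv2d_out as a made-up helper name, walking LeNet's stack from 28 down to the 256-dim flatten:

```python
def conv2d_out(h, k, p=0, s=1):
    """Output size along one axis: floor((h + 2p - k) / s) + 1.
    A 2x2 stride-2 max pool is the same formula with k=2, s=2."""
    return (h + 2 * p - k) // s + 1

h = 28                        # MNIST input
h = conv2d_out(h, k=5)        # conv1 -> 24
h = conv2d_out(h, k=2, s=2)   # pool1 -> 12
h = conv2d_out(h, k=5)        # conv2 -> 8
h = conv2d_out(h, k=2, s=2)   # pool2 -> 4
print(16 * h * h)             # flatten: 16 channels * 4 * 4 = 256
```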
I am the patch of the input image that a neuron in layer k can actually see. In conv1, I am just a 5×5 window. After pool1 I grow to 6×6. After conv2 I jump to 14×14. After pool2, 16×16 — by the end of feature extraction, a single neuron in the 4×4 map covers most of a centered digit. That expansion is why deeper layers can recognize whole-digit shapes while shallow layers only see strokes.
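That growth follows a standard recurrence: the receptive field r grows by (k − 1)·j at every layer, where j is the product of all strides so far. A sketch (receptive_fields is a hypothetical helper, not a library function):

```python
def receptive_fields(layers):
    """Receptive field after each (kernel, stride) layer:
    r <- r + (k - 1) * j, then j <- j * s, starting from r = j = 1."""
    r, j, sizes = 1, 1, []
    for k, s in layers:
        r += (k - 1) * j
        j *= s
        sizes.append(r)
    return sizes

# conv1, pool1, conv2, pool2 as (kernel, stride) pairs
print(receptive_fields([(5, 1), (2, 2), (5, 1), (2, 2)]))  # [5, 6, 14, 16]
```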
Now actually look at what the network has learned. Pick a layer, pick an input digit, see the feature maps at that depth. Conv1 maps will look like edges — pen strokes at various orientations. Conv2 maps are already abstract; some fire on whole loops, some on corners, some on things your visual cortex would refuse to name. The FC layers don't render as 2D maps (they're just vectors), which is itself the point — spatial structure lives in the conv stack and gets discarded the moment you flatten.
This is the feature hierarchy everyone talks about. Early layers learn generic local statistics — edges, strokes, blobs. Middle layers compose those into textured patterns: loops, T-junctions, corners. Late layers compose those into object-level concepts — “this looks like an 8.” Nobody told the network to organize itself this way. It falls out, reliably, from training a stack of conv+pool with a classification loss. The deeper into the cake you go, the more each neuron is about the object rather than about pixels.
On ImageNet-scale networks this hierarchy gets theatrical: layer 1 is Gabor filters, layer 5 is textures, layer 20 has neurons that fire on faces, or on written text, or on bodies of water. LeNet on MNIST only has two conv layers so the story is muted — but the mechanism is identical, and the shape of the cake is the same.
I am what emerges when you stack me. Layer one: edges and strokes. Layer two: corners, curves, simple shapes. Layer five: object parts — eyes, wheels, wings. Layer twenty: whole objects. Nobody designed me this way. I am the free lunch that falls out of conv + pool + backprop.
Time to build it. Pure Python is not happening for a convnet — a single forward pass of LeNet is already a few hundred thousand multiply-adds, and pushing them through four nested Python loops is orders of magnitude slower than vectorized code. So this lesson runs the progression as NumPy → PyTorch, with the NumPy version as a working reference implementation you can read end to end, and PyTorch as the version you'd actually train.
import numpy as np
def conv2d(x, W, b):
"""x: (C_in, H, W_in), W: (C_out, C_in, k, k), b: (C_out,).
No padding, stride 1 — vanilla LeNet."""
C_out, C_in, k, _ = W.shape
_, H, W_in = x.shape
H_out, W_out = H - k + 1, W_in - k + 1
out = np.zeros((C_out, H_out, W_out))
for i in range(H_out):
for j in range(W_out):
patch = x[:, i:i+k, j:j+k] # (C_in, k, k)
# inner product against every filter
out[:, i, j] = (W * patch).sum(axis=(1, 2, 3)) + b
return out
def relu(x):
return np.maximum(0, x)
def maxpool2d(x, k=2, s=2):
C, H, W = x.shape
H_out, W_out = H // s, W // s
out = np.zeros((C, H_out, W_out))
for i in range(H_out):
for j in range(W_out):
out[:, i, j] = x[:, i*s:i*s+k, j*s:j*s+k].max(axis=(1, 2))
return out
def linear(x, W, b):
return W @ x + b
# Random weights — in practice these come from training, not from a seed.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (6, 1, 5, 5)); b1 = np.zeros(6)
W2 = rng.normal(0, 0.1, (16, 6, 5, 5)); b2 = np.zeros(16)
W3 = rng.normal(0, 0.1, (120, 256)); b3 = np.zeros(120)
W4 = rng.normal(0, 0.1, (84, 120)); b4 = np.zeros(84)
W5 = rng.normal(0, 0.1, (10, 84)); b5 = np.zeros(10)
def lenet_forward(x):
a = relu(conv2d(x, W1, b1)); print("after conv1+relu:", a.shape)
a = maxpool2d(a); print("after pool1: ", a.shape)
a = relu(conv2d(a, W2, b2)); print("after conv2+relu:", a.shape)
a = maxpool2d(a); print("after pool2: ", a.shape)
a = a.reshape(-1); print("after flatten: ", a.shape)
a = relu(linear(a, W3, b3))
a = relu(linear(a, W4, b4))
return linear(a, W5, b5) # logits
x = rng.normal(0, 1, (1, 28, 28))
print("input shape:", x.shape)
logits = lenet_forward(x)
print("logits: ", logits.shape)
print("predicted digit: ", int(np.argmax(logits)))

input shape: (1, 28, 28)
after conv1+relu: (6, 24, 24)
after pool1:      (6, 12, 12)
after conv2+relu: (16, 8, 8)
after pool2:      (16, 4, 4)
after flatten:    (256,)
logits:           (10,)
predicted digit:  7
Every layer lives in twenty lines. The cake is right there in the print statements — each print is one horizontal slice of the architecture, and the shapes narrow and channel-count grows exactly like the diagram promised. What's missing is training. A NumPy backward pass through conv2d and maxpool2d is doable — it is also another 150 lines and intolerably slow. That is exactly the bargain PyTorch is selling.
for i, j in H×W: patch = x[:, i:i+k, j:j+k] ←→ im2col: reshape all patches into one big matrix — turns the conv into a single giant matmul; this is what cuDNN does under the hood
np.zeros, then assign per position ←→ W_flat @ patches_matrix — a single BLAS call, 100–1000× faster than the Python loop
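Here is a minimal im2col version of the earlier conv2d — a sketch, not cuDNN's actual kernel, and conv2d_im2col is a made-up name. It still gathers patches with a Python loop, but the convolution itself becomes one matmul:

```python
import numpy as np

def conv2d_im2col(x, W, b):
    """Same convolution as the loop version, restructured as one matmul.
    x: (C_in, H, W_in), W: (C_out, C_in, k, k), b: (C_out,)."""
    C_out, C_in, k, _ = W.shape
    _, H, W_in = x.shape
    H_out, W_out = H - k + 1, W_in - k + 1
    # im2col: one column of C_in*k*k values per output position
    cols = np.empty((C_in * k * k, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[:, i:i+k, j:j+k].ravel()
    out = W.reshape(C_out, -1) @ cols + b[:, None]   # the single matmul
    return out.reshape(C_out, H_out, W_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 28, 28))
W = rng.normal(size=(6, 1, 5, 5))
b = np.zeros(6)
print(conv2d_im2col(x, W, b).shape)  # (6, 24, 24)
```

A production im2col avoids even the patch-gathering loop (stride tricks, fused kernels), but the algebra — conv as W_flat @ patches — is exactly this.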
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
class LeNet5(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 6, kernel_size=5) # (1,28,28) → (6,24,24)
self.conv2 = nn.Conv2d(6, 16, kernel_size=5) # (6,12,12) → (16,8,8)
self.fc1 = nn.Linear(16 * 4 * 4, 120) # (256,) → (120,)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
def forward(self, x):
x = F.max_pool2d(F.relu(self.conv1(x)), 2) # conv1 + relu + pool
x = F.max_pool2d(F.relu(self.conv2(x)), 2) # conv2 + relu + pool
x = x.flatten(1) # keep batch, flatten the rest
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x) # raw logits — CE loss wants these
# Data
tfm = transforms.ToTensor()
train = DataLoader(datasets.MNIST('.', train=True, download=True, transform=tfm), batch_size=128, shuffle=True)
test = DataLoader(datasets.MNIST('.', train=False, download=True, transform=tfm), batch_size=512)
# Model + optim
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = LeNet5().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
model.train()
for xb, yb in train:
xb, yb = xb.to(device), yb.to(device)
opt.zero_grad()
loss = F.cross_entropy(model(xb), yb)
loss.backward()
opt.step()
model.eval()
correct = 0
with torch.no_grad():
for xb, yb in test:
xb, yb = xb.to(device), yb.to(device)
correct += (model(xb).argmax(1) == yb).sum().item()
    print(f"epoch {epoch+1} loss {loss.item():.3f} test acc {100*correct/10000:.1f}%")
print("total parameters:", sum(p.numel() for p in model.parameters()))

epoch 1 loss 0.342 test acc 97.4%
epoch 2 loss 0.087 test acc 98.6%
epoch 3 loss 0.061 test acc 98.9%
epoch 5 loss 0.039 test acc 99.1%
total parameters: 44426
4 nested loops per conv layer ←→ nn.Conv2d — one line, GPU-accelerated; cuDNN does im2col + batched matmul under the hood
no backward implementation at all ←→ loss.backward() traces through conv + pool for free; autograd knows the backward for every torch op
single-image forward ←→ batched forward over 128 images in parallel; the leading dim is batch, and conv/pool broadcast over it
weights frozen at random init ←→ Adam + cross-entropy → 99.1% test accuracy in 5 epochs; the whole learning loop in 30 lines
Now the part where the cake gets eaten. Up to pool2, every tensor in the network is spatially organized — you could point at a neuron and say “this one is watching the upper-left corner of the image.” The flatten step ends that. 16 × 4 × 4 = 256 becomes a flat vector of 256 numbers, the spatial identity of each feature is discarded, and the final three Linear layers treat those 256 numbers the same way an MLP treats raw pixels. The cake has been reduced to a smoothie.
That smoothie is then passed through softmax at inference time to turn the 10 raw logits into a probability distribution over digit classes, and scored against the true label with cross-entropy at training time. Same classification head you've built before, just fed 256 learned features instead of 784 raw pixels. That substitution is the reason LeNet beats the MLP.
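That head is a few lines of NumPy. A minimal sketch with the usual max-shift for numerical stability; F.cross_entropy fuses the same log-softmax into the loss:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for stability; doesn't change the output
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """-log p(true class), computed straight from raw logits."""
    return -np.log(softmax(logits)[label])

logits = np.array([2.0, 5.0, 1.0, 0.5])   # pretend FC3 output for 4 classes
probs = softmax(logits)
print(probs.argmax())                      # 1 - the largest logit wins
print(cross_entropy(logits, 1) < cross_entropy(logits, 0))  # True - right label, lower loss
```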
Forgetting to flatten before the FC head. A Linear layer wants a 1D input per example (2D total, with batch). If you pass in (N, 16, 4, 4) it will error — the cake is still 3D and the classifier won't eat it. Always x.flatten(1) or x.view(N, -1) before the first FC.
Channel dim comes first. PyTorch uses (N, C, H, W) — batch, channel, height, width. If you hand it (N, H, W, C) (the TensorFlow convention), the conv interprets height as channels and you get a silent disaster. MNIST tensors need an explicit channel dim: x.unsqueeze(1) to go from (N, 28, 28) to (N, 1, 28, 28).
Wrong input size to the FC head. If your input is not 28×28 — say you train LeNet on 32×32 CIFAR without adjusting — the feature map after pool2 is no longer 4×4 and the Linear(256, 120) blows up. Either recompute the flattened size from the shape formula, or put an nn.AdaptiveAvgPool2d((4, 4)) just before the flatten to force a canonical shape regardless of input size. This is one of the most common shape bugs in PyTorch code in the wild.
FC-head parameter blowup. Scale the feature map up — say you switch to 224×224 ImageNet-style inputs without pooling more aggressively — and the tensor reaching flatten becomes massive. A (512, 28, 28) flatten feeding a Linear(400K, 4096) is 1.6B parameters from a single layer. This is exactly why modern architectures use AdaptiveAvgPool2d((1, 1)) right before the classifier: collapse every channel to one number and keep the FC head small.
Softmax inside the model when the loss expects logits. F.cross_entropy applies log-softmax internally. If you softmax in the forward pass too, you double-softmax and training crawls. Return raw logits from the model; let the loss handle the normalization.
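You can see the damage in NumPy, no training required. A quick sketch: a second softmax squashes an already-normalized distribution toward uniform, which is what the loss ends up seeing if the model normalizes too:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])
once = softmax(logits)        # what cross-entropy should see (internally)
twice = softmax(once)         # what it sees after a double-softmax
print(once.round(3))          # sharply peaked on class 0
print(twice.round(3))         # nearly uniform -> tiny, useless gradients
```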
Train the LeNet above on MNIST to at least 99.0% test accuracy. Note the parameter count (~44K) and wall-clock training time.
Now train a 2-layer MLP — nn.Sequential(nn.Flatten(), nn.Linear(784, H), nn.ReLU(), nn.Linear(H, 10)) — with H tuned just large enough to hit the same 99.0% test accuracy. How big does H need to be? How many parameters does that MLP have? Spoiler: it either won't hit 99% at all, or it will need an H in the low thousands and a parameter count well past LeNet's.
Bonus: shift all test digits 3 pixels to the right before evaluating. LeNet should degrade gracefully because convolutions are translation-equivariant — the filter doesn't care whether the edge it's looking for is at column 4 or column 7. The MLP will fall off a cliff, because every pixel is a separate feature and they all just moved. That is the “inductive bias” of convolution, quietly earning its keep.
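The conv half of that bonus claim can be checked without any training. A sketch with a throwaway single-channel conv_valid helper: shift the input three pixels with np.roll, and (away from the wrapped border) the feature map shifts by exactly three pixels too:

```python
import numpy as np

def conv_valid(x, w):
    """Single-channel valid conv, stride 1 - enough for the demo."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i+k, j:j+k] * w).sum()
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28))
w = rng.normal(size=(5, 5))
shifted = np.roll(x, 3, axis=1)          # "digit" moves 3 px to the right
a = conv_valid(x, w)
b = conv_valid(shifted, w)
# away from the wrapped border, the response map just moved 3 px too
print(np.allclose(b[:, 3:], a[:, :-3]))  # True
```

An MLP has no such guarantee: each input pixel feeds its own column of weights, so shifting the image scrambles which weights see which strokes.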
What to carry forward. A convnet is a layer cake baked upside down: wide and shallow at the top (big spatial map, few channels), narrow and deep at the bottom (tiny spatial map, many channels). Every conv adds channels. Every pool shrinks space. The last slice of the cake gets flattened and fed to a plain MLP classifier, which is where almost all the parameters actually live — the feature extractor is cheap, the head is expensive. Shape arithmetic is (H + 2p − k) / s + 1 — memorize it; you will use it in every convnet you touch. Feature hierarchy (edges → textures → parts → objects) emerges for free from training a deep enough stack with a classification loss. Modern architectures swap 5×5 for stacked 3×3, replace pool with strided conv, and inject BatchNorm — but the cake-shape is still LeNet.
Next up — Image Classifier (CIFAR-10). You have built the architecture. You have not yet shipped it on anything harder than centered grayscale digits. MNIST is the friendliest vision benchmark that exists; CIFAR-10 is 32×32 color photographs of ten object classes — cats that aren't centered, dogs on textured backgrounds, planes at angles — and LeNet on CIFAR without data augmentation tops out around 65%. Next lesson we scale up the cake: padded 3×3 convs, BatchNorm, deeper stacks, random crops and flips. Same three pieces, stretched into something that actually reads the world. The architecture is yours; now it's time to train it on real images.
- [01] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner · "Gradient-Based Learning Applied to Document Recognition" · Proceedings of the IEEE, 1998 — the original LeNet-5 paper
- [02] Karen Simonyan, Andrew Zisserman · "Very Deep Convolutional Networks for Large-Scale Image Recognition" · ICLR 2015 — the VGG paper, 3×3 everywhere
- [03] Aston Zhang, Zachary Lipton, Mu Li, Alexander Smola · "Dive into Deep Learning" · d2l.ai
- [04] Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton · "ImageNet Classification with Deep Convolutional Neural Networks" · NeurIPS 2012 — AlexNet, the scaling-up of LeNet