Convolution Operation

The kernel that slides, multiplies, sums.

Easy
~15 min read
lesson 1 of 6

Your MNIST classifier did something quietly violent on the way in. The first line of the forward pass was x.flatten(). Seven hundred eighty-four pixels, poured into a vector, every neighbor relationship the image had going for it: gone. Pixel (3, 4) and pixel (3, 5) were one step apart on the grid. In the flat vector they're index 88 and index 89 — which is the same distance as index 0 and index 1, which is the same distance as index 0 and index 783.

The dense layer has no way to know any of that. It sees 784 unrelated numbers. If two pixels belong together, the network has to re-discover it from supervision signal alone, and it has to re-discover it separately for every location in the image because the weights that look at the top-left corner have nothing to do with the weights that look at the bottom-right.

MNIST forgives you. The digits are tiny, centered, and there are 60,000 of them to brute-force the problem away. Swap in a 224×224 RGB ImageNet crop — 150,528 inputs, tens of millions of weights in the first layer alone — and the whole setup falls over. An MLP is asking supervision signal to teach it what a neighbor is. It's asking too much.

The fix is simple and a little insulting. Stop throwing away the grid. Look at small patches at a time. Use the same weights for every patch. That's convolution, and the entire reason computer vision works is that we stopped pretending an image was a bag of numbers.

Convolution (personified)
I'm a rubber stamp. A small square of numbers. You press me at one corner of your image, ink a single number onto a new page, slide me one pixel, press again. Same stamp, every press. By the time I've walked the whole image, I've drawn a new picture — an ink map of wherever my pattern fired. That's the whole job.

Let's make the stamp literal. Given an input of shape (H, W) and a kernel of shape (k, k) — the stamp — the output at position (i, j) is the element-wise product of the stamp with the k×k patch of image directly under it, summed into one scalar. Press, read the blot, slide one pixel, press again. Repeat until you've covered the page.

2D convolution — one output pixel
y[i, j]  =  Σ   Σ   x[i + u, j + v]  ·  k[u, v]
           u=0 v=0

                with  u, v  ∈  [0, k-1]

Two knobs decide the shape of the page you end up with: the stride s — how far the stamp jumps between presses — and the padding p — how many rows and columns of zeros you paste around the border so the stamp can hang off the edge without falling off. Plug them in and the output shape drops out:

the shape arithmetic you will do a thousand times
H'  =  ⌊ (H + 2p − k) / s ⌋ + 1
W'  =  ⌊ (W + 2p − k) / s ⌋ + 1

e.g.  H=28, k=3, s=1, p=1   →   H' = (28 + 2 − 3)/1 + 1 = 28      "same" padding
      H=28, k=3, s=1, p=0   →   H' = (28 + 0 − 3)/1 + 1 = 26      "valid" padding
      H=28, k=3, s=2, p=1   →   H' = ⌊(28 + 2 − 3)/2⌋ + 1 = 14    stride-2 downsample

Memorize the formula. You'll type it into a mental calculator every time you design a conv net, every time you debug a shape mismatch, and every time a PyTorch stack trace complains it was expecting (B, 64, 14, 14) and you handed it (B, 64, 13, 13). Convolutions are unforgiving about one pixel.
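The arithmetic is mechanical enough to script. A throwaway helper (a sketch for sanity-checking layer stacks, not any library's API) reproduces the three examples above:

```python
def conv_out(size: int, k: int, s: int = 1, p: int = 0) -> int:
    """One dimension of conv output shape: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

print(conv_out(28, k=3, s=1, p=1))   # 28   "same" padding
print(conv_out(28, k=3, s=1, p=0))   # 26   "valid" padding
print(conv_out(28, k=3, s=2, p=1))   # 14   stride-2 downsample
```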

Watch the stamp do its thing. The 3×3 grid on the left is the stamp. It parks at one corner of the input, the nine numbers underneath it get multiplied by the nine stamp weights, the products get summed, and that single number is inked into the output on the right. Then the stamp shifts one pixel and presses again.

convolution — a kernel slides, multiplies, sums
out = ⌊(in + 2p − k) / s⌋ + 1,  with in=7, k=3, s=1, p=0  →  output (5, 5)

input (7×7)          kernel (3×3)        output (5×5)
0 0 0 1 0 0 0        -1.0  0.0  1.0       2.0   1.0   0.0  -1.0  -2.0
0 0 1 1 1 0 0        -1.0  0.0  1.0       1.0  -1.0   0.0   1.0  -1.0
0 1 1 0 1 1 0        -1.0  0.0  1.0       1.0  -3.0   0.0   3.0  -1.0
1 1 0 0 0 1 1                             1.0  -1.0   0.0   1.0  -1.0
0 1 1 0 1 1 0                             2.0   1.0   0.0  -1.0  -2.0
0 0 1 1 1 0 0
0 0 0 1 0 0 0

first press:
output[0][0] = 0·(−1) + 0·0 + 0·1  +  0·(−1) + 0·0 + 1·1  +  0·(−1) + 1·0 + 1·1  =  2.0

Two things to notice. First, the output page is smaller than the input — a 7×7 input with a 3×3 stamp and no padding leaves a 5×5 output. That's the shape formula above enforcing itself, and it's why people reach for padding the instant they want to stack more than three or four conv layers without their feature map shrinking to a single dot. Second, the same nine numbers were used at every press. You aren't training 25 × 9 = 225 weights, one set per press. You are training exactly 9. Forever. That's the trick the whole field runs on.

Kernel (personified)
I am nine numbers. That's the whole me. I do not care how big your image is — 28 pixels, 1024 pixels, a billboard — I walk across all of it with the same weights. A fully-connected layer on a megapixel image wants a million weights per output neuron. I want nine. That is my entire pitch.

That last observation is one of three properties that together explain why convolutional nets annihilated dense nets on images.

  • Parameter sharing. One stamp, pressed at every position. A 3×3 stamp on a 224×224 image has 9 weights and gets reused 224² ≈ 50,000 times. A dense layer with the same output size would need 224² × 224² ≈ 2.5 billion weights for a single layer. That's a ~300 million times saving. Not a micro-optimization — the difference between possible and not.
  • Translation equivariance. If a feature — an edge, a corner, a nose — shows up top-left in one image and bottom-right in another, the same stamp fires in both. Train the detector once, it works everywhere. A dense net would have to re-learn “nose” at every location independently, which is as absurd as it sounds.
  • Local connectivity. Each output pixel only reads a small k×k window of the input. Early layers see a small neighborhood; stacked layers progressively see larger regions (the receptive field). This matches how pixel-level structure actually behaves — you need a few neighbors to decide “edge”, not the entire image.
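Translation equivariance is easy to verify numerically. The sketch below uses a naive valid convolution written inline for the demo (not a library call): shift the input one pixel, and the response shifts one pixel.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive cross-correlation, stride 1, no padding."""
    H, W = x.shape
    kh, kw = k.shape
    y = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return y

k = np.array([[-1., 0., 1.],
              [-2., 0., 2.],
              [-1., 0., 1.]])                      # Sobel-x stamp

x = np.zeros((8, 8))
x[2:5, 2:5] = 1.0                                  # a bright square
x_shifted = np.roll(x, shift=(1, 1), axis=(0, 1))  # same square, one pixel down-right

y = conv2d_valid(x, k)
y_shifted = conv2d_valid(x_shifted, k)

# The same detector fires in both places: the shifted input's response is
# the original response, shifted (wherever both outputs are defined).
print(np.allclose(y[:-1, :-1], y_shifted[1:, 1:]))   # True
```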

Before neural nets, computer vision was a pile of hand-designed stamps. Researchers would sit down and write a 3×3 matrix of numbers because they knew that stamp, pressed across an image, would ink a useful picture — an edge map, a blur, a sharpened version. Flip through a few:

filter gallery — how a single kernel changes the world
Vertical edges: positive where intensity increases left-to-right.

kernel (Sobel-x)
-1.00   0.00   1.00
-2.00   0.00   2.00
-1.00   0.00   1.00

output: negative ink where intensity falls left-to-right, positive where it rises; range −4.0 … 4.0 on a unit-intensity input.

The Sobel stamp on the left detects vertical edges because its weights are arranged to compute a horizontal gradient: positive on the right column, negative on the left. Press it across an image and bright blots appear exactly where pixels jump from dark to light horizontally. The blur stamp is uniformly positive — every neighbor contributes equally to the sum — which is, by definition, a local average. The sharpen stamp is subtract-the-blur-from-the- original. Each of these was a Ph.D. thesis in the 1970s.
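For reference, here are the blur and sharpen stamps in their common textbook forms (normalizations vary between sources), with a sanity check on a flat patch: only the gradient stamp should stay silent there.

```python
import numpy as np

box_blur = np.ones((3, 3)) / 9.0              # uniform positive: a local average
sharpen = np.array([[ 0., -1.,  0.],
                    [-1.,  5., -1.],
                    [ 0., -1.,  0.]])         # boosts the center against its neighbors
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])           # horizontal gradient

flat = np.full((3, 3), 7.0)                   # a patch with no edges at all
print(round(float(np.sum(flat * sobel_x)), 6))    # 0.0  (no gradient, no ink)
print(round(float(np.sum(flat * box_blur)), 6))   # 7.0  (average of a flat patch)
print(round(float(np.sum(flat * sharpen)), 6))    # 7.0  (nothing to sharpen)
```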

Here's the punchline. A conv network starts with random weights in its stamps, and after a few epochs of training on images, the stamps it learns look, empirically, a lot like these hand-designed ones. The first conv layer of a trained ImageNet model contains Gabor-like edge detectors, color blobs, little oriented line detectors — the same library of features vision scientists reverse-engineered out of the mammalian V1 cortex in the 1960s. The network rediscovers them because they're the right answer. We just stopped writing them by hand.

Padding (personified)
Without me, every conv layer eats pixels off the border. A 3×3 stamp chews one pixel off each side per layer — two per dimension — so after 15 layers your 32×32 image is a 2×2 patch and you're out of signal. I paste zeros around the edge so the stamp can hang off the paper and the output stays the same size as the input. Nobody thinks about me until they're deep into a ResNet and notice the only reason it works is that every block keeps the spatial size constant.

Two generalizations and we have the whole picture.

Multi-channel input. An RGB image isn't (H, W) — it's (3, H, W). To press a stamp on it, the stamp grows a third dimension too: it becomes (3, k, k). At each spatial position you sum the element-wise products across all three channels at once. Still one scalar of ink per press. The channel dimension gets folded into the dot product.

Multi-filter banks. One stamp gives you one feature map. Usually you want many — say 64 different edge / color / texture detectors running in parallel. Stack 64 stamps into a filter bank, and the weight tensor becomes (out_ch, in_ch, k, k). Run the input through all 64 and the output is (64, H', W') — 64 stacked feature maps, one per stamp, one tensor with channel depth.
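Both generalizations fit in a short NumPy sketch (sizes invented for the demo: 3 input channels, 64 filters, an 8×8 image):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # (in_ch, H, W): one RGB-shaped input
w = rng.standard_normal((64, 3, 3, 3))   # (out_ch, in_ch, k, k): a 64-stamp bank

H_out = W_out = 8 - 3 + 1                # valid padding, stride 1
y = np.zeros((64, H_out, W_out))
for i in range(H_out):
    for j in range(W_out):
        # one press: the channels AND the k x k window fold into a single
        # dot product per stamp, one scalar of ink per output channel
        y[:, i, j] = np.einsum('chw,ochw->o', x[:, i:i+3, j:j+3], w)

print(y.shape)   # (64, 6, 6): 64 stacked feature maps, one per stamp
```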

Put it together with a batch dimension and the canonical PyTorch shape is:

the shapes that matter in a real Conv2d
input   :   (B,  in_ch,  H,   W )
weight  :   (out_ch,  in_ch,  k_h,  k_w)
bias    :   (out_ch,)
output  :   (B,  out_ch,  H',  W')

Three implementations of the same operation. Pure Python to show the nested sum with nothing hidden. NumPy to vectorize what we can. PyTorch to hand the whole thing to a library that will run it on a GPU, differentiate it, and give you production-grade speed in seven lines. Same progression as gradient descent: see the mechanics, scale them, then cede them.

layer 1 — pure python · conv_scratch.py
python
# 2D convolution, no channels, no batch — the textbook version.
def conv2d_naive(x, k):
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out_h, out_w = H - kh + 1, W - kw + 1         # no padding, stride 1
    y = [[0.0] * out_w for _ in range(out_h)]

    for i in range(out_h):                        # slide down
        for j in range(out_w):                    # slide across
            s = 0.0
            for u in range(kh):                   # window rows
                for v in range(kw):               # window cols
                    s += x[i + u][j + v] * k[u][v]
            y[i][j] = s
    return y

x = [[1, 2, 3, 0, 1],
     [0, 1, 2, 3, 4],
     [2, 1, 0, 1, 2],
     [3, 0, 1, 2, 3],
     [1, 2, 3, 4, 0]]

# vertical-edge detector (Sobel-x)
k = [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]]

y = conv2d_naive(x, k)
print("input  (5x5):"); [print('', row) for row in x]
print("output (3x3):"); [print('', row) for row in y]
stdout
input  (5x5):
 [1, 2, 3, 0, 1]
 [0, 1, 2, 3, 4]
 [2, 1, 0, 1, 2]
 [3, 0, 1, 2, 3]
 [1, 2, 3, 4, 0]
output (3x3):
 [4.0, 2.0, 4.0]
 [-4.0, 4.0, 8.0]
 [-4.0, 6.0, 3.0]

Four nested loops. Correct, obvious, unbearably slow. A 224×224 image through a 64-filter bank running on this code on CPU would outlast a coffee break. Vectorize.

layer 2 — numpy · conv_numpy.py
python
import numpy as np

def conv2d_numpy(x, w, b=None):
    """
    x : (B, in_ch, H, W)
    w : (out_ch, in_ch, kh, kw)
    b : (out_ch,)  or None
    returns (B, out_ch, H-kh+1, W-kw+1)  — valid padding, stride 1.
    """
    B, in_ch, H, W = x.shape
    out_ch, _, kh, kw = w.shape
    H_out, W_out = H - kh + 1, W - kw + 1

    y = np.zeros((B, out_ch, H_out, W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # window: (B, in_ch, kh, kw)
            window = x[:, :, i:i+kh, j:j+kw]
            # einsum collapses in_ch, kh, kw — leaves (B, out_ch)
            y[:, :, i, j] = np.einsum('bchw,ochw->bo', window, w)
    if b is not None:
        y += b.reshape(1, -1, 1, 1)
    return y

# one image, one channel, one 3x3 filter
x = np.array([[1, 2, 3, 0, 1],
              [0, 1, 2, 3, 4],
              [2, 1, 0, 1, 2],
              [3, 0, 1, 2, 3],
              [1, 2, 3, 4, 0]], dtype=np.float32).reshape(1, 1, 5, 5)

k = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]], dtype=np.float32).reshape(1, 1, 3, 3)

# pretend we have a 28x28 input and a bank of 1 filter, valid padding
big = np.random.randn(1, 1, 28, 28).astype(np.float32)
y_big = conv2d_numpy(big, k)
print("output shape:", y_big.shape)

y_small = conv2d_numpy(x, k)
print("top-left 3x3 of output:\n", y_small[0, 0])
stdout
output shape: (1, 1, 26, 26)
top-left 3x3 of output:
 [[ 4.  2.  4.]
  [-4.  4.  8.]
  [-4.  6.  3.]]
pure python → numpy

  • for u in range(kh): for v in range(kw): s += x[i+u][j+v] * k[u][v]  →  np.einsum('bchw,ochw->bo', window, w)
    (the inner window × kernel dot product becomes a single einsum)
  • no channels, (H, W) only  →  (B, in_ch, H, W)
    (real conv layers always have a batch and channel axis)
  • no bias  →  y += b.reshape(1, -1, 1, 1)
    (one bias scalar per output channel, broadcast across spatial dims)

Now the PyTorch version. In real code you'll never write either of the above. One line — nn.Conv2d(in_ch, out_ch, k) — and a tuned library (cuDNN on NVIDIA, MPS on Mac) runs the operation orders of magnitude faster than anything you can manage in Python.

layer 3 — pytorch · conv_pytorch.py
python
import torch
import torch.nn as nn

# A canonical early-layer conv: RGB input, 16 learned filters, 3x3 kernel,
# stride 1, padding 1 (→ "same" padding → output H,W == input H,W).
conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=1,
    padding=1,
)

# Inspect the weight tensor shape — this is the thing most people get wrong.
# It is (out_ch, in_ch, kH, kW) — out-channels first.
print("weight shape:", conv.weight.shape)     # (16, 3, 3, 3)
print("bias shape:  ", conv.bias.shape)       # (16,)

# Batch of 8 RGB 32x32 images.
x = torch.randn(8, 3, 32, 32)
y = conv(x)
print("input shape: ", x.shape)               # (8, 3, 32, 32)
print("output shape:", y.shape)               # (8, 16, 32, 32)

# Total learnable params in this one layer:
#   weight: 16 * 3 * 3 * 3 = 432
#   bias:   16
#   total:  448     — and those 448 numbers are reused across all 32*32 = 1024
#                     spatial positions of every image in the batch.
stdout
weight shape: torch.Size([16, 3, 3, 3])
bias shape:   torch.Size([16])
input shape:  torch.Size([8, 3, 32, 32])
output shape: torch.Size([8, 16, 32, 32])    # same spatial size thanks to padding=1
numpy → pytorch

  • conv2d_numpy(x, w, b)  →  conv = nn.Conv2d(3, 16, 3, padding=1); y = conv(x)
    (the layer owns its weight and bias tensors as nn.Parameters)
  • manual padding / stride arithmetic  →  padding=1, stride=1 arguments
    (PyTorch handles the shape math for you, but it will not forgive a mismatch)
  • nested for loops over spatial positions  →  cuDNN kernel on GPU
    (under the hood, conv becomes a giant matrix multiply: im2col + GEMM)
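That im2col + GEMM trick is worth seeing once: unfold every k×k patch into a row of a matrix, and the whole slide-multiply-sum collapses into one matrix multiply. A single-channel toy sketch (illustrative only; real implementations handle channels, stride, and memory layout very differently):

```python
import numpy as np

def im2col_conv(x, w):
    """Valid conv, stride 1, via im2col + one GEMM. x: (H, W), w: (k, k)."""
    H, W = x.shape
    k = w.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((Ho * Wo, k * k))
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = x[i:i+k, j:j+k].ravel()   # one patch per row
    return (cols @ w.ravel()).reshape(Ho, Wo)            # the GEMM

x = np.arange(25, dtype=float).reshape(5, 5)   # a perfect left-to-right ramp
w = np.array([[-1., 0., 1.],
              [-2., 0., 2.],
              [-1., 0., 1.]])                  # Sobel-x

print(im2col_conv(x, w))   # constant 8.0 everywhere: the gradient of a ramp
```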

Gotchas

Same vs valid padding: padding=0 (“valid”) means no zeros pasted — the stamp never hangs off the page — so the output shrinks by k − 1 pixels every layer. padding=(k-1)/2 (“same”, for odd k) keeps the spatial size constant. Stack valid conv layers and you'll hit a 0-pixel feature map sooner than you'd guess. ResNets, VGGs, basically every modern conv block use same padding for a reason.
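A throwaway loop (reusing the shape formula above) shows how fast valid padding runs out of image:

```python
def conv_out(size, k=3, s=1, p=0):
    return (size + 2 * p - k) // s + 1

size, layers = 32, 0
while size >= 3:              # while a 3x3 stamp still fits
    size = conv_out(size)     # valid padding: lose k - 1 = 2 pixels per layer
    layers += 1

print(layers, size)           # 15 layers in, a 32x32 map is down to 2x2
```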

Weight shape order: PyTorch stores the weight as (out_ch, in_ch, kH, kW). TensorFlow defaults to (kH, kW, in_ch, out_ch). Port a model between them without permuting and you'll get silent garbage outputs with no error message — the worst kind of bug.
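The safe port is an axis permutation, never a reshape. A NumPy sketch of the two layouts (plain arrays standing in for the framework weight tensors):

```python
import numpy as np

# PyTorch layout: (out_ch, in_ch, kH, kW). TensorFlow layout: (kH, kW, in_ch, out_ch).
w_pt = np.arange(16 * 3 * 3 * 3, dtype=np.float32).reshape(16, 3, 3, 3)

w_tf = w_pt.transpose(2, 3, 1, 0)     # correct: move axes, keep every filter intact
print(w_tf.shape)                     # (3, 3, 3, 16)

# Filter 5, input channel 2, kernel position (0, 0) is the same number in both:
assert w_pt[5, 2, 0, 0] == w_tf[0, 0, 2, 5]

# A reshape has the right element count, so it "runs" -- and scrambles every filter:
w_bad = w_pt.reshape(3, 3, 3, 16)
print(np.array_equal(w_tf, w_bad))    # False: the silent-garbage port
```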

Forgetting the channel dimension: the most common bug in early CNN code is writing a kernel of shape (k, k) for a 3-channel image. PyTorch refuses, but refuses with a stack trace three screens long. The kernel must always have an in_ch axis, even if in_ch == 1.

Off-by-one in the shape formula: the floor in ⌊(H + 2p − k)/s⌋ + 1 is not a formality. With H=28, k=3, s=2, p=0 you get 13, not 13.5. If your loss is NaN at step one, check the output shape before anything else.

Hand-code a Sobel edge detector as a Conv2d

Build an nn.Conv2d(1, 2, kernel_size=3, padding=1, bias=False) — 1 input channel, 2 output channels, 3×3 kernels, no bias. Manually set the two 3×3 weight matrices to the horizontal and vertical Sobel stamps: [[-1,0,1],[-2,0,2],[-1,0,1]] and [[-1,-2,-1],[0,0,0],[1,2,1]] .

Load a grayscale image (any (1, 1, H, W) tensor — a letter glyph from MNIST is fine), press your Conv2d across it, and plot both output channels. You should see a vertical-edge map in channel 0 and a horizontal-edge map in channel 1.

Verify correctness: run scipy.ndimage.sobel(img, axis=1) on the same image and check that it matches your Conv2d output (up to sign convention and a border-handling difference). When it matches, you've just proved to yourself that the operation scipy has spent 20 years optimizing is the same operation PyTorch learns from scratch in any CNN.

Bonus: set conv.weight.requires_grad = True, run a tiny edge-detection training loop, and watch the initial Sobel weights drift. What's the network finding that Sobel missed?

What to carry forward. A convolution is a small stamp sliding across a grid, pressing one dot product per position onto a new page. Same stamp everywhere (parameter sharing), same detector works at any location (translation equivariance), each press only reads a local neighborhood (local connectivity). Three knobs shape the output: kernel size, stride, padding. Output shape is ⌊(H + 2p − k)/s⌋ + 1, and you will use that formula more often than you'd like. PyTorch's nn.Conv2d stores weights as (out_ch, in_ch, k, k); miss the channel dimension or confuse the order and nothing works.

Next up — Pooling. A conv layer keeps the spatial size (with same padding) — which means your feature maps keep growing. Stack 64 filters, then 128, then 256, and the tensor at layer 10 is enormous. A second operation earns its keep by trimming them down: take a 2×2 window, keep the max or the mean, throw the rest away. Pooling keeps the gist, drops the pixels. That's what turns a stack of conv layers into a real CNN.
