Pooling

Max, average, and why we downsample.

Easy
~15 min read
· lesson 2 of 6

A convolution at full resolution is expensive — and most of what it produces is redundant. A 224×224 image through one 64-channel conv is 224 · 224 · 64 ≈ 3.2M activations. Stack six of those and you're hauling around twenty million floats per forward pass, most of them encoding minor variations of “this same edge, shifted over by one pixel.” The network is doing the work of a detail-obsessed intern: faithful, exhaustive, and mostly wasted.

Here's the anchor for the whole lesson: the thumbnail — keep the gist, drop the pixels. When your phone generates a thumbnail of a photo, it keeps the subject (“there's a face, roughly here”) and throws away the pixel-level particulars (“the exact position of the left eye”). That's precisely what you want to do to a feature map once the conv has detected its features. You don't need the edge at pixel (37, 42). You need to know the edge fired in this general neighborhood.

That's pooling. Slide a 2×2 window with stride 2, collapse each window to a single number, and you've halved both spatial dimensions — quartered the activation count — for the cost of a comparison. No weights. Nothing to learn. Just a shrink.

Pooling (personified)
I have no weights. I do not learn. I look at a neighborhood, summarize it in one number, and throw the rest away. Your downstream layers thank me: they see the same features at a quarter the cost and twice the effective reach.

The output shape follows the same arithmetic you already memorized for convolutions. Window size k, stride s, no padding:

output shape — same arithmetic as conv, no padding
H_out  =  ⌊ (H_in − k) / s ⌋ + 1
W_out  =  ⌊ (W_in − k) / s ⌋ + 1

typical:   k = 2,  s = 2
           →  H_out = H_in / 2,  W_out = W_in / 2
           channels unchanged

Two things trip people up. Channels don't change. Pooling runs per-channel — a 32-channel feature map goes in, 32 channels come out, just smaller in H and W. Each channel gets its own independent thumbnail. And stride equals kernel size in the textbook 2×2 pool: windows tile the input edge to edge, no overlap. Overlapping pools exist (AlexNet used 3×3 kernel with stride 2), but the modern default is clean tiling.
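A quick numeric check of that arithmetic (the helper `pool_out` is my own, not a library function):

```python
import numpy as np

def pool_out(h_in, k, s):
    """One spatial dim, no padding: floor((h_in - k) / s) + 1."""
    return (h_in - k) // s + 1

print(pool_out(224, k=2, s=2))   # 112 — textbook pool halves exactly
print(pool_out(224, k=3, s=2))   # 111 — AlexNet-style overlapping pool

# channels pass through untouched: each channel is pooled independently
x = np.random.rand(32, 8, 8)                         # (C, H, W)
pooled = x.reshape(32, 4, 2, 4, 2).max(axis=(2, 4))  # 2x2 max pool per channel
print(pooled.shape)                                  # (32, 4, 4) — C unchanged
```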

Two flavors of thumbnail. The worked example below shows how max pooling summarizes a concrete 8×8 input; average pooling would replace each window's max with its mean.

pooling — a 2×2 window strides across, collapsing each block
in (8×8) · k=2 · stride=2 → out (4×4)

input:
  5 3 0 0 0 0 0 3
  5 8 6 7 2 1 5 3
  5 6 1 6 4 6 3 1
  7 4 0 3 0 0 2 9
  3 7 2 3 2 7 0 5
  0 1 4 1 1 2 9 1
  1 7 1 1 2 9 9 7
  1 3 0 7 1 2 1 6

max-pool output:
  8 7 2 5
  7 6 6 9
  7 4 7 9
  7 7 9 9

e.g. the top-left window [5, 3, 5, 8] → 8

Max pool is the ruthless thumbnail. Each 2×2 window forwards exactly one value — the loudest — and discards the other three without ceremony. If a neuron anywhere in that window was screaming about a vertical edge, max pool forwards the scream at full volume. This is the right instinct mid-network, where a conv filter's whole personality is “I fire hard when I see my feature and stay quiet otherwise” — you want to preserve the fire.

Average pool is the soft thumbnail. It smooths. Three zeros and a nine become a 2.25, not a 9. Fine if what you want is a summary; actively wrong if the signal you care about is precisely “the loudest value in this neighborhood.” You don't average a smoke alarm with three silent rooms and call that a fire detector.
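The smoke-alarm contrast, in numbers — a minimal sketch:

```python
import numpy as np

window = np.array([[0, 0],
                   [0, 9]])     # one loud detection, three silent cells

print(window.max())    # 9    — max pool forwards the scream intact
print(window.mean())   # 2.25 — avg pool dilutes it into a murmur
```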

Max pool (personified)
Winner-take-all. Four values in, one value out — the loudest. If your convolution detected an edge anywhere in my window, I preserve the detection at full strength. I lose position information within the window, which is exactly the point: downstream layers need to know the feature is here-ish, not that it sat at pixel (5, 7).

Here's why pooling earns its keep. That one shrink does three jobs at the same time.

Job one: cost. Spatial dims halve, activations quarter. Every downstream conv runs on a smaller grid. On a deep network this compounds into the difference between “trainable on a single GPU” and “trainable on a cluster your lab doesn't own.”

Job two: translation invariance. Nudge the input over by one pixel. The max over a 2×2 window almost certainly picks the same value. The output barely moves. That's not a philosophical claim about invariance in the deep sense — it's a mechanical fact about what happens when you take a max over a small region. The thumbnail doesn't care about subpixel jitter, and neither does the next layer up.
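You can watch that mechanical fact directly. A sketch (the reshape-based `max_pool_2x2` helper is my own; a shift that crossed a window boundary would move the output by one cell — the "barely" in "barely moves"):

```python
import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.zeros((8, 8))
x[3, 4] = 1.0                      # a lone feature activation

shifted = np.roll(x, 1, axis=1)    # nudge the whole map right by one pixel

# (3, 4) and (3, 5) land in the same 2x2 window, so the pooled maps match
print(np.array_equal(max_pool_2x2(x), max_pool_2x2(shifted)))   # True
```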

Job three: receptive field. This one is quieter and more important. Each pool roughly doubles the effective receptive field of every downstream neuron. A 3×3 conv sees a 3×3 patch of its input. Stack two 3×3 convs and the top one sees 5×5 of the original. Slip a pool between them and the top conv now sees an 8×8 region — each of its 3×3 inputs is already a 2×2 summary of first-conv outputs, which themselves overlap 3×3 patches of the original. Four (conv, pool) blocks and the final layer is reasoning over a ~76×76 chunk of the original image (by the simple doubling rule; exact receptive-field arithmetic gives a somewhat smaller 46×46) while computing on a ~14×14 grid.

That's the whole game. You want a neuron deep in the network to see enough of the image to recognize a dog's face. You can get there with (a) huge kernels early — expensive and clumsy — or (b) small kernels with repeated shrinks. CNN architecture is a long, committed bet on option (b).

downsampling chain — spatial dims shrink, channels grow
conv doubles C · pool halves H, W · receptive field roughly doubles per pool

layer    shape        RF
input    3×64×64       1
conv1    6×64×64       3
pool1    6×32×32       6
conv2    12×32×32      8
pool2    12×16×16     16
conv3    24×16×16     18
pool3    24×8×8       36
conv4    48×8×8       38
pool4    48×4×4       76

convs: 3×3, channels double · pools: 2×2 stride 2, H and W halve
final: shape 48×4×4 · receptive field 76×76 · 13.9K params total
(the RF column uses the simple +2-per-conv, ×2-per-pool bookkeeping; exact receptive-field arithmetic gives somewhat smaller numbers, but the growth pattern is the point)
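The chain's shapes and its ~13.9K parameter count can be reproduced in a few lines of PyTorch (channel widths 3 → 6 → 12 → 24 → 48, 3×3 convs with padding 1, as in the chain above):

```python
import torch
import torch.nn as nn

blocks, c_in = [], 3
for c_out in (6, 12, 24, 48):
    blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
               nn.ReLU(),
               nn.MaxPool2d(2, 2)]
    c_in = c_out
net = nn.Sequential(*blocks)

x = torch.randn(1, 3, 64, 64)
print(net(x).shape)                                # torch.Size([1, 48, 4, 4])
print(sum(p.numel() for p in net.parameters()))    # 13860 — the ~13.9K above
```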

Notice the inverted pyramid. Spatial dims halve at every pool. Channels grow the other way — doubling at every conv (6 → 12 → 24 → 48 in the chain above; 32 → 64 → 128 → 256 in a typical real net) — because as each neuron sees more of the input, it takes more channels to describe all the things that bigger chunk might contain. Shallow layers recognize edges. Deep layers recognize object parts. A face-detector genuinely needs more channels than an edge-detector. Total activations per stage fall gently (H·W shrinks 4× per pool while C only doubles — a net 2× reduction per block) while the semantic density climbs.

One more thumbnail, turned up to eleven: global average pooling (GAP). Instead of a 2×2 window, the window is the entire feature map. A 7×7×512 tensor goes in, a 1×1×512 vector comes out — each channel squeezed to a single number, its mean over every spatial position. It's the softest possible summary: one number per channel for the whole image.

Until global average pooling caught on (circa 2013), the standard classifier tail was conv → flatten → dense(4096) → dense(num_classes). That flatten-then-dense combo is enormous — a 7·7·512 = 25,088-dim flatten feeding a dense(4096) layer gives you 25,088 · 4096 ≈ 100M parameters in a single layer. Those are VGG-16's numbers — most of its 138M weights live in the dense tail — and AlexNet had the same disease. Worse, it's brittle: flatten assumes the feature map is exactly 7×7. Change input size, rebuild the classifier.

GAP replaces the whole thing with global_avg_pool → dense(num_classes). No flatten, no 25K-dim vector, zero parameters in the pool, and the network suddenly works on any input size — because the pool collapses whatever H×W you hand it to 1×1. That's why every serious classifier from GoogLeNet onward ends with a GAP before the head.
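Both claims — any input size, massive parameter savings — are easy to check. A sketch assuming the 7×7×512 feature map from above and a 1000-class head:

```python
import torch
import torch.nn as nn

gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),   # any H x W -> 1 x 1
                         nn.Flatten(),
                         nn.Linear(512, 1000))

# same head, any spatial size — no rebuild needed
for hw in (7, 9, 14):
    print(gap_head(torch.randn(1, 512, hw, hw)).shape)   # always (1, 1000)

# parameter comparison with the old flatten -> dense(4096) -> dense(1000) tail
old_tail = 7*7*512 * 4096 + 4096 + 4096 * 1000 + 1000    # ~106.9M
new_tail = sum(p.numel() for p in gap_head.parameters()) # 512*1000 + 1000
print(old_tail, new_tail)
```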

Global avg pool (personified)
I am the summary. Give me a feature map and I return one number per channel — the channel's average response over the whole image. I erase the 100-million-parameter dense layers your grandparents' networks used, and I refuse to overfit because I have no weights to overfit with. Every good classification head ends with me.

Time to write it. Max pool with a 2×2 window, stride 2, three ways: nested Python loops, a NumPy reshape trick, the one-line PyTorch call. All three produce identical numbers on the same input.

layer 1 — pure python · pooling_scratch.py
python
def max_pool_2x2(x):
    """x is a 2D list (H, W). Returns H/2 × W/2."""
    H, W = len(x), len(x[0])
    out = [[0] * (W // 2) for _ in range(H // 2)]
    for i in range(H // 2):
        for j in range(W // 2):
            window = [
                x[2*i    ][2*j], x[2*i    ][2*j + 1],
                x[2*i + 1][2*j], x[2*i + 1][2*j + 1],
            ]
            out[i][j] = max(window)            # the whole operation: one max()
    return out

x = [[1, 3, 2, 1],
     [4, 2, 1, 0],
     [5, 6, 1, 2],
     [7, 8, 3, 4]]

for row in max_pool_2x2(x):
    print(row)
stdout
[4, 2]
[8, 4]

Two nested loops, one max() call. That's the entire algorithm — every line you've read so far was motivation. The pure-Python version is plenty for a 4×4 grid; on a real feature map with channels and batch, we vectorize.

layer 2 — numpy · pooling_numpy.py
python
import numpy as np

def pool2d(x, mode="max"):
    """x is (H, W) with H, W divisible by 2. Returns H/2 × W/2."""
    H, W = x.shape
    # Key trick: reshape into (H/2, 2, W/2, 2) — the 2s are the within-window axes.
    tiles = x.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return tiles.max(axis=(1, 3))          # collapse the two window axes
    else:
        return tiles.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 1, 0],
              [5, 6, 1, 2],
              [7, 8, 3, 4]])

print("max: ", pool2d(x, mode="max"))
print("avg: ", pool2d(x, mode="avg"))
stdout
max:  [[4 2]
 [8 4]]
avg:  [[2.5 1. ]
 [6.5 2.5]]
pure python → numpy
for i in H/2: for j in W/2: …  ←→  x.reshape(H // 2, 2, W // 2, 2)
  the 2s are the within-window axes — reduce along them
max(window)  ←→  tiles.max(axis=(1, 3))
  broadcasted max, runs as a single kernel
change max() to sum()/4 for avg-pool  ←→  tiles.mean(axis=(1, 3))
  one function, mode flag — same shape math

PyTorch ships all of this. You won't hand-roll a pool in production, but you will have to pick between MaxPool2d, AvgPool2d, and the adaptive (global) variants — which is why it's worth knowing all three.

layer 3 — pytorch · pooling_pytorch.py
python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 16, 16)                  # (batch, channels, H, W)

max_pool   = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool   = nn.AvgPool2d(kernel_size=2, stride=2)
gap        = nn.AdaptiveAvgPool2d(output_size=1)   # reduces any H×W to 1×1

print("input shape:         ", x.shape)
print("after maxpool 2x2:   ", max_pool(x).shape)
print("after avgpool 2x2:   ", avg_pool(x).shape)

gap_out = gap(x)
print("after global avgpool:", gap_out.shape)
print("GAP output (flat):   ", gap_out.flatten(1).shape)   # ready for nn.Linear

# Note: pooling has no parameters.
print("params in maxpool:", sum(p.numel() for p in max_pool.parameters()))
stdout
input shape:          torch.Size([1, 32, 16, 16])
after maxpool 2x2:    torch.Size([1, 32, 8, 8])
after avgpool 2x2:    torch.Size([1, 32, 8, 8])
after global avgpool: torch.Size([1, 32, 1, 1])
GAP output (flat):    torch.Size([1, 32])
params in maxpool: 0
numpy → pytorch
x.reshape(...).max(axis=(1, 3))  ←→  nn.MaxPool2d(2, 2)(x)
  handles any kernel/stride, runs on GPU, no shape gymnastics
pool2d(x, mode="avg")  ←→  nn.AvgPool2d(2, 2)(x)
  same call signature — mode becomes the class name
x.mean()  # whole 2-D map  ←→  nn.AdaptiveAvgPool2d(1)(x)
  GAP — collapses any H×W to 1×1, size-agnostic

Gotchas

Stride ≠ kernel size: the familiar 2×2 pool has kernel_size = stride = 2, but those are two independent knobs. A 3×3 kernel with stride 2 (AlexNet-style overlapping pool) is a different animal — the windows overlap and H_out arithmetic changes. Write it out before you ship it.

Pool has no parameters: if an architecture diagram lists pool layers in the param count, someone drew it wrong. You can insert or remove pool layers without touching the weight file — only the activation shapes move. Handy when converting a classifier to a fully-convolutional segmentation net.

Avg-pool mid-network: smooths away the sharp responses your conv layers worked hard to produce. Unless you have a specific reason (some segmentation nets, some low-level image processing), mid-network pools should be max. Save avg for the classifier tail (GAP).

Odd input dims: a 2×2 stride-2 pool on a 7×7 map produces a 3×3 output — the last row and column vanish silently. Either pad, use AdaptiveAvgPool2d, or make your conv strides keep the spatial dims divisible by your pool stride.
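The first and last gotchas are quick to see in PyTorch (shapes only, random input; `ceil_mode` is the built-in escape hatch for ragged edges):

```python
import torch
import torch.nn as nn

x16 = torch.randn(1, 1, 16, 16)
print(nn.MaxPool2d(2, 2)(x16).shape)   # (1, 1, 8, 8): floor((16-2)/2)+1 = 8
print(nn.MaxPool2d(3, 2)(x16).shape)   # (1, 1, 7, 7): floor((16-3)/2)+1 = 7

x7 = torch.randn(1, 1, 7, 7)
print(nn.MaxPool2d(2, 2)(x7).shape)                  # (1, 1, 3, 3) — last row/col silently dropped
print(nn.MaxPool2d(2, 2, ceil_mode=True)(x7).shape)  # (1, 1, 4, 4) — ragged final window kept
```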

Build a feature pyramid

Stack four (Conv2d 3×3, ReLU, MaxPool2d 2×2) blocks. Start with an input of shape (1, 3, 64, 64) and channel widths 3 → 32 → 64 → 128 → 256. Print the output shape after each block.

You should see spatial dims halve at every block: 64 → 32 → 16 → 8 → 4, while channels grow: 3 → 32 → 64 → 128 → 256. That's the canonical feature-pyramid geometry.

Now compute the receptive field at each depth with the same simple bookkeeping: a 3×3 conv adds 2, a 2×2 stride-2 pool doubles the running total. Work it out by hand: at depth 4 you should land on 76 — wider than the 64-pixel input itself, so the deepest cells can in principle see the whole image. (Exact receptive-field arithmetic, which scales each layer's contribution by the accumulated stride, gives a smaller 46; the doubling rule is a quick approximation.)

Bonus: replace every MaxPool2d(2, 2) with a stride-2 convolution (Conv2d(..., stride=2)). Measure the parameter count difference and the output shape. This is the modern move that largely replaced pooling from the all-convolutional nets and ResNet onward.
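A sketch of that swap on a single 64-channel block (kernel 3, padding 1 are my choices — any stride-2 conv that preserves the halving works):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)

pool  = nn.MaxPool2d(2, 2)
sconv = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

print(pool(x).shape, sconv(x).shape)   # both (1, 64, 4, 4) — same downsample
print(sum(p.numel() for p in pool.parameters()))    # 0 — nothing to learn
print(sum(p.numel() for p in sconv.parameters()))   # 64*64*9 + 64 = 36928
```

The trade: the strided conv buys a learned, feature-aware downsample at the cost of ~37K parameters per block; the pool is free.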

What to carry forward. Pooling is the thumbnail step: keep the gist, drop the pixels. Max pool preserves sharp feature responses in the middle of a network; global average pool replaces the flatten-plus-dense monster at the classifier head and shaves millions of parameters. The same shrink buys you three things at once — cheaper compute, a little translation invariance, and a doubled receptive field downstream — which is why a zero-parameter layer has survived three architectural eras. The operation has no gradient of its own to learn: max pool routes the upstream gradient to whichever input cell was the winner, avg pool splits it evenly across the window. Nothing to train.
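The gradient-routing claim is easy to verify on a single 2×2 window:

```python
import torch
import torch.nn.functional as F

a = torch.tensor([[[[1., 2.], [3., 4.]]]], requires_grad=True)
F.max_pool2d(a, 2).backward()
print(a.grad.view(2, 2))   # [[0, 0], [0, 1]] — all of it goes to the winner

b = torch.tensor([[[[1., 2.], [3., 4.]]]], requires_grad=True)
F.avg_pool2d(b, 2).backward()
print(b.grad.view(2, 2))   # [[0.25, 0.25], [0.25, 0.25]] — split evenly
```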

Next up — Build a CNN. You now have every part: convolutions that stamp features onto a map, activations that keep the network non-linear, and pools that summarize. The next lesson bolts them together into a working image classifier — the LeNet-to-VGG lineage in roughly forty lines — and ends with a fully-connected head that turns the final thumbnail into class probabilities. Stamps plus thumbnails plus a fully-connected head, trained end-to-end. A real conv net, finally assembled.
