MLP from Scratch

A full multi-layer perceptron in pure NumPy.

Medium · ~15 min read · lesson 5 of 6

You've been buying parts. A neuron — weighted sum plus activation, one dial at a time. A loss that squashes a whole prediction down to one number that says how wrong. A softmax output that turns raw scores into probabilities that actually sum to one. A backward pass that reads the chain rule in reverse, plus its across-layers version that doesn't flinch at arbitrary depth. And an update rule that nudges the dials the tiny opposite-of-the-slope amount.

Four lessons of parts in boxes. Today you bolt them together into a kit car. Forward pass is the fuel intake — data goes in, the engine squeezes it through the cylinders, a prediction comes out. Loss is combustion — the measurement of how wrong. Backward pass is the exhaust, pushed back through the system to tune the next firing. Four strokes per cycle. One epoch per drive around the block. This is the lesson where the kit becomes a car.

By the end of this page you'll have (a) a working 2-input classifier training live in your browser, (b) a visceral feel for why hidden layers are coordinate systems the network invents on the fly, and (c) forty lines of NumPy that generalize to any width and depth you hand them. No new math. Just the assembly.

Don't read anything else yet. Hit train below on the moons dataset with 8 hidden neurons. Within a second or two the decision boundary morphs from a random scribble into a clean, S-curved wall that separates the two classes. That's the kit car starting. It sputters, it catches, it runs. Pure JavaScript gradient descent, no library doing the heavy lifting — the exact algorithm you just derived, four strokes per cycle, running in real time on the page.

[interactive: live MLP — watch the decision boundary form. Architecture 2 → 8 (ReLU) → 1 (sigmoid), trained with BCE + SGD; live readouts for step, loss (starts near 0.700), and accuracy.]

Play with the knobs. Switch to circles — classic concentric rings — and try to train with H = 1. It cannot work: one hidden neuron is one line, and no single line separates a ring from its middle. Try H = 2; barely better. By H = 4 it finds a closed region. By H = 8 it's clean. By H = 32 it's overkill — faster to converge but no better at the end. That progression is what “make the network wider” means in one plot: capacity scales with width until the problem becomes solvable, then keeps scaling for nothing but speed.

Zoom in on what the hidden layer actually does. Take XOR — the four corners of a unit square, labelled in a pinwheel — which a single neuron cannot solve because no straight line separates the classes. One hidden layer with two ReLU neurons is enough. But why? The hidden layer isn't classifying. It's remapping. It bends the plane until a straight line finally works.

[interactive: hidden layers warp the input space, h = ReLU(W · x + b). Left panel: input space (XOR, not linearly separable). Right panel: hidden space (h₁, h₂), where a single line divides the classes.]

The default weights tilt and fold the XOR corners so that the two labelled-1 points end up at (0, 1) and (1, 0) in hidden space — now linearly separable. Drag any weight; watch the hidden-space points drift; watch the green dashed line lose its grip. This is the clearest statement of what “deep learning” actually is. The output layer is a linear classifier, same as lesson one. What changed is that it's classifying a learned representation of the data, not the raw data. The hidden layer is bodywork — reshape the frame, and suddenly a plain straight axle fits.
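That fold can be written down by hand. Here is a minimal sketch with hand-picked hidden weights; a trained network finds an equivalent fold, not necessarily this exact one:

```python
import numpy as np

# Hand-picked hidden weights (illustrative, not what training must find).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR corners
W = np.array([[ 1.0, -1.0],    # h1 = ReLU(x1 - x2)
              [-1.0,  1.0]])   # h2 = ReLU(x2 - x1)
H = np.maximum(0, X @ W.T)     # hidden space

# The two label-1 corners land at (0, 1) and (1, 0); both label-0 corners
# collapse onto (0, 0). Now the single line h1 + h2 = 0.5 separates them.
print(H)
print((H.sum(axis=1) > 0.5).astype(int))   # -> [0 1 1 0], matching XOR
```

Four points a single line could never split become splittable after one ReLU remap, which is the whole trick.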

Hidden layer (personified)
I don't classify anything. I transform. I take inputs and move them to new positions — rotating, shearing, folding, clipping — so the final output layer can draw a single line and call it a day. Every fancy architecture — convnets, transformers — is a specific flavour of me with extra chrome. I am learned feature extraction.

One last thing before the code. Drag the architecture sliders and watch the parameter count move. A few intuition checks:

  • Depth 1, width 64, input 10, output 1: 10·64 + 64 + 64·1 + 1 = 769 parameters. A go-kart.
  • Depth 3, width 128, input 256, output 10: jumps to ~67K — still small. A hatchback.
  • Depth 8, width 512, input 256: millions. This is where “big model” starts. You are no longer in the kit car section of the catalog.
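Those counts fall straight out of the widget's formula. A quick helper (the function names here are mine) makes them checkable:

```python
def mlp_params(sizes):
    """Dense-MLP parameter count: one in×out weight matrix plus an
    out-sized bias vector per layer."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

def mlp_flops(sizes):
    """Rough FLOPs for one forward pass: ~2 per weight (multiply + add)."""
    return sum(2 * i * o for i, o in zip(sizes, sizes[1:]))

print(mlp_params([10, 64, 1]))               # 769 — the go-kart
print(mlp_params([256, 128, 128, 128, 10]))  # 67210 — the hatchback
print(mlp_params([256] + [512] * 8 + [1]))   # ~2M — "big model" territory
```

Note how the hidden-to-hidden blocks dominate as soon as the width grows: parameters scale with the square of width but only linearly with depth.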
[interactive: architecture — params and FLOPs you can feel. params = Σ (inᵢ · outᵢ + outᵢ); FLOPs ≈ 2 · Σ inᵢ · outᵢ. Example: input dim 10 → three hidden layers of dim 64 → output dim 1 gives per-layer params of 704, 4.2K, 4.2K, and 65, for 9.1K params and 17.8K FLOPs per forward pass.]

Time to bolt the kit together. This is the most substantial code block in the lesson — read it carefully. Every line is either setting up the chassis, running the forward pass (intake → compression → ignition → exhaust prediction), or running the backward pass (exhaust fed back to tune the next firing). Four lessons of parts, all of them showing up here at once.

layer 2 — numpy · mlp.py (full, general-depth)
python
import numpy as np

class MLP:
    """A multi-layer perceptron: inputs → [hidden-ReLU] × L → output.
       Binary output (sigmoid head + BCE loss) for simplicity; easy to swap."""

    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.Ws, self.bs = [], []
        for i in range(len(sizes) - 1):
            fan_in = sizes[i]
            scale = np.sqrt(2 / fan_in)                  # He init
            self.Ws.append(rng.normal(0, scale, size=(sizes[i + 1], sizes[i])))
            self.bs.append(np.zeros(sizes[i + 1]))

    def forward(self, X):
        acts = [X]
        pres = [X]
        for l, (W, b) in enumerate(zip(self.Ws, self.bs)):
            z = acts[-1] @ W.T + b
            pres.append(z)
            if l < len(self.Ws) - 1:
                acts.append(np.maximum(0, z))            # ReLU hidden
            else:
                acts.append(1 / (1 + np.exp(-z)))         # sigmoid output
        return acts, pres

    def train_step(self, X, y, lr=0.1):
        acts, pres = self.forward(X)
        L = len(self.Ws)
        N = len(X)

        # Loss
        yhat = acts[-1].ravel()
        loss = -np.mean(y * np.log(np.clip(yhat, 1e-9, 1)) + (1 - y) * np.log(np.clip(1 - yhat, 1e-9, 1)))

        # Backward — the recurrence, written in matrix form.
        # δ at the output (sigmoid + BCE collapses to p - y)
        delta = (yhat - y).reshape(-1, 1) / N           # (N, out_dim)
        grads_W = [None] * L
        grads_b = [None] * L
        for l in range(L - 1, -1, -1):
            grads_W[l] = delta.T @ acts[l]               # (out, in)
            grads_b[l] = delta.sum(axis=0)
            if l > 0:
                # Push δ back through the weights, then mask by ReLU'
                delta = (delta @ self.Ws[l]) * (pres[l] > 0)

        # Update
        for l in range(L):
            self.Ws[l] -= lr * grads_W[l]
            self.bs[l] -= lr * grads_b[l]
        return loss

# Smoke test on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

mlp = MLP([2, 8, 1])
for step in range(3000):
    loss = mlp.train_step(X, y, lr=0.5)
    if step % 500 == 0:
        print(f"step {step}: loss={loss:.4f}")

probs = mlp.forward(X)[0][-1].ravel()
print("predictions:", np.round(probs, 2))   # -> close to [0, 1, 1, 0]

Read it once. Read it again. And listen — this is the moment the kit car starts. Four strokes, every time, in the order that matters. Forward pulls data through the layers. The loss line measures how wrong. The backward loop walks the chain rule in reverse, one matmul per layer, pushing blame back toward the weights that caused it. The update loop nudges every dial a little opposite to its blame. Four strokes per cycle, three thousand cycles, and XOR is solved. Rumelhart, Hinton, and Williams published this in 1986 and it fits in forty lines of NumPy. The compute has changed. The algorithm hasn't.

the two loops that matter

  • forward: for l: z = a @ W.T + b; a = f(z). Apply each layer in order, collecting pre-activations and post-activations for backward.
  • backward: for l in reverse: δ = (δ @ W) ⊙ f'(z). The chain rule, one matmul per layer; the parameter grads are δᵀ @ a, an outer product.

layer 3 — pytorch · mlp_pytorch.py
python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The entire MLP class from the NumPy layer — now just a few lines.
class MLP(nn.Module):
    def __init__(self, sizes):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
        )

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        return torch.sigmoid(self.layers[-1](x))

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = MLP([2, 8, 1])
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(3000):
    optimizer.zero_grad()
    yhat = model(X)
    loss = F.binary_cross_entropy(yhat, y)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss={loss.item():.4f}")

print("predictions:", torch.round(model(X).detach(), decimals=2).ravel())
stdout
step 0: loss=0.7148
step 500: loss=0.0086
predictions: [0.01, 0.99, 0.99, 0.01]
numpy → pytorch (what disappeared)

  • manual He-init in __init__ → nn.Linear defaults to Kaiming uniform. Same idea; PyTorch picks a reasonable default.
  • forward: hand-coded activation branching → for layer in self.layers[:-1]: x = F.relu(layer(x)). Cleaner loop, identical semantics.
  • backward: 20 lines of matmuls + deltas → loss.backward(), one line. Autograd runs the recurrence for you.

Gotchas

Shape bugs are the tax. More training runs die on a transposed matrix than on any real modeling problem. If X is (N, D) and W is (out, in), then X @ W.T + b gives you (N, out) — flip a transpose and NumPy will happily broadcast nonsense. Print shapes at every layer while you're debugging. It's not elegant, but neither is pulling a head gasket on the driveway.
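A thirty-second shape audit looks like this — the dimensions here are illustrative, not from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))   # (N, D) batch
W = rng.normal(size=(64, 10))   # (out, in) — the convention the MLP above uses
b = np.zeros(64)

z = X @ W.T + b                 # (N, out) is the only shape that's right
print(z.shape)                  # (32, 64)

# Flip the transpose and you either get a loud ValueError (X @ W: 10 vs 64)
# or, with square layers, silent nonsense that trains badly.
```

With square hidden layers both orientations matmul without error, which is exactly why printing shapes beats trusting that "it ran".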

Output activation matches the loss. Binary classification → sigmoid + BCE. Multi-class → softmax + cross-entropy. Regression → no output activation + MSE. Mixing them (sigmoid + MSE, softmax + BCE) sometimes “trains” but gives bad gradients. The clean p − y structure in backprop only falls out when the pairing is right.
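The "clean p − y structure" is easy to check numerically with a finite-difference probe on a single logit (the values 0.7 and 1.0 are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(z, y):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)   # d(loss)/dz
analytic = sigmoid(z) - y                                    # the p - y shortcut
print(numeric, analytic)   # both ≈ -0.3318

# Swap in MSE behind the sigmoid and the gradient picks up an extra
# p·(1 − p) factor, which shrinks toward zero near saturated outputs.
```

That extra p·(1 − p) factor is why the mismatched pairing "trains" but crawls: exactly where the model is most confidently wrong, the gradient is smallest.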

MLP output dim = number of classes, not 1. For 10-way MNIST classification, sizes = [784, 128, 10], not [784, 128, 1]. One output per class; softmax combines them into a distribution.
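Concretely, each of the 10 outputs is a raw score and softmax (from the earlier lesson) turns the vector into a distribution. A sketch with made-up logits:

```python
import numpy as np

# Ten logits, one per class — illustrative values.
logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, -1.0, 0.7, 0.1, -0.2, 0.4])
p = np.exp(logits - logits.max())   # subtract the max for numerical stability
p /= p.sum()

print(p.argmax())   # 3 — the class with the largest logit wins
print(p.sum())      # 1.0 — a proper distribution over all 10 classes
```

One sigmoid output can only say "class vs not-class"; ten softmaxed outputs say which class, with the scores competing against each other.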

Biases start at zero. Not random. If you randomly initialize biases the way you do weights, each neuron begins with a baked-in preference toward one class before it's seen a single input — a carburetor that runs rich from the factory. Zero lets the gradient pick the offset the data actually wants.
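The runs-rich effect shows up in one line: before the network has seen anything informative, a nonzero output bias already tilts the prediction (the 1.7 below is an arbitrary stand-in for a randomly drawn bias):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, np.sqrt(2 / 10), size=(1, 10))   # He-init weights
x = np.zeros(10)                                   # a blank, uninformative input
sigmoid = lambda z: 1 / (1 + np.exp(-z))

print(sigmoid(W @ x + 0.0))   # 0.5 — zero bias: no baked-in preference
print(sigmoid(W @ x + 1.7))   # ≈ 0.85 — nonzero bias: leans to class 1 already
```

Gradient descent can unlearn that head start, but it is wasted work the zero init never creates.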

Learning-rate scale is the throttle. The lr=0.5 on XOR works because XOR is a clean bowl. On real datasets 0.5 will blow the cylinder head off. Start at 1e-3 and climb until loss starts diverging, then back off by half. Same convergence math as the gradient-descent lesson — just with sharper teeth now that the loss surface has curvature.
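The divergence threshold is the same math as the gradient-descent lesson. On a 1-D quadratic you can watch the throttle limit exactly (the curvature here is made up for illustration):

```python
# f(w) = 2·w², gradient 4·w. Each step multiplies w by (1 - 4·lr),
# so gradient descent diverges once lr > 2/4 = 0.5.
for lr in [0.1, 0.4, 0.6]:
    w = 1.0
    for _ in range(50):
        w -= lr * 4 * w
    print(f"lr={lr}: |w| after 50 steps = {abs(w):.3g}")
# lr=0.1 and lr=0.4 shrink w toward the minimum; lr=0.6 blows up.
```

A real loss surface has a different, unknown curvature in every direction, which is why the safe move is to start small and climb rather than guess the threshold.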

Beat the moons dataset with as few parameters as possible

Using the NumPy MLP above, train on sklearn.datasets.make_moons(n_samples=500, noise=0.15). Your goal is > 97% accuracy on a held-out test set with the smallest total parameter count. Try [2, 4, 1] (17 params). Then [2, 8, 1] (33). Then [2, 4, 4, 1] (37 params). Which wins?

Bonus: plot the decision boundary. The deepest-narrowest network carves the most “creative” boundary; the wide-shallow network draws something more predictable. This is the depth-vs-width tradeoff showing up in a toy problem — and the reason every architecture diagram in the last decade has been a debate between tall and fat.

What to carry forward. An MLP is a stack of linear → nonlinearity → linear → … with one output head. Every hidden layer is a learned coordinate system the output head classifies against. Implementation is two loops: a forward loop that caches intermediates, a backward loop that applies the δ recurrence. Forty lines of NumPy. Four lines of PyTorch. The algorithm hasn't changed since 1986; what changed is the compute that runs it. You now own a kit car. It runs. The parts lesson is over.

Next up — Weight Initialization. Everything here quietly assumed a sensible starting point for the weights. Look at the scale = np.sqrt(2 / fan_in) line in __init__ — that's a load-bearing detail masquerading as a throwaway. It trains. But only if you start the weights right. Start them wrong and the car won't turn over no matter how many times you twist the key — activations collapse to zero, gradients vanish, the loss stays flat forever. Xavier and He init are the answers, and they're the last lesson before you leave this section.
