Linear Regression (Training)

Closed-form vs iterative — when each wins.

Medium
~15 min read
lesson 6 of 6

You have a dataset. You picked a linear model. Forward pass was one matrix multiply — plug x in, get ŷ out. Fine. But the weights inside that multiply are still whatever you initialized them to, which is to say: wrong. This lesson is about making them right.

Here's the thing almost no other algorithm in this curriculum gets to say. Linear regression has two doors. Door one: the shortcut. Solve a matrix equation, get the optimal weights, done. No iteration. No learning rate. No hyperparameters to babysit. Door two: the long walk. Start with weights, compute the gradient, step downhill, repeat a few dozen times. That's gradient descent, exactly as you built it.

Both doors lead to the same room. So why does this curriculum spend fourteen more sections teaching you the long walk? Because the shortcut has a catch, and the catch is the reason deep learning exists. We'll get there. First, the problem.

You have a ruler. Actually, you have infinitely many — every choice of (w, b) defines a different line through your data. You need a number that says which one is best, so you can rank them. The honest first try: square every prediction error, take the average. Big errors get penalised more than small ones, negatives can't cancel positives, the whole thing is differentiable. That number is the mean squared error:

mean squared error as a function of (w, b)
               1       N
L(w, b)   =   ───   Σ   (yᵢ  −  (w·xᵢ  +  b))²
               N    i=1

Each term is a square. Sum of squares in two variables is a convex quadratic — which, geometrically, is a bowl. One bottom. No local minima pretending to be the real one. No saddle points staging ambushes. Just a clean downhill from every direction to a single point. The point is the answer.
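The bowl claim can be checked numerically. The Hessian of the MSE is (2/N)·XᵀX, which is constant in (w, b) and positive semi-definite, so the surface curves upward (or stays flat) in every direction. A minimal sketch on hypothetical data (any points give the same conclusion, since the curvature depends only on X):

```python
import numpy as np

# Hypothetical data -- the bowl shape depends only on X, not on y.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
X = np.stack([x, np.ones_like(x)], axis=1)   # [x, 1] columns for (w, b)

# Hessian of L(w, b) = (1/N) * sum (y - (w*x + b))^2  is  (2/N) * X^T X.
H = (2 / len(x)) * X.T @ X
print(np.linalg.eigvalsh(H))                 # both eigenvalues >= 0: a bowl
```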

Click anywhere on the bowl to drop a marble. Hit release and gradient descent rolls it toward the purple pin — the optimum. Drop the marble in five different places and it finds the same pin every time. That's what convexity buys you.

[interactive widget: the MSE loss bowl in 3D, with a droppable marble and the optimum pinned]

Door one. The shortcut. Here's why a bowl lets you skip iteration entirely: the minimum of a smooth bowl is wherever the gradient vanishes. Set ∇L = 0, solve for the parameters, and you're standing at the bottom. For a quadratic loss that equation is linear in the parameters, which means it has a clean algebraic solution. You don't need to walk — you just read the coordinates off the page.

Stack your data into X ∈ ℝ^(N × D) (append a column of ones so the bias folds into the weights), targets into y ∈ ℝ^N, parameters into θ ∈ ℝ^D. The loss and its gradient:

MSE in matrix form
L(θ)   =   (1/N) · ||Xθ − y||²

∇L(θ)  =   (2/N) · Xᵀ (Xθ − y)
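If that gradient expression looks too tidy to trust, a finite-difference check settles it. A quick sketch on hypothetical random data (nothing depends on the particular numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                  # toy design matrix
y = rng.normal(size=8)
theta = rng.normal(size=2)
N = len(y)

L = lambda t: np.mean((X @ t - y) ** 2)
analytic = (2 / N) * X.T @ (X @ theta - y)

# central differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.max(np.abs(analytic - numeric)))    # tiny: the formula checks out
```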

Set that gradient to zero. Two lines of linear algebra and you have the exact optimum in closed form:

the normal equations
0    =   Xᵀ X θ*   −   Xᵀ y

Xᵀ X θ*   =   Xᵀ y

θ*   =   (Xᵀ X)⁻¹   Xᵀ y

That is not a step toward the answer. That is the answer. One matrix inversion, one multiply, the best weights your model can achieve on this data. No α, no convergence check, no loop. Statistics textbooks have taught this expression for over two centuries (Gauss claimed to have been using least squares as early as 1795), which means your great-great-great-grandfather could have trained linear regression, assuming he had a lot of graph paper.

closed-form OLS · ols_closed_form.py
python
import numpy as np

# Data — same 8 points as the widget.
x = np.array([-1.8, -1.2, -0.5, 0.1, 0.9, 1.4, 2.1, 2.6])
y = np.array([-2.1, -1.3, -0.4, 0.6, 1.7, 2.2, 3.5, 4.1])

# Prepend a column of 1s so bias folds into theta.
X = np.stack([x, np.ones_like(x)], axis=1)   # shape (N, 2)

# The normal equations — one line:
theta = np.linalg.inv(X.T @ X) @ X.T @ y      # (D, D) @ (D, N) @ (N,) = (D,)

print(f"theta = {np.round(theta, 4)}")        # [slope, intercept]

preds = X @ theta
loss = np.mean((preds - y) ** 2)
print(f"loss  = {loss:.4f}")
stdout
theta = [ 1.4857  0.2964]
loss  = 0.0823
Closed form (personified)
I am exact. I do not iterate. I give you the unique minimum in one step, and I am embarrassingly easy to write. I am also secretly O(D³), because inverting the XᵀX matrix costs a lot once D is big. Oh — and please don't ask me to invert badly-conditioned matrices. I may hand back numerical garbage and smile.
Gotchas

Never actually call np.linalg.inv in production. Use np.linalg.solve(X.T @ X, X.T @ y), or better still np.linalg.lstsq(X, y, rcond=None). solve is faster and more stable than forming the inverse; lstsq is the most robust of the three. Computing the inverse as a standalone step is almost always wrong — you want the solution, not the inverse.

XᵀX might not be invertible. If you have fewer data points than features, or redundant features, the matrix is singular and the inverse doesn't exist. lstsq handles that gracefully; naive inv raises an error at best and hands back garbage at worst.
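A minimal sketch of the singular case, with hypothetical toy data: duplicate a feature and X loses rank, so XᵀX has no inverse; lstsq still returns a sensible answer (the minimum-norm solution, which splits the weight evenly across the duplicated columns).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.stack([x, x, np.ones_like(x)], axis=1)    # duplicated feature: rank 2, not 3
y = np.array([1.1, 2.0, 2.9, 4.2])

print(np.linalg.matrix_rank(X))                  # 2 -- X^T X is singular
theta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta, 4))                        # the two duplicate columns
                                                 # share the slope equally
```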

Door two. The long walk. Same dataset, same loss, same destination — reached one foot at a time. Start the weights at zero. Compute the gradient. Take a small step opposite to it. Repeat. You already know every line of this from the gradient descent lesson; the only new thing is that the loss surface is now the MSE bowl you just met.

Drag α or press play. The left panel shows the fitted line sweeping toward the dashed optimum. The right panel shows the loss dropping toward L* — the minimum loss the shortcut already computed. Both doors, one room.

[interactive widget: gradient descent finds what closed form already knows. Controls: α, play/step. Left panel: data and the current fitted line vs the dashed OLS optimum. Right panel: loss over steps, with live readouts of the loss and the gap to the optimum.]

Both doors opened. Both led to the same weights. If the shortcut gives you the exact answer in one line, why would anyone walk?

Because the doors do not cost the same. The shortcut computes Xᵀ X (which is O(N · D²)) and then inverts a D × D matrix (which is O(D³)). That is the fine print on the elegant expression — for a few dozen features it's a rounding error, for a few thousand it's your afternoon, for a million it's geological time. The long walk has different arithmetic: a fixed number of iterations, each one O(N · D). Linear in D, not cubic. Scale D and the regimes flip.
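The crossover is easy to estimate with back-of-envelope flop counts. A sketch, with constants and memory traffic ignored and 100 GD iterations assumed:

```python
def closed_form_flops(N, D):
    return N * D**2 + D**3        # form X^T X, then factor/invert it

def gd_flops(N, D, iters=100):
    return iters * N * D          # one X @ theta and one X^T @ r per step

for N, D in [(100, 5), (10_000, 500), (1_000_000, 10_000)]:
    cf, gd = closed_form_flops(N, D), gd_flops(N, D)
    winner = "closed form" if cf < gd else "gradient descent"
    print(f"N={N:>9,}  D={D:>6,}  closed {cf:.1e}  gd {gd:.1e}  -> {winner}")
```

At the stats-class scale the cubic term is invisible; grow D and it dominates everything.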

[interactive widget: the regimes — when each algorithm wins. Sliders for N and D; assumes 100-iteration GD and a 1 GFLOP/s budget. At the default small-D setting (closed form ~100 µs vs GD ~1 ms) closed form is both fastest and exact: the regime your intro stats course lives in.]

Push the sliders. A stats-class dataset — 100 rows, 5 features — and the shortcut wins by five orders of magnitude. It's essentially free. A recommender system with a million users and ten thousand features, and the shortcut is infeasible: inverting a 10k×10k matrix takes hours of CPU time and enough memory to matter. Same algorithm, same math, just way off the edge of what the regime can absorb.

And now the real punchline. Everything above assumed the loss was a bowl. That assumption is load-bearing — setting ∇L = 0 only gives you a clean, closed-form answer because the equation you get is linear. For any model more complicated than a linear regressor, that stops being true. Add a sigmoid. Add a second layer. Add a transformer on top. The loss is no longer quadratic in the parameters, ∇L = 0 no longer has an algebraic solution, and the landscape is no longer a bowl — it's a mountain range with ravines and saddle points, the one you saw bully gradient descent around in the last lesson.
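You can watch the bowl break in a few lines. For the plain linear model the curvature of the loss in w is the same everywhere, which is the signature of a quadratic; push the prediction through a sigmoid and the curvature varies with w and can even go negative. A sketch with hypothetical toy numbers:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.1, 0.3, 0.7, 0.9])

mse_linear  = lambda w: np.mean((y - w * x) ** 2)           # quadratic in w
mse_sigmoid = lambda w: np.mean((y - sigmoid(w * x)) ** 2)  # not quadratic

def curvature(L, w, h=1e-4):                 # numerical second derivative
    return (L(w + h) - 2 * L(w) + L(w - h)) / h**2

for w in [0.0, 1.0, 2.0]:
    print(f"w={w:.0f}  linear: {curvature(mse_linear, w):6.3f}  "
          f"sigmoid: {curvature(mse_sigmoid, w):6.3f}")
```

The linear column prints the same number three times; the sigmoid column does not. Constant curvature is exactly what makes ∇L = 0 a linear equation, and it is the first thing a nonlinearity destroys.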

So the shortcut isn't just expensive outside linear regression. It disappears. Every neural network you've ever heard of is trained by the long walk — not because it's clever, but because it's the only door left. Linear regression is the farewell tour for door one. Say goodbye nicely.

Three implementations of the long walk on linear regression. Same problem you just watched in the widget, now on the page — in the three voices you're using for every algorithm in this series.

layer 1 — pure python · linear_gd_scratch.py
python
def mse(theta, data):
    loss = 0.0
    for x, y in data:
        yhat = theta[0] * x + theta[1]       # w*x + b
        loss += (y - yhat) ** 2
    return loss / len(data)

def linear_gd(data, lr=0.1, steps=50):
    theta = [0.0, 0.0]                       # start at (w=0, b=0)
    for step in range(steps + 1):
        gw = 0.0; gb = 0.0
        for x, y in data:
            r = y - (theta[0] * x + theta[1])
            gw += -2 * x * r
            gb += -2 * r
        gw /= len(data); gb /= len(data)
        if step % 10 == 0:
            print(f"step {step:2d}: L = {mse(theta, data):.4f}")
        theta[0] -= lr * gw
        theta[1] -= lr * gb
    return theta

data = [(-1.8, -2.1), (-1.2, -1.3), (-0.5, -0.4), (0.1, 0.6),
        (0.9, 1.7), (1.4, 2.2), (2.1, 3.5), (2.6, 4.1)]
theta = linear_gd(data)
print(f"theta = {[round(v, 3) for v in theta]}")
stdout
step  0: L = 7.4225
step 10: L = 0.6181
step 50: L = 0.0827
theta = [1.483, 0.296]
layer 2 — numpy · linear_gd_numpy.py
python
import numpy as np

def linear_gd(X, y, lr=0.1, steps=50):
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(steps):
        residual = X @ theta - y              # (N,)
        grad = (2 / N) * X.T @ residual       # (D,)
        theta -= lr * grad
    return theta

x = np.array([-1.8, -1.2, -0.5, 0.1, 0.9, 1.4, 2.1, 2.6])
y = np.array([-2.1, -1.3, -0.4, 0.6, 1.7, 2.2, 3.5, 4.1])
X = np.stack([x, np.ones_like(x)], axis=1)

theta = linear_gd(X, y)
print(f"theta = {np.round(theta, 4)}")
# -> [1.4857 0.2963]  — matches closed form to four decimals after 50 steps
pure python → numpy

for x, y in data: gw += -2*x*(y - (w*x+b))  ←→  grad = (2/N) * X.T @ (X @ theta - y)
    the single matrix form for both weights and bias

theta[0] -= lr*gw ; theta[1] -= lr*gb  ←→  theta -= lr * grad
    vectorised update — scales to any D

layer 3 — pytorch · linear_gd_pytorch.py
python
import torch
import torch.nn as nn
import torch.optim as optim

x = torch.tensor([-1.8, -1.2, -0.5, 0.1, 0.9, 1.4, 2.1, 2.6]).unsqueeze(-1)
y = torch.tensor([-2.1, -1.3, -0.4, 0.6, 1.7, 2.2, 3.5, 4.1]).unsqueeze(-1)

model = nn.Linear(1, 1)                           # w and b, learnable
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for step in range(51):
    optimizer.zero_grad()
    yhat = model(x)                               # forward
    loss = criterion(yhat, y)                     # MSE
    loss.backward()                               # autograd — no hand-rolled gradient
    optimizer.step()                              # θ ← θ - α·∇L
    if step % 10 == 0:
        print(f"step {step}: loss = {loss.item():.4f}")

print(model.weight)
print(model.bias)
stdout
step 10: loss = 0.1487
step 50: loss = 0.0822
Parameter containing:
tensor([[1.4855]], requires_grad=True)
Parameter containing:
tensor([0.2964], requires_grad=True)
numpy → pytorch

grad = (2/N) * X.T @ (X@theta - y)  ←→  loss.backward()
    autograd computes the same thing from the MSELoss expression

theta -= lr * grad  ←→  optimizer.step()
    same update, packaged — swap SGD for Adam to upgrade the optimizer

np.stack([x, ones], axis=1)  ←→  nn.Linear(1, 1, bias=True)
    bias is a first-class parameter, no manual 1-column needed

Beat closed form with gradient descent

Generate a synthetic linear dataset with N = 10,000 examples and D = 500 features. Solve it two ways: (1) np.linalg.lstsq and (2) a gradient-descent loop for 100 steps. Time both: the closed form pays its O(N·D² + D³) bill up front, and the gap widens fast as you push D higher. Your well-tuned GD should land within about 1e-6 of the same solution.

Bonus: now set N = 10,000, D = 50,000 — still linear regression, but with more features than examples. The naive closed form is hopeless: XᵀX is a 50,000 × 50,000 rank-deficient matrix (tens of gigabytes before you even try to invert it). Gradient descent still converges, albeit to one of infinitely many equally good minima. You have just re-derived the modern over-parameterised regime in twelve lines of NumPy.
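A shrunken version of that bonus as a sketch, with sizes scaled down so it runs instantly and hypothetical random data: when D > N, lstsq returns the minimum-norm interpolating solution, and gradient descent started from zero converges to the very same point, because every gradient step stays inside the row space of X.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 50                                # more features than examples
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # min-norm solution

theta = np.zeros(D)                          # zero init keeps theta in row space
for _ in range(5000):
    theta -= 0.05 * (2 / N) * X.T @ (X @ theta - y)

print(np.max(np.abs(theta - theta_lstsq)))   # tiny: same minimum
print(np.max(np.abs(X @ theta - y)))         # tiny: perfect interpolation
```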

What to carry forward. Linear regression is the one model in this course where both doors are open. The shortcut — θ* = (XᵀX)⁻¹ Xᵀy — exists because the loss is a convex quadratic and ∇L = 0 has a clean algebraic solution. It costs O(N·D² + D³), which is cheap for small D and catastrophic for large D. The long walk costs O(iters·N·D), wins as soon as features outnumber a few thousand, and — more importantly — still works when the loss isn't a bowl. Every deep model you'll meet from here on has a non-convex loss, which means the shortcut is gone and the long walk is all that's left.

Next up — Single Neuron. The shortcut's gone. From here on, you walk. Before you can walk through a deep network, you need the smallest non-trivial thing that sits inside one: a single neuron. A weighted sum, a nonlinearity, a prediction — the unit of computation that everything from an MLP to a transformer stacks, by the million, into a model. And to train a stack of them, you'll need to know how the gradient flows backwards through each layer. That's the next two lessons.

References