Linear Regression (Forward)

Predictions as matrix multiplications.

Easy · ~15 min read · lesson 5 of 6

You have a scatter of points on a plane. A constellation, if you squint. Each point is a house — an (x, y) pair — where x is the square footage and y is the price someone paid for it. There is no formula. No rule. Just twenty dots where twenty houses happened.

Now pick up a ruler. Lay it across the cloud. Wiggle it until it sits more-or-less through the middle of the points. You've just done linear regression in your hands — a model whose entire being is a straight edge, with a slope and a vertical offset, and whose prediction for any new x is simply: look at the ruler above that x and read off the y.

That's the whole thing. The forward pass of a linear regression is laying the ruler down and reading the y values. Training — which comes next lesson — is figuring out which ruler to pick in the first place. This lesson stays inside the read. We'll go from one point at a time to all the points at once, and by the end the same gesture will scale to every dense layer in every neural network ever shipped.

Two numbers fully describe a ruler on a plane. The slope — how steeply it tilts — and where it crosses the y-axis. Those two numbers are the entire model.

univariate linear regression
ŷ   =   w · x   +   b

  • x  — the input (one feature)
  • w  — the weight: the ruler's slope
  • b  — the bias: where the ruler crosses the y-axis
  • ŷ  — the prediction: the y your ruler reads off at x

w and b are the parameters — the two dials from the previous lesson, now wearing geometry costumes. Twist w and the ruler tilts. Slide b and it lifts or drops. Every straight ruler you could ever lay across this plane corresponds to exactly one (w, b) pair. Picking a model is picking a point in this two-dimensional space of rulers.

Below is a scatter of twenty noisy points from an underlying linear relationship. Drag w and b — watch the ruler pivot and slide — and try to shrink the MSE readout by hand. The residual sticks are how much each point disagrees with the ruler's reading.

linear regression — fit by eye
ŷ = w·x + b
residual sticks turn rose when |y − ŷ| > 2
MSE 3.317 · n = 20
Linear model (personified)
I'm a ruler. I can't curve. You cannot make me fit a parabola, a sine wave, or a smile. But I am very good at being wrong in predictable, diagnosable ways, and my two knobs each mean something: w_sqft is literally "dollars per square foot." I am what every intro stats class teaches and what every ML engineer still reaches for when they need a sanity-check baseline before anything fancier.

Fine — we have a ruler. How do we actually read it? The obvious way: walk along the x-axis, point by point, and for each one compute w · x + b. Five houses, five multiplications, five additions. Written as a loop, it's the thing you already know how to write.

naive forward — one point at a time · one_at_a_time.py
python
# A single feature: square footage (in units of 1000). Five houses.
X = [1.2, 2.1, 3.0, 3.9, 4.8]
w = 100.0        # dollars per (1000 sqft)
b = 20.0         # base price

preds = []
for x in X:
    yhat = w * x + b         # read the ruler above x
    preds.append(yhat)

print("ŷ =", preds)
stdout
ŷ = [140.0, 230.0, 320.0, 410.0, 500.0]

No magic. Python walks the list, does the arithmetic, yields five numbers. This works for five houses. It also works for a million houses, if you're patient enough to wait out a Python loop. You will not be.

Real problems have more than one feature. A house has square footage, bedrooms, age, neighbourhood, lot size. The ruler generalises: one slope per feature, still one offset. The arithmetic is the same, just with more terms.

multivariate linear regression
ŷ   =   w₁·x₁  +  w₂·x₂  +  ...  +  w_D·x_D   +   b

     =   Σᵢ wᵢ·xᵢ   +   b          D features total

With D features the ruler doesn't fit on a whiteboard anymore — it's a flat surface (a hyperplane) sitting inside a D+1-dimensional space. You can't draw it and you wouldn't want to. But the operation is mechanically identical: multiply each feature by its weight, add them up, add the bias.

Here's the reframing that earns the lesson. Stack the inputs into a vector x and the weights into a vector w, and that whole sum w₁x₁ + w₂x₂ + … + w_Dx_D is just a dot product — one of the most common operations in all of numerical computing, and one that modern hardware will murder in its sleep.
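The identity is worth checking with real numbers once. Below is a minimal sketch (feature values made up for illustration) showing that the written-out sum and NumPy's dot product are the same operation:

```python
import numpy as np

# One house, three features: sqft/1000, bedrooms, age (illustrative values).
x = [1.4, 2.0, 10.0]
w = [100.0, 25.0, -2.0]
b = 20.0

# The sum written out term by term...
manual = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# ...is exactly a dot product plus a bias.
vectorized = np.dot(w, x) + b

print(manual, vectorized)   # both ≈ 190.0, up to float rounding
```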

Now do it for every house at once. Stack all N examples into a matrix X of shape (N × D) — each row is a house, each column a feature — and one matrix-vector product reads the whole constellation through the ruler in a single stroke:

batched linear regression — the matrix form
                 X         ·      w       +    b    =      ŷ
              (N × D)         (D,)         scalar      (N,)

ŷ₁   =   w₁·X₁₁  +  w₂·X₁₂  +  ...  +  w_D·X₁D   +   b
ŷ₂   =   w₁·X₂₁  +  w₂·X₂₂  +  ...  +  w_D·X₂D   +   b
...
ŷ_N  =   w₁·X_N1 +  w₂·X_N2 +  ...  +  w_D·X_ND  +   b

This is what the rest of the field means by “vectorizing the forward pass.” Instead of laying the ruler down and reading each point one at a time, you press the ruler through the entire scatter in one motion — every prediction falls out simultaneously. The Python loop dies. The arithmetic happens in compiled code on contiguous memory, and on a GPU that matrix multiply becomes the single most optimized operation on the machine.

Hover any row of X in the widget below. The bottom panel shows exactly the dot product that produces that row's prediction. This is a toy housing dataset — three features (size, bedrooms, age), five houses, weights hand-picked so the numbers land somewhere believable.

the forward pass — a matrix-vector product
ŷ = X · w + b        shapes: (5 × 3) · (3,) = (5,)

X — inputs
sqft (×1k)   bedrooms   age (yr)
   1.4          2.0       10.0
   2.1          3.0        5.0
   3.2          4.0        2.0
   1.8          2.0       20.0
   2.6          3.0       15.0

w — weights (one per feature):  [+100, +25, -2]
b — bias:  20
ŷ — prices ($k):  [190.0, 295.0, 436.0, 210.0, 325.0]

row 0 · the dot-product that makes ŷ₀
ŷ₀ = 1.4·100 + 2.0·25 + 10.0·(-2) + 20 = 190.0

Three implementations, same prediction, each shorter than the last. The abstractions keep changing; the operation never does.

layer 1 — pure python · linear_forward_scratch.py
python
def linear_forward(X, w, b):
    preds = []
    for row in X:
        # The dot product, written out with no abstractions.
        yhat = sum(row[i] * w[i] for i in range(len(w))) + b
        preds.append(yhat)
    return preds

# 5 houses · 3 features (sqft/1000, bedrooms, age)
X = [
    [1.4, 2, 10],
    [2.1, 3, 5],
    [3.2, 4, 2],
    [1.8, 2, 20],
    [2.6, 3, 15],
]
w = [100, 25, -2]      # dollars-per-sqft, per-bedroom, per-year-old
b = 20                 # base price

preds = linear_forward(X, w, b)
print("ŷ =", [round(p, 1) for p in preds])
stdout
ŷ = [190.0, 295.0, 436.0, 210.0, 325.0]

This is the definition made executable. One Python loop over houses, one dot product inside, one addition for the bias. If you squint, you can see the ruler sliding across the x-axis, pausing at each house, reading the y.

Now the upgrade. NumPy lets you write the dot-product-over-batch as a single operator, and the inner arithmetic runs in compiled C over whole arrays. Same computation. No Python loop.

layer 2 — numpy · linear_forward_numpy.py
python
import numpy as np

def linear_forward(X, w, b):
    return X @ w + b               # one line, arbitrary batch, arbitrary feature count

X = np.array([
    [1.4, 2, 10],
    [2.1, 3, 5],
    [3.2, 4, 2],
    [1.8, 2, 20],
    [2.6, 3, 15],
])
w = np.array([100.0, 25.0, -2.0])
b = 20.0

print(linear_forward(X, w, b))
stdout
[190. 295. 436. 210. 325.]
pure python → numpy

  sum(row[i] * w[i] for i in range(D))   ←→   X @ w   # or np.dot(X, w)
      one operator for N×D @ D → N predictions

  for row in X: ...                      ←→   batch is implicit
      numpy does the loop over N in compiled C
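To make "compiled C" concrete, here's a rough micro-benchmark sketch. The exact speedup depends on your machine, array sizes, and BLAS build, so treat the timings as indicative only; N and D here are arbitrary:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, D = 20_000, 10
X = rng.normal(size=(N, D))
w = rng.normal(size=D)
b = 0.5

# Pure-Python loop over rows, dot product written out inside.
t0 = time.perf_counter()
preds_loop = [sum(row[i] * w[i] for i in range(D)) + b for row in X]
t_loop = time.perf_counter() - t0

# One vectorized matrix-vector product.
t0 = time.perf_counter()
preds_vec = X @ w + b
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.4f}s   vectorized: {t_vec:.5f}s")
assert np.allclose(preds_loop, preds_vec)   # same numbers, very different cost
```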

The + b at the end is secretly doing something: the left-hand side is an N-vector and b is a scalar, so NumPy broadcasts b across every element. That's fine, expected, and also a rich source of bugs — more on that in the Gotcha section.
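Here is one concrete way that broadcasting bites — a sketch of the classic shape bug, not something from the lesson's own code. If your targets arrive as an (N, 1) column (common when sliced from a 2-D array or loaded from a file), subtracting the (N,) predictions silently broadcasts to an (N, N) matrix instead of N residuals:

```python
import numpy as np

X = np.array([[1.4, 2, 10],
              [2.1, 3, 5],
              [3.2, 4, 2]], dtype=float)
w = np.array([100.0, 25.0, -2.0])
b = 20.0

yhat = X @ w + b                               # shape (3,)
y_col = np.array([[190.0], [295.0], [436.0]])  # shape (3, 1) — a column!

residuals = y_col - yhat     # (3, 1) broadcasts against (3,) -> (3, 3)!
print(residuals.shape)       # (3, 3), not the (3,) you wanted

# The fix: flatten (or reshape) so both sides are (3,).
residuals_ok = y_col.ravel() - yhat
print(residuals_ok.shape)    # (3,)
```

No error is raised anywhere; the bug usually surfaces later as a suspiciously huge loss.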

Real neural networks chain thousands of operations, and writing out the gradient of each one by hand is a job for a person with very few deadlines. PyTorch replaces the manual bookkeeping with autograd. The forward pass looks the same — the framework just also remembers what it did, so it can differentiate later.

layer 3 — pytorch · linear_forward_pytorch.py
python
import torch
import torch.nn as nn

# nn.Linear(in_features=3, out_features=1) IS linear regression.
# It owns a weight matrix (1, 3) and a bias vector (1,), both learnable.
model = nn.Linear(in_features=3, out_features=1, bias=True)

# Load our hand-picked weights directly for the demo.
with torch.no_grad():
    model.weight.copy_(torch.tensor([[100.0, 25.0, -2.0]]))
    model.bias.copy_(torch.tensor([20.0]))

X = torch.tensor([
    [1.4, 2, 10],
    [2.1, 3, 5],
    [3.2, 4, 2],
    [1.8, 2, 20],
    [2.6, 3, 15],
])

preds = model(X).squeeze(-1)
print("model's prediction:", preds)
stdout
model's prediction: tensor([190.0000, 295.0000, 436.0000, 210.0000, 325.0000],
       grad_fn=<SqueezeBackward1>)
numpy → pytorch

  X @ w + b                                  ←→   nn.Linear(in_features=D, out_features=1)
      packaged with learnable params + autograd

  w = np.array([...])  # you own the array   ←→   model.weight   # learned during training
      the Module keeps its parameters for you

  one regressor                              ←→   nn.Linear(D, K)   # K regressors in parallel
      scale up by asking for more output features
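The nn.Linear(D, K) claim above is easy to verify with shapes alone. This sketch uses random inputs and untrained (random) weights; only the shapes matter here:

```python
import torch
import torch.nn as nn

X = torch.randn(5, 3)       # 5 houses, 3 features

one = nn.Linear(3, 1)       # one regressor:    (5, 3) -> (5, 1)
many = nn.Linear(3, 4)      # four in parallel: (5, 3) -> (5, 4)

print(one(X).shape)         # torch.Size([5, 1])
print(many(X).shape)        # torch.Size([5, 4])
print(many.weight.shape)    # torch.Size([4, 3]) — one weight row per regressor
```

Each of the four output columns is its own X @ w + b with its own weight row and bias entry; PyTorch just stacks them into a single matrix multiply.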

Two traps lie in wait for everyone writing this for the first time. Both are avoidable; both will eat an afternoon if they aren't.

Manual regression, one feature

Grab a tiny dataset by hand — say, [(0, 1.1), (1, 2.9), (2, 5.1), (3, 7.0), (4, 9.1)]. Eyeball it: the relationship is roughly y = 2x + 1. Write the pure-Python linear_forward, try w = 2.0, b = 1.0, compute the predictions, and print the residuals. Can you tweak w and b by hand to drive the MSE below 0.02?

You're doing by hand what next lesson's algorithm does mechanically — nudging the ruler until the sticks shrink.
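If you want a starting point to check your arithmetic against, here's one possible setup in pure Python (a sketch, not the only answer):

```python
# The tiny dataset from the exercise: (x, y) pairs.
data = [(0, 1.1), (1, 2.9), (2, 5.1), (3, 7.0), (4, 9.1)]
w, b = 2.0, 1.0              # the eyeballed ruler

preds = [w * x + b for x, _ in data]
residuals = [y - yhat for (_, y), yhat in zip(data, preds)]
mse = sum(r * r for r in residuals) / len(data)

print("residuals:", [round(r, 2) for r in residuals])
print("MSE:", round(mse, 4))
```

With these values the eyeballed ruler already lands around an MSE of 0.008 — see whether small nudges to w and b can push it lower still, and notice how quickly the residual sticks push back.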

What to carry forward. A linear regression is a ruler through a point cloud; the forward pass is the ruler's reading at each x. Numerically, one prediction is a dot product of the feature vector with a weight vector, plus a bias. Batched, the whole dataset becomes one matrix multiply: X @ w + b, shapes (N × D) · (D,) = (N,). Every dense layer in every neural network is this operation at larger scale, with a non-linearity on top.
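That last claim about dense layers can be written down directly: swap the weight vector for a weight matrix, then apply an elementwise non-linearity. A NumPy sketch with random weights (the shapes are the point, not the numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 3, 4

X = rng.normal(size=(N, D))   # batch of inputs
W = rng.normal(size=(D, K))   # K weight vectors, stacked as columns
b = rng.normal(size=K)        # one bias per output unit

# Linear regression, K times in parallel...
Z = X @ W + b                 # (N, D) @ (D, K) + (K,) -> (N, K)

# ...then the non-linearity that makes it a neural-network layer.
A = np.maximum(Z, 0)          # ReLU

print(Z.shape, A.shape)       # (5, 4) (5, 4)
```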

Next up — Linear Regression (Training). You can lay a ruler down. But which ruler is the best one? How do you find it? There are two answers. One is a closed-form formula from pure linear algebra that solves the problem exactly in a single matrix inversion — elegant, but it gives up the moment your dataset stops fitting in memory. The other is gradient descent on the MSE, the same algorithm you already know, now wearing a new loss. One works everywhere the other doesn't, and the contrast is the entire reason the rest of deep learning looks the way it does.
