Linear Regression (Forward)
Predictions as matrix multiplications.
You have a scatter of points on a plane. A constellation, if you squint. Each point is a house — an (x, y) pair — where x is the square footage and y is the price someone paid for it. There is no formula. No rule. Just twenty dots where twenty houses happened.
Now pick up a ruler. Lay it across the cloud. Wiggle it until it sits more-or-less through the middle of the points. You've just done linear regression in your hands — a model whose entire being is a straight edge, with a slope and a vertical offset, and whose prediction for any new x is simply: look at the ruler above that x and read off the y.
That's the whole thing. The forward pass of a linear regression is laying the ruler down and reading the y values. Training — which comes next lesson — is figuring out which ruler to pick in the first place. This lesson stays inside the read. We'll go from one point at a time to all the points at once, and by the end the same gesture will scale to every dense layer in every neural network ever shipped.
Two numbers fully describe a ruler on a plane. The slope — how steeply it tilts — and where it crosses the y-axis. Those two numbers are the entire model.
ŷ = w · x + b

- x — the input (one feature)
- w — the weight: the ruler's slope
- b — the bias: where the ruler crosses the y-axis
- ŷ — the prediction: the y your ruler reads off at x
w and b are the parameters — the two dials from the previous lesson, now wearing geometry costumes. Twist w and the ruler tilts. Slide b and it lifts or drops. Every straight ruler you could ever lay across this plane corresponds to exactly one (w, b) pair. Picking a model is picking a point in this two-dimensional space of rulers.
Below is a scatter of twenty noisy points from an underlying linear relationship. Drag w and b — watch the ruler pivot and slide — and try to shrink the MSE readout by hand. The residual sticks are how much each point disagrees with the ruler's reading.
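If you don't have the widget handy, the same readout can be sketched in a few lines of Python. This is synthetic data — the underlying line y = 3x + 5 and the noise level are made up for illustration — but the residual-and-MSE computation is exactly what the widget displays:

```python
import random

random.seed(0)

# Twenty noisy points from an underlying line (made up: y = 3x + 5 + noise).
xs = [random.uniform(0, 10) for _ in range(20)]
ys = [3 * x + 5 + random.gauss(0, 1) for x in xs]

def mse(w, b):
    # Residual = how much each point disagrees with the ruler's reading.
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals) / len(residuals)

# Try a few rulers by hand; the true (3, 5) should score near the noise floor.
for w, b in [(1.0, 0.0), (3.5, 2.0), (3.0, 5.0)]:
    print(f"w={w:.1f}, b={b:.1f}  ->  MSE={mse(w, b):.3f}")
```

Hand-tuning (w, b) against this printout is the manual version of what the widget's sliders do.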
I'm a ruler. I can't curve. You cannot make me fit a parabola, a sine wave, or a smile. But I am very good at being wrong in predictable, diagnosable ways — and my two knobs each mean something: w is literally "dollars per square foot," and b is the base price before any square footage is counted. I am what every intro stats class teaches and what every ML engineer still reaches for when they need a sanity-check baseline before anything fancier.
Fine — we have a ruler. How do we actually read it? The obvious way: walk along the x-axis, point by point, and for each one compute w · x + b. Five houses, five multiplications, five additions. Written as a loop, it's the thing you already know how to write.
```python
# A single feature: square footage (in units of 1000). Five houses.
X = [1.2, 2.1, 3.0, 3.9, 4.8]
w = 100.0  # dollars per (1000 sqft)
b = 20.0   # base price

preds = []
for x in X:
    yhat = w * x + b  # read the ruler above x
    preds.append(yhat)

print("ŷ =", preds)
```

Output: `ŷ = [140.0, 230.0, 320.0, 410.0, 500.0]`
No magic. Python walks the list, does the arithmetic, yields five numbers. This works for five houses. It also works for a million houses, if you're patient enough to wait out a Python loop. You will not be.
Real problems have more than one feature. A house has square footage, bedrooms, age, neighbourhood, lot size. The ruler generalises: one slope per feature, still one offset. The arithmetic is the same, just with more terms.
ŷ = w₁·x₁ + w₂·x₂ + … + w_D·x_D + b
  = Σᵢ wᵢ·xᵢ + b        (D features total)

With D features the ruler doesn't fit on a whiteboard anymore — it's a flat surface (a hyperplane) sitting inside a D+1-dimensional space. You can't draw it and you wouldn't want to. But the operation is mechanically identical: multiply each feature by its weight, add them up, add the bias.
Here's the reframing that earns the lesson. Stack the inputs into a vector x and the weights into a vector w, and that whole sum w₁x₁ + w₂x₂ + … + w_Dx_D is just a dot product — one of the most common operations in all of numerical computing, and one that modern hardware will murder in its sleep.
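As a quick check that the reframing changes nothing numerically, here's the first house from the lesson's toy dataset pushed through both spellings of the same dot product:

```python
import numpy as np

x = np.array([1.4, 2.0, 10.0])    # one house's features: sqft/1000, bedrooms, age
w = np.array([100.0, 25.0, -2.0]) # one slope per feature
b = 20.0                          # base price

# The sum, written out by hand.
by_hand = sum(x[i] * w[i] for i in range(len(w))) + b

# The same arithmetic as a single library call.
by_dot = np.dot(x, w) + b

print(by_hand, by_dot)  # 190.0 190.0
```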
Now do it for every house at once. Stack all N examples into a matrix X of shape (N × D) — each row is a house, each column a feature — and one matrix-vector product reads the whole constellation through the ruler in a single stroke:
X · w + b = ŷ
(N × D) · (D,) + scalar → (N,)

ŷ₀ = w₁·X₀,₁ + w₂·X₀,₂ + … + w_D·X₀,D + b
ŷ₁ = w₁·X₁,₁ + w₂·X₁,₂ + … + w_D·X₁,D + b
⋮
ŷ_{N−1} = w₁·X_{N−1,1} + w₂·X_{N−1,2} + … + w_D·X_{N−1,D} + b

This is what the rest of the field means by "vectorizing the forward pass." Instead of laying the ruler down and reading each point one at a time, you press the ruler through the entire scatter in one motion — every prediction falls out simultaneously. The Python loop dies. The arithmetic happens in compiled code on contiguous memory, and on a GPU that matrix multiply becomes the single most optimized operation on the machine.
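One way to internalize the shape bookkeeping is to assert it. A small sketch with random data (N and D chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 7                 # arbitrary batch and feature counts

X = rng.normal(size=(N, D))    # one row per example
w = rng.normal(size=D)         # one slope per feature
b = 0.5                        # one scalar offset

yhat = X @ w + b               # the entire forward pass
assert yhat.shape == (N,)      # one prediction per row

# The matrix product is exactly N stacked dot products:
assert np.allclose(yhat, [X[i] @ w + b for i in range(N)])
```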
Hover any row of X in the widget below. The bottom panel shows exactly the dot product that produces that row's prediction. This is a toy housing dataset — three features (size, bedrooms, age), five houses, weights hand-picked so the numbers land somewhere believable.
With the hand-picked weights w = (+100, +25, −2) and bias b = +20:

| sqft (×1k) | bedrooms | age (yr) | ŷ ($k) |
|---|---|---|---|
| 1.4 | 2.0 | 10.0 | 190.0 |
| 2.1 | 3.0 | 5.0 | 295.0 |
| 3.2 | 4.0 | 2.0 | 436.0 |
| 1.8 | 2.0 | 20.0 | 210.0 |
| 2.6 | 3.0 | 15.0 | 325.0 |
Three implementations, same prediction, each shorter than the last. The abstractions keep changing; the operation never does.
```python
def linear_forward(X, w, b):
    preds = []
    for row in X:
        # The dot product, written out with no abstractions.
        yhat = sum(row[i] * w[i] for i in range(len(w))) + b
        preds.append(yhat)
    return preds

# 5 houses · 3 features (sqft/1000, bedrooms, age)
X = [
    [1.4, 2, 10],
    [2.1, 3, 5],
    [3.2, 4, 2],
    [1.8, 2, 20],
    [2.6, 3, 15],
]
w = [100, 25, -2]  # dollars-per-sqft, per-bedroom, per-year-old
b = 20             # base price

preds = linear_forward(X, w, b)
print("ŷ =", [round(p, 1) for p in preds])
```

Output: `ŷ = [190.0, 295.0, 436.0, 210.0, 325.0]`
This is the definition made executable. One Python loop over houses, one dot product inside, one addition for the bias. If you squint, you can see the ruler sliding across the x-axis, pausing at each house, reading the y.
Now the upgrade. NumPy lets you write the dot-product-over-batch as a single operator, and the inner arithmetic runs in compiled C over whole arrays. Same computation. No Python loop.
```python
import numpy as np

def linear_forward(X, w, b):
    return X @ w + b  # one line, arbitrary batch, arbitrary feature count

X = np.array([
    [1.4, 2, 10],
    [2.1, 3, 5],
    [3.2, 4, 2],
    [1.8, 2, 20],
    [2.6, 3, 15],
])
w = np.array([100.0, 25.0, -2.0])
b = 20.0

print(linear_forward(X, w, b))
```

Output: `[190. 295. 436. 210. 325.]`
- `sum(row[i] * w[i] for i in range(D))` ←→ `X @ w` (or `np.dot(X, w)`) — one operator for the (N × D) @ (D,) → (N,) predictions
- `for row in X: ...` ←→ the batch is implicit — NumPy does the loop over N in compiled C
The + b at the end is secretly doing something: the left-hand side is an N-vector and b is a scalar, so NumPy broadcasts b across every element. That's fine, expected, and also a rich source of bugs — more on that in the Gotcha section.
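To make that concrete, here's one version of the trap (a sketch, with made-up sizes): store b as a column vector instead of a scalar, and the `+` silently fans the N-vector of predictions out into an (N × N) matrix. NumPy raises no error — the shapes are broadcast-compatible, just not in the way you meant.

```python
import numpy as np

X = np.array([[1.4, 2, 10],
              [2.1, 3, 5],
              [3.2, 4, 2]])
w = np.array([100.0, 25.0, -2.0])

b_scalar = 20.0
b_column = np.array([[20.0], [20.0], [20.0]])  # shape (3, 1) — looks harmless

good = X @ w + b_scalar   # (3,) + scalar -> (3,)   correct
bad  = X @ w + b_column   # (3,) + (3, 1) -> (3, 3) silent blow-up, no error

print(good.shape, bad.shape)  # (3,) (3, 3)
```

Asserting `yhat.shape == (N,)` right after the forward pass catches this class of bug immediately.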
Real neural networks chain thousands of operations, and writing out the gradient of each one by hand is a job for a person with very few deadlines. PyTorch replaces the manual bookkeeping with autograd. The forward pass looks the same — the framework just also remembers what it did, so it can differentiate later.
```python
import torch
import torch.nn as nn

# nn.Linear(in_features=3, out_features=1) IS linear regression.
# It owns a weight matrix (1, 3) and a bias vector (1,), both learnable.
model = nn.Linear(in_features=3, out_features=1, bias=True)

# Load our hand-picked weights directly for the demo.
with torch.no_grad():
    model.weight.copy_(torch.tensor([[100.0, 25.0, -2.0]]))
    model.bias.copy_(torch.tensor([20.0]))

X = torch.tensor([
    [1.4, 2, 10],
    [2.1, 3, 5],
    [3.2, 4, 2],
    [1.8, 2, 20],
    [2.6, 3, 15],
])
preds = model(X).squeeze(-1)
print("model's prediction:", preds)
```

Output: `model's prediction: tensor([190.0000, 295.0000, 436.0000, 210.0000, 325.0000], grad_fn=<SqueezeBackward1>)`

- `X @ w + b` ←→ `nn.Linear(in_features=D, out_features=1)` — packaged with learnable params + autograd
- `w = np.array([...])  # you own the array` ←→ `model.weight  # learned during training` — the Module keeps its parameters for you
- one regressor ←→ `nn.Linear(D, K)  # K regressors in parallel` — scale up by asking for more output features
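In plain NumPy terms, those K parallel regressors are just a weight matrix instead of a weight vector. A sketch (note that `nn.Linear` stores its weight as (out_features, in_features), which is why the transpose appears here):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 5, 3, 4              # 5 examples, 3 features, 4 parallel regressors

X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D))    # one weight row per regressor, nn.Linear-style
b = rng.normal(size=K)         # one bias per regressor

Y = X @ W.T + b                # (N, D) @ (D, K) + (K,) -> (N, K)
assert Y.shape == (N, K)

# Column k of Y is exactly the k-th single regressor run on its own:
for k in range(K):
    assert np.allclose(Y[:, k], X @ W[k] + b[k])
```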
Two traps lie in wait for everyone writing this for the first time. Both are avoidable; both will eat an afternoon if they aren't.
Grab a tiny dataset by hand — say, [(0, 1.1), (1, 2.9), (2, 5.1), (3, 7.0), (4, 9.1)]. Eyeball it: the relationship is roughly y = 2x + 1. Write the pure-Python linear_forward, try w = 2.0, b = 1.0, compute the predictions, and print the residuals. Can you tweak w and b by hand to drive the MSE below 0.02?
You're doing by hand what next lesson's algorithm does mechanically — nudging the ruler until the sticks shrink.
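A minimal scaffold for that exercise, in the same pure-Python style — plug in any (w, b) guess and it scores the fit (the helper name `mse` is just for this sketch):

```python
data = [(0, 1.1), (1, 2.9), (2, 5.1), (3, 7.0), (4, 9.1)]

def mse(w, b):
    # Mean of squared residuals: how badly the ruler misses, on average.
    residuals = [y - (w * x + b) for x, y in data]
    return sum(r * r for r in residuals) / len(residuals)

w, b = 2.0, 1.0
preds = [w * x + b for x, _ in data]
print("preds:", preds)  # [1.0, 3.0, 5.0, 7.0, 9.0]
print("MSE:", round(mse(w, b), 4))
```

Nudge w and b, re-run, and watch the MSE move — that feedback loop is the whole exercise.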
What to carry forward. A linear regression is a ruler through a point cloud; the forward pass is the ruler's reading at each x. Numerically, one prediction is a dot product of the feature vector with a weight vector, plus a bias. Batched, the whole dataset becomes one matrix multiply: X @ w + b, shapes (N × D) · (D,) = (N,). Every dense layer in every neural network is this operation at larger scale, with a non-linearity on top.
Next up — Linear Regression (Training). You can lay a ruler down. But which ruler is the best one? How do you find it? There are two answers. One is a closed-form formula from pure linear algebra that solves the problem exactly in a single matrix inversion — elegant, but it gives up the moment your dataset stops fitting in memory. The other is gradient descent on the MSE, the same algorithm you already know, now wearing a new loss. One works everywhere the other doesn't, and the contrast is the entire reason the rest of deep learning looks the way it does.