Gradient Descent
The workhorse optimizer — derive, implement, and visualize it.
Imagine you're blindfolded. Someone drops you on a hillside at night and tells you to walk to the bottom of the valley. You can't see the hill. You can't see the valley. The only thing your body has to work with is the slope of the ground under your feet.
What do you do? You shuffle a little, feel which way the ground falls away fastest, step that way, and repeat. When every direction feels flat, you're at the bottom. Or, honestly, you're at a bottom — but we'll get to that disappointment later.
That's gradient descent. Whole algorithm. Before any math, before any code, before any neural network — that's the thing we're doing. The rest of this page is what the blindfold, the hill, and the shuffle become when the hill is the inside of a model.
The hill is the loss. Each spot on the hillside corresponds to one setting of the model's dials, and the height at that spot is how wrong the model is when its dials are set that way. Low ground = small loss = model gets answers right. The blindfolded shuffle is training: you can't see the whole hill — it lives in millions of dimensions, one per dial — so you feel your way down it.
dials (θ) ──▶ MODEL ──▶ prediction
│
right answer ──┴──▶ compare ──▶ LOSS (one number)
training loop:
1. feel the slope under your feet ← gradient
2. step the opposite way, a little ← update rule
3. repeat until the ground is flat   ← done

From here on: when you read parameter, picture a dial. When you read gradient, picture the way the ground slopes under your blindfolded foot. When you read learning rate, picture the size of the step you're willing to take before stopping to feel the ground again. Everything else is detail.
Let's make the problem concrete. You have a model — say, one that looks at an image and predicts whether it's a cat. Fresh out of the box, it guesses “cat” for everything: dogs, trees, your lunch. The loss is enormous. The model, to use a technical term, is cooked.
You could tweak the dials by hand. There are a hundred million of them. You'll finish sometime in the next thousand years, assuming you don't stop to eat.
Or you could do what every neural network you've ever heard of — GPT-4, DALL·E, AlphaFold, the thing that picks your next YouTube video — does, which is: shuffle. One loop, running billions of times a second, each step asking the same question: am I going the right direction?
That's gradient descent. Genuinely, almost insultingly simple. Most tutorials draw an arrow pointing downhill, write w ← w − α∇L, and walk away. You'll leave this page with something better: a feel for the algorithm in your fingers. Drop a marble on a loss surface. Crank the learning rate until your model detonates. Scrub through every iteration by hand. Then — and only then — write the code.
Start here. The hero widget below is a 3D loss surface. Click anywhere to place a marble, then press release. Tilt your view, drop marbles in different places, slide α around. The rest of the lesson is an unpacking of what you see here.
A bowl-shaped surface. A marble. The marble feels the slope right where it is and rolls downhill. Same hiker as before — just well-animated, and lit. The math you're about to see is a marble on a surface, with the word “marble” replaced by parameter vector and the word “surface” replaced by loss function. No metaphor is being smuggled.
The marble only ever controls two things. First: which direction to step. Second: how far. Meet them.
I point uphill. That's my whole personality. If you want down, step the opposite way — and please don't ask me to plan more than one step ahead, I can only tell you about right here, right now.
Set me too high and I'll blow up your model. Set me too low and your model will still be training when the sun burns out. There is no correct value for me in the abstract — only a correct value for your loss surface. Good luck.
The gradient tells you which way. The learning rate tells you how far. You want the biggest step you can take without eating the pavement — think Indy, sprinting down the corridor with the boulder on his heels, one bad stride from being part of the floor. Every other optimizer on the planet — SGD, Adam, RMSProp, momentum, Adagrad — is an elaboration of those two choices. Get the intuition right here and the rest will feel like footnotes.
Before any math, let's put that personification under a microscope. Strip the surface from 3D down to 1D — same idea, but now you can watch the marble step back and forth along a line and see what the learning rate actually does.
Drag α upward slowly. Around α = 0.5 the marble stops settling and starts bouncing — every step overshoots the bottom. Push past α = 1 and it launches off into the void — our blindfolded hiker just tripped over their own feet. There's a hard convergence condition lurking here, and you just found it by feel. Now we find it by math.
Pick the simplest possible loss: f(x) = x². A bowl with its bottom at x = 0. Everything we need follows from two facts.
f(x) = x² f'(x) = 2x
At any point x, f'(x) is the slope — positive means climbing right, negative means climbing left. This is the number the hiker feels under their foot. The update rule says move opposite to the slope:
x_new = x_old − α · f'(x_old)
= x_old − α · 2 · x_old
      = x_old · (1 − 2α)                                (1.1)

Look at the last line of (1.1). Every step multiplies x by the same constant. So after n steps starting from x₀:

x_n = x₀ · (1 − 2α)^n                                   (1.2)
This is a geometric sequence — the math version of the thing you already saw happen. (1.2) decays to zero iff |1 − 2α| < 1, i.e. 0 < α < 1. Below α = 0.5 the multiplier is positive — every step is the same sign, the hiker walks straight down. Between 0.5 and 1 the multiplier flips negative — they overshoot, land on the far side, overshoot less, repeat. Zigzag, but converging. At exactly α = 1, |1 − 2α| = 1 — they bounce between x₀ and −x₀ forever, a tennis match with no umpire. Past α = 1, divergence. The hiker is now airborne. Flip back up and try those thresholds in the widget — the math says exactly what your eyes saw.
α too high (e.g. 0.6): (1 − 2·0.6) = −0.2, so x flips sign every step. Still converges — just zigzags.
α > 1: the multiplier exceeds 1 in absolute value. Each step overshoots by more than it corrects. x flies off to infinity. The model is cooked.
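The three regimes are easy to verify numerically. A tiny sketch — the α values below are illustrative picks, one per regime:

```python
# Iterate x_new = x - alpha * f'(x) on f(x) = x**2 and watch each regime.
def run(alpha, x0=5.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # f'(x) = 2x, so each step multiplies x by (1 - 2*alpha)
    return x

print(abs(run(0.3)))   # 0 < alpha < 0.5: monotone decay toward 0
print(abs(run(0.8)))   # 0.5 < alpha < 1: zigzags, but still decays toward 0
print(abs(run(1.1)))   # alpha > 1: |1 - 2*alpha| > 1, blows up
```

Change `steps` and the decay rate changes, but the regime boundaries stay exactly where (1.2) says they are.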
Real networks: f(x) = x² is the friendliest loss in existence. Real loss surfaces have curvature that varies wildly across parameters, which is why a single global α almost never works and why adaptive optimizers (Adam and friends) exist.
The formula says x₂₅ = 5 · 0.8²⁵ ≈ 0.0189. That's true in the same way “Paris is the capital of France” is true — correct and utterly unconvincing until you've been there. Scrub through 25 steps below. Watch x shrink; watch the loss collapse. The two numbers the marble sees at each step are just x and f(x). Nothing else.
| step | x | f(x) |
|---|---|---|
| 0 | 5.0000 | 25.0000 |
| 1 | 4.0000 | 16.0000 |
| 2 | 3.2000 | 10.2400 |
| 3 | 2.5600 | 6.5536 |
| 4 | 2.0480 | 4.1943 |
| 5 | 1.6384 | 2.6844 |
| 6 | 1.3107 | 1.7180 |
| 7 | 1.0486 | 1.0995 |
| 8 | 0.8389 | 0.7037 |
| 9 | 0.6711 | 0.4504 |
| 10 | 0.5369 | 0.2882 |
| 11 | 0.4295 | 0.1845 |
| 12 | 0.3436 | 0.1181 |
| 13 | 0.2749 | 0.0756 |
| 14 | 0.2199 | 0.0484 |
| 15 | 0.1759 | 0.0309 |
| 16 | 0.1407 | 0.0198 |
| 17 | 0.1126 | 0.0127 |
| 18 | 0.0901 | 0.0081 |
| 19 | 0.0721 | 0.0052 |
| 20 | 0.0576 | 0.0033 |
| 21 | 0.0461 | 0.0021 |
| 22 | 0.0369 | 0.0014 |
| 23 | 0.0295 | 0.0009 |
| 24 | 0.0236 | 0.0006 |
| 25 | 0.0189 | 0.0004 |
You've seen it move. You've seen the math. Now write it three times — pure Python, NumPy, PyTorch — and the third one trains real neural networks. Every production training script in the world is one of those three with a million more lines of bookkeeping wrapped around it.
def gradient_descent_scratch(init, learning_rate, iterations):
    x = init                               # start at our initial position
    for step in range(iterations):
        gradient = 2 * x                   # f'(x) = 2x — the slope at current x
        x = x - learning_rate * gradient   # x_new = x_old - a * f'(x_old)
    return x
result = gradient_descent_scratch(init=5.0, learning_rate=0.1, iterations=25)
print(f"After 25 steps: {result:.5f}")
# -> After 25 steps: 0.01889
The line x = x - learning_rate * gradient is the update rule from the math. Nothing is hidden. Matches the scrubber's final value to the digit.
One parameter is cute. Real models have billions. Looping in pure Python over a billion parameters per step would finish training sometime around the heat death of the sun. Enter NumPy — same loop, but the inner arithmetic runs in compiled C on whole vectors at once.
import numpy as np # NumPy — Python's numerical backbone. Vectorised arithmetic in C.
def gd_multi_numpy(theta_init, learning_rate, iterations):
    theta = np.array(theta_init, dtype=float)    # wrap the list in a vector
    for _ in range(iterations):
        gradient = 2 * theta                     # operates on ALL elements simultaneously
        theta = theta - learning_rate * gradient
    return theta
result = gd_multi_numpy([5.0, -3.0, 2.0], learning_rate=0.1, iterations=25)
print(np.round(result, 4))
# -> [ 0.0189 -0.0113  0.0076]
Pure Python vs NumPy, line for line:
- gradients = [2 * t for t in theta] ←→ gradient = 2 * theta — one operation, all elements (broadcasting)
- theta = [t - lr * g for t, g in zip(...)] ←→ theta = theta - lr * gradient — vector subtraction, no Python loop
In the NumPy version we could simply write gradient = 2 * theta because the gradient of Σ θᵢ² has a closed form. Real networks chain thousands of operations — each layer's output is the input to the next, and each operation contributes its own slope to the final slope. That stacking of slopes has a name (the chain rule, coming up in the Backpropagation lesson), and it's why you can't hardcode the gradient of GPT-4 by hand. That's why PyTorch exists: it computes gradients automatically.
import torch
import torch.optim as optim # every optimizer lives here — SGD, Adam, RMSProp, all of them
# requires_grad=True tells PyTorch to track this tensor for automatic differentiation.
theta = torch.tensor([5.0, -3.0, 2.0], requires_grad=True)
# optim.SGD is the packaged version of our update rule: theta -= lr * gradient.
optimizer = optim.SGD([theta], lr=0.1)
for step in range(25):
    optimizer.zero_grad()        # PyTorch accumulates gradients — reset each step
    loss = (theta ** 2).sum()    # f(theta) = sum(theta_i ** 2)
    loss.backward()              # compute gradients automatically (autograd)
    optimizer.step()             # apply: theta = theta - lr * gradient
print(torch.round(theta.detach(), decimals=4))
# -> tensor([ 0.0189, -0.0113,  0.0076])
Code vs math, line for line:
- loss = (theta ** 2).sum() ←→ f(θ) = Σ θᵢ² — same objective, defined in Python
- loss.backward() ←→ gradient = 2 * theta — we hardcoded this; autograd derives it from the loss expression
- optimizer.step() ←→ theta = theta - lr * gradient — identical update, packaged as a single call
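One habit worth carrying between all three layers: any gradient, hand-derived or autograd-produced, can be sanity-checked against a central finite difference, (f(x + h) − f(x − h)) / 2h. A pure-Python sketch on our f(x) = x² — the helper names here are ours, not from any library:

```python
# Compare the hand-derived slope f'(x) = 2x against a numerical estimate.
def f(x):
    return x * x

def analytic_grad(x):
    return 2 * x                               # derived by hand

def numeric_grad(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)     # central finite difference

for x in [5.0, -3.0, 2.0]:
    print(x, analytic_grad(x), numeric_grad(x))
```

If the two columns disagree beyond floating-point noise, the hand-derived gradient is wrong — this check is how autograd implementations are themselves tested.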
Now the hard part. Everything so far has been f(x) = x² — a cooperative bowl with a single minimum and smooth curvature everywhere. Your intuition has been trained on the friendliest function that exists.
Real loss surfaces are not cooperative. They have local minima — small bowls nested inside the big one. They have saddle points, where the ground goes up in one direction and down in another — the hiker's foot feels flat on average, even though they're nowhere near a valley floor. They have plateaus where progress dies for hundreds of steps at a time. Drop the marble here, then there, then somewhere else — and watch gradient descent give you a different answer every time.
Same algorithm. Same α. Different starting points, wildly different final answers. That's not a bug in gradient descent — it's the nature of non-convex optimization, and it's why “initialization” is a real line item in every modern training pipeline. The marble only ever sees the local slope, so where you drop it matters.
The honest takeaway is slightly unsettling: in a big enough neural net, you're never really finding the minimum. You're finding a minimum. Fortunately, a weird empirical fact rescues us — in high dimensions, the local minima of realistic loss surfaces tend to have nearly identical loss values. The surface is messy, but most of the messy spots are about equally good. Hundreds of valleys, each one a fine place to stop. More on that when we get to the training chapter.
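The start-dependence is easy to reproduce in one dimension. A sketch on a made-up double-well loss — the function is an illustration, not the widget's surface: f(x) = x⁴ − 2x² has two minima, at x = −1 and x = +1.

```python
# Same algorithm, same learning rate — only the starting point differs.
def descend(x, lr=0.01, steps=500):
    for _ in range(steps):
        grad = 4 * x**3 - 4 * x    # f'(x) for f(x) = x**4 - 2*x**2
        x = x - lr * grad
    return x

print(round(descend(0.5), 4))     # settles near +1
print(round(descend(-0.5), 4))    # settles near -1
```

Both endpoints are genuine minima with the same loss value, f(±1) = −1, which previews the "many equally good valleys" point above.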
One more thing vanilla gradient descent is terrible at: narrow ravines. Long thin valleys where the slope across the valley is huge and the slope along it is tiny. The marble burns every step ping-ponging across the ravine and barely any traveling forward. The cure is inertia — give the marble momentum so it smooths out the zigzag and picks up speed along the valley floor. Watch the two of them side by side.
Blue is vanilla. Gold is momentum. Same start, same α — one of them ends near the minimum, the other is still zigzagging. The whole zoo of modern optimizers (Adam, RMSProp, Adafactor) are variations on the same idea: carry state across steps so a single noisy gradient can't knock you off course. We'll build momentum from scratch in a later lesson. For now just note: this is what people mean when they say “SGD with momentum.”
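If you want the mechanics before that lesson, here is a minimal sketch under assumed settings: a toy ravine f(x, y) = 100x² + y² (steep across, shallow along), the conventional momentum coefficient β = 0.9, and a learning rate picked for this surface. None of these match the widget exactly.

```python
import math

def grad(p):
    x, y = p
    return 200 * x, 2 * y                    # df/dx = 200x, df/dy = 2y

def vanilla(p, lr=0.009, steps=100):
    for _ in range(steps):
        gx, gy = grad(p)
        p = (p[0] - lr * gx, p[1] - lr * gy)
    return p

def momentum(p, lr=0.009, beta=0.9, steps=100):
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(p)
        vx = beta * vx + gx                  # velocity remembers past gradients
        vy = beta * vy + gy
        p = (p[0] - lr * vx, p[1] - lr * vy)
    return p

start = (1.0, 2.0)
print("vanilla :", math.hypot(*vanilla(start)))    # still far out along the shallow y axis
print("momentum:", math.hypot(*momentum(start)))   # much closer to the minimum
```

The steep x direction forces a tiny lr on vanilla, which then crawls along y; momentum's velocity accumulates the small y gradients and covers the valley floor much faster.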
Bump lr past 0.5. Within a handful of steps x starts bouncing across zero. Push it past 1.0 and it runs away to infinity. That's the convergence condition from (1.2) — |1 − 2α| < 1 — enforcing itself in real code.
Try it. Change lr, re-run, then check the trail against the rule.
# Vanilla gradient descent on f(x) = x^2. Starts at x = 5.
# Challenge: find the largest lr for which |x_25| < 1e-3.
lr = 0.1 # try 0.5, 0.9, 1.0, 1.1 — watch the regime flip
steps = 25
x = 5.0
trail = [x]
for _ in range(steps):
    grad = 2 * x
    x = x - lr * grad
    trail.append(x)
print(f"lr = {lr}")
print(f"final x = {x:+.5f}")
print(f"max |x| seen = {max(abs(v) for v in trail):+.5f}")
print(f"converged?   = {abs(x) < 1e-3}")

What to carry forward. Gradient descent is the loop inside every training algorithm in ML — not optional, load-bearing. The learning rate isn't just a hyperparameter; it's a convergence condition, and breaking it breaks the model completely. The three-layer progression — pure Python, NumPy, PyTorch — is the same progression we'll use for every algorithm in this series: see the mechanics, scale them up, then cede them to the library. Every layer above is a shortcut for a layer below, never magic.
Next up — Sigmoid & ReLU. Gradient descent needs a function to differentiate. Inside a neural net that function is a stack of matrix multiplies with a little shape-bending non-linearity wedged between each layer — and the whole thing collapses into a single straight line without those non-linearities. The two most common are sigmoid and ReLU. Small, humble, and they decide what “firing” means for a neuron. Their derivatives plug directly into the update rule you just learned — and one of them has a bad habit of murdering gradients entirely. We'll find out which, and why.
Natural continuations that build directly on this.
You triple the learning rate on a model that was training fine at lr=0.01. The loss shoots up, then to NaN. What's the mechanical story?
- [01] Zhang, Lipton, Li, Smola — Dive into Deep Learning (d2l.ai)
- [02] Goodfellow, Bengio, Courville — Deep Learning, MIT Press, 2016
- [03] Sebastian Ruder — An Overview of Gradient Descent Optimization Algorithms, 2016
- [04] Li, Xu, Taylor, Studer, Goldstein — Visualizing the Loss Landscape of Neural Nets, NeurIPS 2018