Gradient Descent

The workhorse optimizer — derive, implement, and visualize it.

Easy
~15 min read
· lesson 1 of 6

Imagine you're blindfolded. Someone drops you on a hillside at night and tells you to walk to the bottom of the valley. You can't see the hill. You can't see the valley. The only thing your body has to work with is the slope of the ground under your feet.

What do you do? You shuffle a little, feel which way the ground falls away fastest, step that way, and repeat. When every direction feels flat, you're at the bottom. Or, honestly, you're at a bottom — but we'll get to that disappointment later.

That's gradient descent. Whole algorithm. Before any math, before any code, before any neural network — that's the thing we're doing. The rest of this page is what the blindfold, the hill, and the shuffle become when the hill is the inside of a model.

The hill is the loss. Each spot on the hillside corresponds to one setting of the model's dials, and the height at that spot is how wrong the model is when its dials are set that way. Low ground = small loss = model gets answers right. The blindfolded shuffle is training: you can't see the whole hill — it lives in millions of dimensions, one per dial — so you feel your way down it.

       dials (θ)  ──▶  MODEL  ──▶  prediction
                                      │
                       right answer ──┴──▶  compare  ──▶  LOSS (one number)


       training loop:
         1. feel the slope under your feet   ←  gradient
         2. step the opposite way, a little  ←  update rule
         3. repeat until the ground is flat  ←  done
the whole lesson, in a diagram

From here on: when you read parameter, picture a dial. When you read gradient, picture the way the ground slopes under your blindfolded foot. When you read learning rate, picture the size of the step you're willing to take before stopping to feel the ground again. Everything else is detail.

Let's make the problem concrete. You have a model — say, one that looks at an image and predicts whether it's a cat. Fresh out of the box, it guesses “cat” for everything: dogs, trees, your lunch. The loss is enormous. The model, to use a technical term, is cooked.

You could tweak the dials by hand. There are a hundred million of them. You'll finish sometime in the next thousand years, assuming you don't stop to eat.

Or you could do what every neural network you've ever heard of — GPT-4, DALL·E, AlphaFold, the thing that picks your next YouTube video — does, which is: shuffle. One loop, running billions of times a second, each step asking the same question: am I going the right direction?

That's gradient descent. Genuinely, almost insultingly simple. Most tutorials draw an arrow pointing downhill, write w ← w − α∇L, and walk away. You'll leave this page with something better: a feel for the algorithm in your fingers. Drop a marble on a loss surface. Crank the learning rate until your model detonates. Scrub through every iteration by hand. Then — and only then — write the code.

Start here. The hero widget below is a 3D loss surface. Click anywhere to place a marble, then press release. Tilt your view, drop marbles in different places, slide α around. The rest of the lesson is an unpacking of what you see here.

[interactive: 3D loss surface]

A bowl-shaped surface. A marble. The marble feels the slope right where it is and rolls downhill. Same hiker as before — just well-animated, and lit. The math you're about to see is a marble on a surface, with the word “marble” replaced by parameter vector and the word “surface” replaced by loss function. No metaphor is being smuggled.

The marble only ever controls two things. First: which direction to step. Second: how far. Meet them.

Gradient (personified)
I point uphill. That's my whole personality. If you want down, step the opposite way — and please don't ask me to plan more than one step ahead, I can only tell you about right here, right now.
Learning rate (personified)
Set me too high and I'll blow up your model. Set me too low and your model will still be training when the sun burns out. There is no correct value for me in the abstract — only a correct value for your loss surface. Good luck.

The gradient tells you which way. The learning rate tells you how far. You want the biggest step you can take without eating the pavement — think Indy, sprinting down the corridor with the boulder on his heels, one bad stride from being part of the floor. Every other optimizer on the planet — SGD, Adam, RMSProp, momentum, Adagrad — is an elaboration of those two choices. Get the intuition right here and the rest will feel like footnotes.

Before any math, let's put that personification under a microscope. Strip the surface from 3D down to 1D — same idea, but now you can watch the marble step back and forth along a line and see what the learning rate actually does.

[interactive: learning rate — convergence and divergence · f(x) = x², side view]

Drag α upward slowly. Past α = 0.5 the marble stops walking straight down and starts bouncing — every step overshoots the bottom, though it still settles. Push past α = 1 and it launches off into the void — our blindfolded hiker just tripped over their own feet. There's a hard convergence condition lurking here, and you just found it by feel. Now we find it by math.

Pick the simplest possible loss: f(x) = x². A bowl with its bottom at x = 0. Everything we need follows from two facts.

derivative of the loss
f(x)   =  x²
f'(x)  =  2x

At any point x, f'(x) is the slope — positive means climbing right, negative means climbing left. This is the number the hiker feels under their foot. The update rule says move opposite to the slope:

gradient descent update on f(x) = x²
x_new  =  x_old  −  α · f'(x_old)
       =  x_old  −  α · 2 · x_old
       =  x_old · (1 − 2α)
(1.1)

Look at the last line of (1.1). Every step multiplies x by the same constant. So after n steps starting from x₀:

closed form after n steps
x_n  =  x_0 · (1 − 2α)^n
(1.2)

This is a geometric sequence — the math version of the thing you already saw happen. (1.2) decays to zero iff |1 − 2α| < 1, i.e. 0 < α < 1. Below α = 0.5 the multiplier is positive — every step is the same sign, the hiker walks straight down. Between 0.5 and 1 the multiplier flips negative — they overshoot, land on the far side, overshoot less, repeat. Zigzag, but converging. At exactly α = 1, |1 − 2α| = 1 — they bounce between x₀ and −x₀ forever, a tennis match with no umpire. Past α = 1, divergence. The hiker is now airborne. Flip back up and try those thresholds in the widget — the math says exactly what your eyes saw.
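You can check (1.2) and its regimes numerically. A minimal sketch, plain Python, no dependencies — the specific α values are illustrative picks, one from each regime:

```python
def gd_1d(x0, alpha, steps):
    """Iterate x <- x - alpha * f'(x) for f(x) = x^2, where f'(x) = 2x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

x0, steps = 5.0, 25
for alpha in (0.3, 0.6, 0.9, 1.1):
    iterated = gd_1d(x0, alpha, steps)
    closed = x0 * (1 - 2 * alpha) ** steps          # equation (1.2)
    regime = "converges" if abs(1 - 2 * alpha) < 1 else "diverges"
    print(f"alpha={alpha}: iterated={iterated:.6g}  closed-form={closed:.6g}  ({regime})")
```

The iterated value and the closed form agree for every α: below 0.5 the iterate shrinks monotonically, between 0.5 and 1 it alternates sign while shrinking, and at 1.1 it blows up.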

Gotchas

α too high (e.g. 0.6): (1 − 2·0.6) = −0.2, so x flips sign every step. Still converges — just zigzags.

α ≥ 1: the multiplier exceeds 1 in absolute value. Each step overshoots by more than it corrects. x flies off to infinity. The model is cooked.

Real networks: f(x) = x² is the friendliest loss in existence. Real loss surfaces have curvature that varies wildly across parameters, which is why a single global α almost never works and why adaptive optimizers (Adam and friends) exist.

The formula says x₂₅ = 5 · 0.8²⁵ ≈ 0.0189. That's true in the same way “Paris is the capital of France” is true — correct and utterly unconvincing until you've been there. Scrub through 25 steps below. Watch x shrink; watch the loss collapse. The two numbers the marble sees at each step are just x and f(x). Nothing else.

25 steps by hand
x₀ = 5 · α = 0.1

step    x        f(x)
  0     5.0000   25.0000
  1     4.0000   16.0000
  2     3.2000   10.2400
  3     2.5600    6.5536
  4     2.0480    4.1943
  5     1.6384    2.6844
  6     1.3107    1.7180
  7     1.0486    1.0995
  8     0.8389    0.7037
  9     0.6711    0.4504
 10     0.5369    0.2882
 11     0.4295    0.1845
 12     0.3436    0.1181
 13     0.2749    0.0756
 14     0.2199    0.0484
 15     0.1759    0.0309
 16     0.1407    0.0198
 17     0.1126    0.0127
 18     0.0901    0.0081
 19     0.0721    0.0052
 20     0.0576    0.0033
 21     0.0461    0.0021
 22     0.0369    0.0014
 23     0.0295    0.0009
 24     0.0236    0.0006
 25     0.0189    0.0004

[interactive: step scrubber — x, f(x), loss decay]

You've seen it move. You've seen the math. Now write it three times, each shorter than the last — and the third one trains real neural networks. Pure Python, NumPy, PyTorch. Every production training script in the world is one of those three with a million more lines of bookkeeping wrapped around it.

layer 1 — pure python · gradient_descent_scratch.py
def gradient_descent_scratch(init, learning_rate, iterations):
    x = init                                   # start at our initial position
    for step in range(iterations):
        gradient = 2 * x                       # f'(x) = 2x — the slope at current x
        x = x - learning_rate * gradient       # x_new = x_old - a * f'(x_old)
    return x

result = gradient_descent_scratch(init=5.0, learning_rate=0.1, iterations=25)
print(f"After 25 steps: {result:.5f}")
stdout
After 25 steps: 0.01889

The line x = x - learning_rate * gradient is the update rule from the math. Nothing is hidden. Matches the scrubber's final value to the digit.

One parameter is cute. Real models have billions. Looping in pure Python over a billion parameters per step would finish training sometime around the heat death of the sun. Enter NumPy — same loop, but the inner arithmetic runs in compiled C on whole vectors at once.

layer 2 — numpy · gd_multi_numpy.py
import numpy as np   # NumPy — Python's numerical backbone. Vectorised arithmetic in C.

def gd_multi_numpy(theta_init, learning_rate, iterations):
    theta = np.array(theta_init, dtype=float)   # wrap the list in a vector
    for _ in range(iterations):
        gradient = 2 * theta                    # operates on ALL elements simultaneously
        theta = theta - learning_rate * gradient
    return theta

result = gd_multi_numpy([5.0, -3.0, 2.0], learning_rate=0.1, iterations=25)
print(np.round(result, 4))
# -> [ 0.0189 -0.0113  0.0076]
pure python → numpy
  gradients = [2 * t for t in theta]          ↔   gradient = 2 * theta
      one operation, all elements — broadcasting
  theta = [t - lr * g for t, g in zip(...)]   ↔   theta = theta - lr * gradient
      vector subtraction, no Python loop

We could hardcode gradient = 2 * theta because the gradient of Σ θᵢ² has a closed form. Real networks chain thousands of operations — each layer's output is the input to the next, and each operation contributes its own slope to the final slope. That stacking of slopes has a name (the chain rule, coming up in the Backpropagation lesson), and it's why you can't hardcode the gradient of GPT-4 by hand. That's why PyTorch exists: it computes gradients automatically.

layer 3 — pytorch · gd_pytorch.py
import torch
import torch.optim as optim    # every optimizer lives here — SGD, Adam, RMSProp, all of them

# requires_grad=True tells PyTorch to track this tensor for automatic differentiation.
theta = torch.tensor([5.0, -3.0, 2.0], requires_grad=True)

# optim.SGD is the packaged version of our update rule: theta -= lr * gradient.
optimizer = optim.SGD([theta], lr=0.1)

for step in range(25):
    optimizer.zero_grad()        # PyTorch accumulates gradients — reset each step
    loss = (theta ** 2).sum()    # f(theta) = sum(theta_i ** 2)
    loss.backward()              # compute gradients automatically (autograd)
    optimizer.step()             # apply: theta = theta - lr * gradient

print(torch.round(theta.detach(), decimals=4))
stdout
tensor([ 0.0189, -0.0113,  0.0076])
numpy → pytorch
  loss = (theta ** 2).sum()   ↔   f(θ) = Σ θᵢ²
      same objective, defined in Python
  loss.backward()             ↔   gradient = 2 * theta
      we hardcoded this — autograd derives it from the loss expression
  optimizer.step()            ↔   theta = theta - lr * gradient
      identical update, packaged as a single call

Now the hard part. Everything so far has been f(x) = x² — a cooperative bowl with a single minimum and smooth curvature everywhere. Your intuition has been trained on the friendliest function that exists.

Real loss surfaces are not cooperative. They have local minima — small bowls nested inside the big one. They have saddle points, where the ground goes up in one direction and down in another — the hiker's foot feels flat on average, even though they're nowhere near a valley floor. They have plateaus where progress dies for hundreds of steps at a time. Drop the marble here, then there, then somewhere else — and watch gradient descent give you a different answer every time.

[interactive: non-convex loss surface]

Same algorithm. Same α. Different starting points, wildly different final answers. That's not a bug in gradient descent — it's the nature of non-convex optimization, and it's why “initialization” is a real line item in every modern training pipeline. The marble only ever sees the local slope, so where you drop it matters.
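You can reproduce that sensitivity in one dimension. A minimal sketch — not the widget's surface, just the simplest double-well, f(x) = (x² − 1)², whose two minima sit at x = ±1; the learning rate and step count are illustrative picks:

```python
def gd(x, lr=0.05, steps=200):
    """Gradient descent on f(x) = (x^2 - 1)^2, with f'(x) = 4x(x^2 - 1)."""
    for _ in range(steps):
        x = x - lr * 4 * x * (x**2 - 1)
    return x

print(gd(+0.5))   # settles near +1
print(gd(-0.5))   # same algorithm, same lr — settles near -1
```

Two starts one unit apart, identical everything else, opposite answers. The marble only ever sees the local slope.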

The honest takeaway is slightly unsettling: in a big enough neural net, you're never really finding the minimum. You're finding a minimum. Fortunately, a weird empirical fact rescues us — in high dimensions, the local minima of realistic loss surfaces tend to have nearly identical loss values. The surface is messy, but most of the messy spots are about equally good. Hundreds of valleys, each one a fine place to stop. More on that when we get to the training chapter.

One more thing vanilla gradient descent is terrible at: narrow ravines. Long thin valleys where the slope across the valley is huge and the slope along it is tiny. The marble burns every step ping-ponging across the ravine and barely any traveling forward. The cure is inertia — give the marble momentum so it smooths out the zigzag and picks up speed along the valley floor. Watch the two of them side by side.
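That side-by-side is easy to sketch in code. The surface and both update rules below match the comparison described here; the specific α, β, start point, and step count are illustrative assumptions:

```python
def grad(x, y):
    """∇f for f(x, y) = 0.05x² + 4y² — gentle along x, steep across y."""
    return 0.1 * x, 8.0 * y

def loss(x, y):
    return 0.05 * x**2 + 4 * y**2

start, alpha, beta, steps = (5.0, 2.0), 0.2, 0.9, 100

# vanilla: x <- x - alpha * grad
x, y = start
for _ in range(steps):
    gx, gy = grad(x, y)
    x, y = x - alpha * gx, y - alpha * gy

# momentum: v <- beta*v - alpha*grad ; x <- x + v
mx, my = start
vx = vy = 0.0
for _ in range(steps):
    gx, gy = grad(mx, my)
    vx, vy = beta * vx - alpha * gx, beta * vy - alpha * gy
    mx, my = mx + vx, my + vy

print(f"vanilla  loss after {steps} steps: {loss(x, y):.6f}")
print(f"momentum loss after {steps} steps: {loss(mx, my):.6f}")
```

Vanilla spends its budget bouncing across the steep y-direction while crawling along x; the velocity term averages out the across-valley zigzag and compounds the along-valley progress, so momentum ends with a far smaller loss from the same start.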

[interactive: vanilla vs momentum — a narrow ravine · f(x,y) = 0.05x² + 4y² · same start, same α · vanilla GD: x ← x − α∇f · momentum: v ← βv − α∇f, x ← x + v]

Blue is vanilla. Gold is momentum. Same start, same α — one of them ends near the minimum, the other is still zigzagging. The modern optimizer zoo (Adam, RMSProp, Adafactor) is a set of variations on the same idea: carry state across steps so a single noisy gradient can't knock you off course. We'll build momentum from scratch in a later lesson. For now just note: this is what people mean when they say “SGD with momentum.”

Break it on purpose

Bump lr past 0.5. Within a handful of steps x starts bouncing across zero. Push it past 1.0 and it runs away to infinity. That's the convergence condition from (1.2), |1 − 2α| < 1, enforcing itself in real code.

Try it. Change lr, re-run, then check the trail against the rule.

starter · break_it.py
# Vanilla gradient descent on f(x) = x^2. Starts at x = 5.
# Challenge: find the largest lr for which |x_25| < 1e-3.
lr = 0.1            # try 0.5, 0.9, 1.0, 1.1 — watch the regime flip
steps = 25
x = 5.0
trail = [x]
for _ in range(steps):
    grad = 2 * x
    x = x - lr * grad
    trail.append(x)

print(f"lr = {lr}")
print(f"final x       = {x:+.5f}")
print(f"max |x| seen  = {max(abs(v) for v in trail):+.5f}")
print(f"converged?    = {abs(x) < 1e-3}")

What to carry forward. Gradient descent is the loop inside every training algorithm in ML — not optional, load-bearing. The learning rate isn't just a hyperparameter; it's a convergence condition, and breaking it breaks the model completely. The three-layer progression — pure Python, NumPy, PyTorch — is the same progression we'll use for every algorithm in this series: see the mechanics, scale them up, then cede them to the library. Every layer above is a shortcut for a layer below, never magic.

Next up — Sigmoid & ReLU. Gradient descent needs a function to differentiate. Inside a neural net that function is a stack of matrix multiplies with a little shape-bending non-linearity wedged between each layer — and the whole thing collapses into a single straight line without those non-linearities. The two most common are sigmoid and ReLU. Small, humble, and they decide what “firing” means for a neuron. Their derivatives plug directly into the update rule you just learned — and one of them has a bad habit of murdering gradients entirely. We'll find out which, and why.

quiz

You triple the learning rate on a model that was training fine at lr=0.01. The loss shoots up, then to NaN. What's the mechanical story?
