Training Diagnostics

Loss curves, grad norms, and the art of debugging.

Medium · ~15 min read · lesson 2 of 4

Loss is going down. That's the first thing everyone checks, and it's the first thing that lies to you. Your training loop is humming, the curve is sliding down and to the right, and somewhere inside the model a layer has been clinically dead for three hundred steps. You wouldn't know. Not from loss alone.

This is the stethoscope lesson. Training a neural net looks fine from the outside — one number, trending in the right direction — but inside, things quietly rot. A layer dies. Gradients vanish. One weight dominates. Updates shrink to nothing. A bad initialization smuggles in a disaster on step zero that doesn't surface until step eight thousand. You can't fix what you can't see, and “the loss is decreasing” sees almost nothing.

So we do what doctors do. We don't ask one question. We run the exam. Training diagnostics is the full physical: the loss curve as your EKG, the gradient norm as your pulse, the parameter/update ratio as your blood pressure, activation histograms as your skin tone, the learning-rate schedule as your sleep hygiene. A healthy model has a characteristic rhythm on each instrument. When the rhythm changes, you diagnose — you don't just keep training and hope the patient walks it off.

Start with the EKG. Below is a gallery of the loss curves you will actually see in the wild. Read them like a doctor reads a rhythm strip: each shape is a diagnosis, each diagnosis has a standard intervention. The skill — and it is a real skill, one you will use on every project for the rest of your career — is recognizing the shape inside thirty seconds and naming the cause.

Interactive: loss curve gallery — six failure modes, one plot (log-y; click a card to load). The textbook card reads — diagnosis: loss decays roughly as 1/t, noise is small relative to signal, val stays just above train. Prescription: ship it, but log anyway in case something regresses next epoch (train 0.259 · val 0.368 · gap 0.109).

Most of these are fixed with one knob: a smaller learning rate, a regularizer, more data. One is different. The diverging curve is the only one where every additional step makes the patient worse. If the loss turns upward after step 100 and you keep training, you are paying GPU time to actively ruin your model. Stop the run.
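"Stop the run" can be automated in a few lines. A minimal sketch — the names `diverging`, `window`, and `factor` are illustrative, not a standard API — comparing the recent average loss against the best loss ever seen:

```python
def diverging(losses, window=50, factor=2.0):
    """Crude divergence guard: True when the recent average loss
    sits well above the best loss observed so far."""
    if len(losses) < window:
        return False
    recent = sum(losses[-window:]) / window
    return recent > factor * min(losses)
```

Wire it into the training loop and abort (or restore a checkpoint with a lower LR) the first time it fires; every step past that point is GPU time spent making the model worse.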

Loss curve (personified)
I am the first thing a competent engineer looks at, and I am also the most reliable liar in the room. I will not tell you why your model is failing — but I will tell you, within seconds, that it is, and what kind of failure it is. Learn my shapes. I have six of them. That's enough to diagnose nine problems in ten. For the tenth, you're going to need the other instruments.
the per-batch loss variance — why curves are noisy at all
L_batch = (1/|B|) · Σ_{i ∈ B} ℓ(θ; xᵢ, yᵢ)

Var(L_batch) = Var(ℓ) / |B|

Halve the batch size → double the variance of the curve's wiggle (√2× the amplitude).
Apply an exponential moving average to the loss log → smoother, but lagged.
If the raw curve swings far outside what the EMA predicts, your LR is probably too high.
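The EMA mentioned above is two lines of state. A minimal sketch, assuming the usual recursive form initialized at the first value (TensorBoard's smoother is a close relative):

```python
def ema(values, beta=0.98):
    """Exponential moving average of a loss log: smoother, but lagged.
    Higher beta = smoother curve = longer lag behind the raw signal."""
    out, avg = [], values[0]
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out
```

The lag is the price of the smoothing: after a sudden jump in the raw loss, the EMA takes on the order of 1/(1 − beta) steps to catch up, which is exactly why you inspect the raw curve when hunting a divergence.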

Here is the lie you need to know about, because it will cost you a week the first time it happens. A loss curve can slope gently down while a chunk of the model is dead. Gradients upstream don't reach the dead layer, the live parameters keep fitting the residual, and the overall loss improves — just slower than it should, with a ceiling you cannot explain. The EKG looks fine. The patient is quietly missing a lung. This is where the stethoscope moves to the next instrument.

Loss tells you what. Gradient norms tell you where. If loss is going nowhere, or going somewhere too slowly, the next listen is per-layer gradient magnitude. Vanishing in the deep layers? You have an activation-or-init problem, and the fix lives in sigmoid/ReLU choice or Xavier/He scaling. Exploding? You need gradient clipping or a smaller step. Cycle through the four regimes in the widget and watch the per-layer bars while the loss looks roughly identical on top.
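You can reproduce the vanishing regime without a framework. A sketch under stated assumptions — a toy sigmoid MLP with a deliberately small init (`scale` below Xavier), backpropagating a unit error and recording the gradient norm that reaches each layer:

```python
import numpy as np

def per_layer_grad_norms(n_layers=6, width=64, scale=0.5, seed=0):
    """Backprop a unit error through a deep sigmoid MLP and return the
    gradient norm reaching each layer (index 0 = layer nearest the input).
    scale < 1 deliberately undershoots Xavier init to expose vanishing."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    Ws, acts = [], []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * scale / np.sqrt(width)
        a = 1.0 / (1.0 + np.exp(-(W @ a)))    # sigmoid activation
        Ws.append(W); acts.append(a)
    g = np.ones(width)                         # unit error at the top
    norms = []
    for W, a in zip(reversed(Ws), reversed(acts)):
        g = W.T @ (g * a * (1.0 - a))          # chain rule through sigmoid
        norms.append(float(np.linalg.norm(g)))
    return norms[::-1]
```

Each layer multiplies the backward signal by at most ||W|| × 0.25 (the sigmoid derivative's maximum), so with a sub-Xavier init the norm decays geometrically toward layer 1 — the exact pattern the widget's bars show in its vanishing regime.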

Interactive: per-layer gradient norms — the diagnostic you'll use most. log₁₀ scale · 6-layer net · green band = healthy. Readouts: a verdict (here: healthy — all layers receive gradient) and the L1/L6 gradient-norm ratio.
Gradient norm (personified)
I tell you whether signal is reaching every layer. Averaged across all parameters I'm a blunt instrument — one number for the whole body. Logged per-layer I'm surgical. If I'm 10¹ at layer 1 and 10⁻⁹ at layer 6, your loss can look perfectly healthy while forty percent of your model has stopped learning. Read me layer by layer or don't read me at all.

Two instruments down. The third asks a different question — not is the model training, but is the model learning the right thing. A model that memorizes your training set will keep getting better on train loss while val loss climbs. Train goes to zero; the model is an honor student on the homework and a disaster on the exam. You can produce this failure at will by shrinking the dataset. You can cure it with regularization.

Interactive: overfitting — the train/val gap and how to close it. Synthetic classifier · rose shading = generalisation gap. Readouts: gap 0.047, verdict: no overfit.
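The shrink-the-dataset failure is easy to reproduce in a few lines of NumPy. A hypothetical toy: fit a degree-9 polynomial to 10 noisy points — it interpolates the training set, driving train MSE toward zero, while the held-out error stays at noise level or worse. The difference is the generalisation gap:

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr = rng.uniform(-1, 1, 10)
y_tr = np.sin(3 * x_tr) + 0.1 * rng.standard_normal(10)
x_va = rng.uniform(-1, 1, 200)
y_va = np.sin(3 * x_va) + 0.1 * rng.standard_normal(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# 10 coefficients, 10 points: the fit interpolates the training set
fit9 = np.polyfit(x_tr, y_tr, 9)
gap = mse(fit9, x_va, y_va) - mse(fit9, x_tr, y_tr)
print(f"train={mse(fit9, x_tr, y_tr):.2e}  val={mse(fit9, x_va, y_va):.2e}  gap={gap:.2e}")
```

Train loss near machine precision, val loss orders of magnitude higher: the honor-student-on-homework failure, produced at will.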

Now write the instruments. Same three-layer progression as every other lesson in the course. Pure Python: print every loss and grad norm by hand — good for understanding, useless at scale. NumPy: aggregate the numbers so you can plot them later. PyTorch: hook into the real machinery with TensorBoard and per-layer backward hooks. The third layer is what production training scripts actually do.

layer 1 — pure python · minimal_diagnostics.py
python
import random, math
random.seed(0)

w, b = 0.1, 0.0
lr = 0.05
data = [(random.gauss(0,1), random.gauss(0,1)) for _ in range(100)]  # (x, unused noise)
targets = [2 * x - 1 for x, _ in data]  # ground truth: y = 2x - 1

for step in range(50):
    # Forward + loss + grad
    total_loss = 0.0
    gw = 0.0; gb = 0.0
    for (x, _), y in zip(data, targets):
        yhat = w * x + b
        err = yhat - y
        total_loss += err * err
        gw += 2 * err * x
        gb += 2 * err
    total_loss /= len(data); gw /= len(data); gb /= len(data)

    # Grad norm (per "parameter", here a scalar each)
    grad_norm = math.sqrt(gw * gw + gb * gb)

    # The tracing you need: loss, grad_norm, param_norm, LR
    if step % 10 == 0:
        print(f"step {step:3d}  loss={total_loss:.4f}  |grad|={grad_norm:.4f}  "
              f"|param|={math.sqrt(w*w + b*b):.4f}")

    w -= lr * gw; b -= lr * gb
layer 2 — numpy · aggregated diagnostics
python
import numpy as np

def run_with_diagnostics(X, y, lr=0.05, steps=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    logs = {"loss": [], "grad_norm": [], "param_norm": []}
    for _ in range(steps):
        yhat = X @ w + b
        err = yhat - y
        loss = (err * err).mean()
        gw = (2 * X.T @ err) / len(X)
        gb = (2 * err).mean()
        grad_norm = np.sqrt((gw ** 2).sum() + gb ** 2)
        logs["loss"].append(loss)
        logs["grad_norm"].append(grad_norm)
        logs["param_norm"].append(np.sqrt((w ** 2).sum() + b ** 2))
        w -= lr * gw; b -= lr * gb
    return logs
pure python → numpy
print every 10 steps ←→ append every step to a dict of lists (can plot / analyse later)
manual sum of squared grads ←→ (gw ** 2).sum() via numpy (scales to million-parameter models)

layer 3 — pytorch · production diagnostics
python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
writer = SummaryWriter()

# Register a hook on each Linear to record per-layer grad norm
per_layer_norms = {}
def make_hook(name):
    def hook(module, grad_input, grad_output):
        # grad_output[0] is dL/d(output of this module) — one norm per layer
        per_layer_norms[name] = grad_output[0].norm().item()
    return hook

for name, m in model.named_modules():
    if isinstance(m, nn.Linear):
        m.register_full_backward_hook(make_hook(name))

for step in range(1000):
    optimizer.zero_grad()
    x = torch.randn(64, 10); y = torch.randn(64, 1)
    yhat = model(x)
    loss = nn.functional.mse_loss(yhat, y)
    loss.backward()

    # The diagnostics you want on every run
    writer.add_scalar("loss", loss.item(), step)
    total_grad_norm = sum((p.grad ** 2).sum().item() for p in model.parameters() if p.grad is not None) ** 0.5
    writer.add_scalar("grad_norm/total", total_grad_norm, step)
    for name, n in per_layer_norms.items():
        writer.add_scalar(f"grad_norm/{name}", n, step)

    optimizer.step()
numpy → pytorch
manual dict of lists ←→ tensorboard SummaryWriter (live plots in the browser as training runs)
aggregate grad norm ←→ per-layer grad norm via backward hooks (where the diagnostic signal actually is)
loss only ←→ loss + grad_norm + param_norm + lr (the four quantities every training script should log)

Gotchas

Looking only at train loss. Train loss going to zero means nothing — it means the model memorised the training set. Always log val loss. Always. Even on tiny runs.

Smoothing too aggressively. TensorBoard's default EMA can hide loss spikes for 50+ steps. When investigating a divergence, look at the raw unsmoothed curve. The EMA is the patient's calm voice in the waiting room; the raw curve is the actual vitals.

Not logging LR. When something weird happens at step 8000, the answer is often “the scheduler just dropped LR by 10×” or “warmup just finished.” Log LR on the same timestep axis as loss.

Confusing EMA loss with raw loss. The number you print can be an EMA (smooth but lagged) or the raw per-batch value (current but noisy). Two different numbers; don't conflate them when comparing runs.
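The LR gotcha is cheap to fix: compute (or record) the scheduled LR at every step and log it on the same x-axis as the loss. A toy warmup-then-decay schedule — the name `lr_at` and its knobs are illustrative, not a real scheduler API:

```python
def lr_at(step, base=1e-3, warmup=500, decay=0.999):
    """Linear warmup to `base`, then exponential decay.
    Log this value at every step, next to the loss."""
    if step < warmup:
        return base * (step + 1) / warmup       # linear warmup
    return base * decay ** (step - warmup)      # exponential decay
```

With the LR on the same plot, "something weird at step 8000" usually resolves itself in one glance: the kink in the loss lines up with the kink in the schedule.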

Reproduce 4 failure modes on purpose

Take a small MLP on MNIST. Reproduce, in order: (1) diverging loss by setting LR to 10. (2) A plateau by setting LR to 1e-6. (3) Overfit by shrinking the dataset to 100 examples. (4) Vanishing gradients by using sigmoid in every hidden layer of a 10-layer network. Log train loss, val loss, and total gradient norm for each. Save the plots. Assemble a one-page visual guide of “these are the four most common problems, and this is what they look like.” Tape it above your desk. You will use it.

What to carry forward. Training curves tell you what kind of failure you have. Gradient norms per layer tell you where the failure lives. The generalisation gap tells you whether you're learning or memorising. Log all three on every training run, from the smallest experiment to the largest pre-train. Loss alone is a rhythm strip, not a diagnosis. The time you save debugging with the full exam is enormous and the added cost is basically zero.

Next up — Dead ReLU Detector. Of all the things that can silently rot inside a healthy-looking curve, one failure mode is so common, so invisible to aggregate stats, and so lethal to capacity that it earns its own lesson. Gradient norms can look fine on average while half your neurons are clinically dead — no signal in, no signal out, no gradient, no recovery. The next lesson builds the dedicated instrument for that hunt.
