Cross-Entropy Loss

The canonical classification loss — from KL divergence down.

Medium
~15 min read
lesson 4 of 6

Picture a pop quiz where you don't just answer — you also write down how sure you are. 99% confident and right? Full marks, basically free. 99% confident and wrong? The teacher takes out a very large red pen. And if you shrug and spread your bet evenly across every option, you get graded somewhere in the middle no matter what the real answer is.

That grading rubric is cross-entropy loss. It's the loss function that reads confidence, not just correctness. A softmax output is a claim about the world — “I'm 80% sure it's a cat” — and cross-entropy is the receipt the world hands back. Confident and right is cheap. Confident and wrong is ruinous. Uncertain is mediocre either way.

You could, technically, just use squared error. Subtract the predicted probabilities from the truth, square it, move on. Every deep-learning library ships a cross-entropy op instead. By the end of this page you'll know why — and why the gradient through the whole softmax-then-cross-entropy stack collapses to an expression so short you could tattoo it on your wrist: p − y.

Start concrete. Your model produced a probability distribution p = [p₁, p₂, …, p_K] over K classes. The training label is the truth — also a distribution, just a boring one: 1 on the correct class, 0 everywhere else. Call that one-hot vector y, and call the index it points at t. Cross-entropy is one line.

cross-entropy — the full thing and its one-hot simplification
H(y, p)   =   − Σᵢ yᵢ · log pᵢ

          =   − log p_t           (because y is one-hot)

The second line is the one you'll use in practice. Take the probability your model assigned to the correct class, log it, flip the sign. That's your score on the quiz. This is also why you'll hear cross-entropy called negative log likelihood in half the papers you read — same number, different accent.
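To see that the full sum and the one-hot shortcut really are the same number, here's a minimal sketch in pure Python; the probabilities are made up for illustration.

```python
import math

def cross_entropy_full(y, p):
    # H(y, p) = -sum_i y_i * log p_i  (the full definition; skip zero terms)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def cross_entropy_onehot(p, t):
    # one-hot shortcut: -log p_t
    return -math.log(p[t])

p = [0.8, 0.15, 0.05]   # predicted distribution (illustrative)
y = [1.0, 0.0, 0.0]     # one-hot truth: class 0 is correct
print(cross_entropy_full(y, p))    # ≈ 0.2231, i.e. -log 0.8
print(cross_entropy_onehot(p, 0))  # same number
```

The zero entries of y erase every term except the true class, which is exactly why the "negative log likelihood" reading works.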

Drop the target onto any class below, then drag the predicted bars around. Perfect match gives 0. Uniform guessing over 5 classes — that's the “I have no idea” answer on the quiz — gives log(5) ≈ 1.61 no matter which class is correct. Confidently wrong sends the loss toward infinity, which is the whole point of the next few paragraphs.

cross-entropy — the distance from right
H(y, p) = − log p[target]

predicted p (drag any row · values auto-normalise):
cat 20.0% · dog 20.0% · bird 20.0% · fish 20.0% · fox 20.0%

target y (one-hot — exactly one class is correct):
cat 100% · dog 0% · bird 0% · fish 0% · fox 0%

p(target) = 0.200  →  loss = 1.609
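The uniform-guess number in the demo is easy to check by hand: with p = 1/K on every class, the loss is log K no matter which class turns out to be correct. A quick sketch:

```python
import math

K = 5
p_uniform = [1 / K] * K
# try every possible target class — the loss never changes
losses = [-math.log(p_uniform[t]) for t in range(K)]
print(losses[0])  # ≈ 1.609, i.e. log(5)
print(len(set(round(l, 12) for l in losses)))  # 1 distinct value
```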
Cross-entropy (personified)
I am not a reasonable grader. A 70% confident right answer costs you 0.36 — sure, fine, have your lunch money. A 99% confident right answer costs basically zero. But predict 0.01 for the true class and I charge you 4.6, and I keep climbing toward infinity as your confidence in the wrong answer approaches 1. Be uncertain when you're wrong. Be sure when you're right. Anything else is expensive.

Worth pausing on the alternative, because it's the one most people reach for first. Squared error is the loss you grew up with: subtract, square, average. It works beautifully on regression — that lesson is coming — and it's tempting to just point it at probability vectors and call it a day.

Try it. If the truth is 1 and you predict 0.99, MSE gives you 0.0001. If you predict 0.01, MSE gives you 0.98. Those two outcomes — nearly right, completely wrong — differ by a factor of about ten thousand. Cross-entropy puts them at 0.01 versus 4.6: a factor of roughly five hundred. That makes MSE sound like the harsher grader, but the problem is the gradient, not the loss value: push MSE backward through the sigmoid or softmax that produced p, and the saturated activation's near-zero slope shrinks the update to almost nothing. The model gets almost no gradient signal exactly when it's confidently wrong, which is the moment you most need to yell at it. Cross-entropy's log cancels that saturation.

Put another way: MSE grades your pop quiz by counting how far off your probability was. Cross-entropy grades it by asking how surprised reality was to hear your answer. The second one trains faster, because surprise scales with confidence, and confidence is what you actually want the network to calibrate.
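The loss numbers above are easy to reproduce. A sketch comparing the two rubrics at the "nearly right" and "confidently wrong" extremes, treating MSE here as the squared error on the true-class probability alone:

```python
import math

def mse(p):   # squared error on the true-class probability (truth = 1)
    return (1 - p) ** 2

def ce(p):    # cross-entropy on the true-class probability
    return -math.log(p)

for p in (0.99, 0.01):
    print(f"p={p}: mse={mse(p):.4f}  ce={ce(p):.2f}")
# MSE: 0.0001 vs 0.9801 — a ~10,000x spread
# CE:  0.01   vs 4.61   — a ~460x spread, but with gradient that grows as p -> 0
```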

The binary case is worth staring at alone. One output p — the predicted probability that y = 1 — and the loss collapses to a two-term sum:

binary cross-entropy — the workhorse of every sigmoid output
L   =   −[ y · log p   +   (1 − y) · log(1 − p) ]

When y = 1:   L = − log p           (punishes low confidence in a true positive)
When y = 0:   L = − log(1 − p)      (punishes confident false positives)

Two curves, one for each value of the true label. Slide p and watch the loss climb asymptotically toward the wrong answer. Flip the true label and the mirror-image curve takes over. Both are the same quiz rubric, just looking at the problem from the two sides.
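A sketch of the two-term form, confirming that each branch reduces to the matching curve:

```python
import math

def bce(y, p):
    # L = -[y*log p + (1-y)*log(1-p)]
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce(1, 0.7))   # -log(0.7)  ≈ 0.357 — right and moderately confident
print(bce(0, 0.7))   # -log(0.3)  ≈ 1.204 — same prediction, wrong label
print(bce(1, 0.99))  # ≈ 0.01 — confident and right is nearly free
```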

binary cross-entropy — how confidence translates to loss
L = −[y·log p + (1−y)·log(1−p)]
example readout: true y = 1, p = 0.70  →  loss = −log 0.7 ≈ 0.357 (verdict: reasonable)

Here's the part that makes libraries fuse softmax and cross-entropy into a single op. In practice your network doesn't output p directly — it outputs raw logits z, and softmax turns those into p. The loss depends on p, which depends on z, so the gradient that the descent step actually sees is ∂L/∂z. You'd expect an awful mess: softmax has exponentials, cross-entropy has a log, and they're stacked. They cancel.

the gradient that changes everything
∂L/∂zᵢ  =  pᵢ − yᵢ

(yes, it really is just p − y)

The gradient on each logit is the predicted probability minus the target probability. Every exponential and every log in the forward pass has annihilated itself in the backward pass. This isn't luck — softmax and cross-entropy were designed as partners, and this cancellation is the reason the pair is called the canonical classification head.
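You don't have to take the cancellation on faith: a finite-difference check against p − y takes a few lines of numpy. The logits below are the same illustrative values as in the demo that follows.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                       # stable shift
    return z - np.log(np.exp(z).sum())

logits = np.array([1.8, 0.9, 0.2, -0.5, -1.2])
t = 0                                     # true class
p = np.exp(log_softmax(logits))
analytic = p - np.eye(len(logits))[t]     # the claimed gradient: p - y

# central finite differences on L(z) = -log_softmax(z)[t]
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    zp, zm = logits.copy(), logits.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = ((-log_softmax(zp)[t]) - (-log_softmax(zm)[t])) / (2 * eps)

print(np.abs(analytic - numeric).max())   # tiny — the two agree
```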

softmax + cross-entropy = the prettiest gradient in ML
∂L/∂zᵢ = pᵢ − yᵢ

class         logit z   p = softmax(z)   gradient p − y
cat (true)      1.80        56.9%            −0.431
dog             0.90        23.1%            +0.231
bird            0.20        11.5%            +0.115
fish           −0.50         5.7%            +0.057
fox            −1.20         2.8%            +0.028

loss = 0.564 · ‖grad‖ = 0.507

Rose bars push a logit down, green bars push it up: the true class always gets a green push, and every other class gets a rose push proportional to how much probability mass it stole.

Drag any logit. Every row's gradient updates live. The true class (green) always gets a negative gradient — “push this logit up.” Every other class gets a positive gradient — “push this logit down, you stole probability mass that wasn't yours to claim.” The size of the push on each wrong class is proportional to how much mass it's currently hoarding. The update rule balances itself.
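The "balances itself" claim is exact: since both p and y sum to 1, the entries of p − y always sum to zero — every bit of downward push on the wrong classes is matched by the upward push on the true class. A quick check with the logits from the demo:

```python
import numpy as np

logits = np.array([1.8, 0.9, 0.2, -0.5, -1.2])
p = np.exp(logits - logits.max())
p /= p.sum()                    # softmax
grad = p - np.eye(5)[0]         # true class = cat (index 0)
print(np.round(grad, 3))        # [-0.431  0.231  0.115  0.057  0.028]
print(abs(grad.sum()) < 1e-12)  # True — the pushes cancel exactly
```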

Gotchas

log(0) is negative infinity, and floats know it. The quiz rubric's "infinity punishment for confident-wrong" is a beautiful property in math and a disaster in code. If your model's probability for the true class ever underflows to exactly 0, −log(0) returns inf, the gradient becomes nan, and the nan propagates into every parameter downstream on the next update. The fix: never compute softmax and then log as separate steps. Use the fused log-softmax form — log_softmax(z) = (z − m) − log Σ exp(z − m), with m = max(z) — so the log never sees an intermediate that has rounded to 0.
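The failure mode is easy to trigger. With extreme logits, naive softmax-then-log overflows and underflows its intermediates, while the shifted log-softmax gives a finite answer from the same inputs. A sketch (the logit values are deliberately extreme):

```python
import numpy as np

logits = np.array([1000.0, 0.0])   # extreme but perfectly legal logits

# naive: softmax first, then log — exp(1000) overflows, small probs hit 0
with np.errstate(over='ignore', invalid='ignore', divide='ignore'):
    naive = np.log(np.exp(logits) / np.exp(logits).sum())
print(naive)                        # nan / -inf — gradients are now poison

# stable: shift by the max before exponentiating
z = logits - logits.max()
stable = z - np.log(np.exp(z).sum())
print(stable)                       # [0., -1000.] — finite and correct
```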

Never hand-compute softmax, then log, then cross-entropy as three calls. Use your library's fused op — nn.CrossEntropyLoss in PyTorch, SparseCategoricalCrossentropy(from_logits=True) in Keras. It's more numerically stable and uses the clean p − y gradient directly.

CrossEntropyLoss expects logits, not probabilities. Feed it raw output — not the result of applying softmax yourself. If you softmax-then-crossentropy, you're double-softmaxing and your gradients are wrong. This is the single most common bug in beginner PyTorch code.

Label smoothing is a cheap regulariser that borrows straight from the quiz metaphor: force the teacher to stop accepting 100%-confident answers even when they're right. Replace the hard one-hot target with 0.9 for the correct class and 0.1 / (K−1) spread across the rest. Cross-entropy is now graded against a smoothed target, which prevents the model from becoming the annoying student who bets everything on one answer.
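As a sketch of the recipe above (ε = 0.1, so 0.9 on the true class and 0.1/(K−1) on each of the rest — the model's probabilities below are illustrative):

```python
import numpy as np

def smooth_onehot(t, K, eps=0.1):
    # (1 - eps) on the true class, eps/(K-1) spread over the rest
    y = np.full(K, eps / (K - 1))
    y[t] = 1 - eps
    return y

def cross_entropy(y, p):
    return -(y * np.log(p)).sum()

p = np.array([0.97, 0.01, 0.01, 0.005, 0.005])  # a very confident model
y_hard = np.eye(5)[0]
y_soft = smooth_onehot(0, 5)

print(cross_entropy(y_hard, p))  # ≈ 0.030 — hard target rewards the confidence
print(cross_entropy(y_soft, p))  # noticeably higher — the smoothed target taxes it
```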

Three layers as always. Read each version top to bottom, and watch the boundary between softmax and the loss dissolve as you climb the abstraction stack.

cross_entropy_scratch.py
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target_idx):
    # H(y, p) = -log p[target]
    return -math.log(max(probs[target_idx], 1e-12))

logits = [2.0, 1.2, 0.3, -0.8, -2.0]
probs = softmax(logits)
loss = cross_entropy(probs, target_idx=0)
print(f"probs={[round(p, 4) for p in probs]}")
print(f"loss={loss:.4f}")
stdout
probs=[0.5844, 0.2626, 0.1068, 0.0355, 0.0107]
loss=0.5372
pure python → numpy (fused log-softmax + NLL)
probs = softmax(z); loss = -log(probs[t])←→loss = -(logits - logsumexp(logits))[t]

fuses the two ops — avoids computing probs[t] then logging it

one example at a time←→logits[arange(N), targets] # pick per-row

fancy indexing — no Python loop over the batch
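Putting the two numpy moves together — a sketch of a batched, numerically stable cross-entropy (the batch values are illustrative):

```python
import numpy as np

def batched_cross_entropy(logits, targets):
    # row-wise stable log-softmax: shift by each row's max, then logsumexp
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # fancy indexing: each row's true-class log-prob, no Python loop
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 1.2, 0.3],
                   [0.1, 0.5, 1.5]])
targets = np.array([0, 2])
print(batched_cross_entropy(logits, targets))  # ≈ 0.484
```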

numpy → pytorch
log_probs = logits - log_sum_exp ; loss = -log_probs[i, t]←→F.cross_entropy(logits, targets)

one call, numerically stable, GPU-aware, autograd-ready

grad = probs - onehot(target)←→loss.backward()

autograd computes exactly p - y — the identity from above

Double-softmax bug hunt

A sneaky one. Two versions feed the same cross-entropy routine. Version A sends logits. Version B sends probabilities — which is wrong, because the routine already applies log-softmax internally. Version B still trains; it just learns much more slowly, because its gradients are squashed. The scariest kind of bug: the one that doesn't crash.

The starter runs both on identical inputs and prints the loss plus the gradient norm. Compare. B's gradient is visibly smaller.

starter · double_softmax.py
import numpy as np

# A single example, 3 classes, label = class 0.
logits = np.array([2.0, 1.0, 0.1])
y_true = 0

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# Version A: correct. cross_entropy(logits, y) = -log_softmax(logits)[y]
# Gradient w.r.t. logits: softmax(logits) - onehot(y)
loss_A = -log_softmax(logits)[y_true]
grad_A = softmax(logits) - np.eye(3)[y_true]

# Version B: bug. We already applied softmax, then shove it back through log_softmax.
probs  = softmax(logits)
loss_B = -log_softmax(probs)[y_true]
# Chain rule: dL/dlogits = (softmax(probs) - onehot) @ J_softmax(logits)
# Easier to just get the end-to-end grad numerically.
eps = 1e-5
grad_B = np.zeros_like(logits)
for i in range(len(logits)):
    lp = logits.copy(); lp[i] += eps
    lm = logits.copy(); lm[i] -= eps
    f = lambda L: -log_softmax(softmax(L))[y_true]
    grad_B[i] = (f(lp) - f(lm)) / (2 * eps)

print(f"A  loss = {loss_A:.5f}    grad norm = {np.linalg.norm(grad_A):.5f}")
print(f"B  loss = {loss_B:.5f}    grad norm = {np.linalg.norm(grad_B):.5f}")
print(f"B gradient is {np.linalg.norm(grad_B)/np.linalg.norm(grad_A):.2%} the size of A's.")

What to carry forward. Cross-entropy is the confidence-scored quiz: the loss scales with how surprised the truth was to hear your answer. That's why confident-wrong diverges and confident-right is nearly free — the log was built for exactly that shape. Softmax and cross-entropy fit together so cleanly that the gradient on the raw logits collapses to p − y, which is the entire reason every library fuses them into one op. Never call softmax before cross-entropy yourself — use F.cross_entropy and hand it logits.

Next up — Linear Regression (Forward). We've been grading classifiers. Time to step back and look at the simplest model that outputs a number instead of a class: a linear predictor. No softmax, no probabilities, just y = Wx + b. You'll wire it up as a matrix multiply, visualise the whole forward pass, and set up the training lesson where you'll fit it two ways — closed form and gradient descent — and watch them disagree about which one should have won.
