Cross-Entropy Loss
The canonical classification loss — from KL divergence down.
Picture a pop quiz where you don't just answer — you also write down how sure you are. 99% confident and right? Full marks, basically free. 99% confident and wrong? The teacher takes out a very large red pen. And if you shrug and spread your bet evenly across every option, you get graded somewhere in the middle no matter what the real answer is.
That grading rubric is cross-entropy loss. It's the loss function that reads confidence, not just correctness. A softmax output is a claim about the world — “I'm 80% sure it's a cat” — and cross-entropy is the receipt the world hands back. Confident and right is cheap. Confident and wrong is ruinous. Uncertain is mediocre either way.
You could, technically, just use squared error. Subtract the predicted probabilities from the truth, square it, move on. Every deep-learning library ships a cross-entropy op instead. By the end of this page you'll know why — and why the gradient through the whole softmax-then-cross-entropy stack collapses to an expression so short you could tattoo it on your wrist: p − y.
Start concrete. Your model produced a probability distribution p = [p₁, p₂, …, p_K] over K classes. The training label is the truth — also a distribution, just a boring one: 1 on the correct class, 0 everywhere else. Call that one-hot vector y, and call the index it points at t. Cross-entropy is one line.
H(y, p) = − Σᵢ yᵢ · log pᵢ
= − log p_t (because y is one-hot)

The second line is the one you'll use in practice. Take the probability your model assigned to the correct class, log it, flip the sign. That's your score on the quiz. This is also why you'll hear cross-entropy called negative log likelihood in half the papers you read — same number, different accent.
Drop the target onto any class below, then drag the predicted bars around. Perfect match gives 0. Uniform guessing over 5 classes — that's the “I have no idea” answer on the quiz — gives log(5) ≈ 1.61 no matter which class is correct. Confidently wrong sends the loss toward infinity, which is the whole point of the next few paragraphs.
I am not a reasonable grader. A 70% confident right answer costs you 0.36 — sure, fine, have your lunch money. A 99% confident right answer costs basically zero. But predict 0.01 for the true class and I charge you 4.6, and I keep climbing toward infinity as your confidence in the wrong answer approaches 1. Be uncertain when you're wrong. Be sure when you're right. Anything else is expensive.
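Those prices fall straight out of −log p. A quick check with nothing but the standard library:

```python
import math

# Loss = -log(p_true): the grader's fee at a few confidence levels
for p_true in (0.99, 0.70, 0.10, 0.01):
    print(f"p_true = {p_true:.2f} -> loss = {-math.log(p_true):.2f}")
# -> 0.01, 0.36, 2.30, 4.61: the fee explodes as confidence in the truth drops
```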
Worth pausing on the alternative, because it's the one most people reach for first. Squared error is the loss you grew up with: subtract, square, average. It works beautifully on regression — that lesson is coming — and it's tempting to just point it at probability vectors and call it a day.
Try it. If the truth is 1 and you predict 0.99, MSE gives you 0.0001. If you predict 0.01, MSE gives you 0.98. Those two outcomes — nearly right, completely wrong — differ by a factor of about ten thousand. Cross-entropy puts them at 0.01 versus 4.6: a factor of five hundred. In MSE the wrong-end of the curve is nearly flat; the model gets almost no gradient signal when it's confidently wrong, which is exactly the moment you most need to yell at it.
Put another way: MSE grades your pop quiz by counting how far off your probability was. Cross-entropy grades it by asking how surprised reality was to hear your answer. The second one trains faster, because surprise scales with confidence, and confidence is what you actually want the network to calibrate.
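To make the flat-gradient claim concrete, here is a small sketch for a single sigmoid output (an illustrative setup I've chosen, not code from this lesson: one logit z, true label 1, the model confidently wrong):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = -5.0            # logit; true label is 1, so this prediction is confidently wrong
p = sigmoid(z)      # about 0.0067

# d/dz of MSE (1 - p)^2: the chain rule drags in sigmoid's slope p * (1 - p),
# which is nearly zero at the extremes, so the signal goes flat.
grad_mse = -2 * (1 - p) * p * (1 - p)

# d/dz of cross-entropy -log p: the sigmoid slope cancels, leaving p - 1.
grad_ce = p - 1

print(f"MSE gradient on the logit: {grad_mse:+.4f}")  # about -0.013
print(f"CE  gradient on the logit: {grad_ce:+.4f}")   # about -0.993
```

The cross-entropy gradient stays near −1 exactly where MSE's has collapsed, which is why the confidently-wrong model gets yelled at.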
The binary case is worth staring at alone. One output p — the predicted probability that y = 1 — and the loss collapses to a two-term sum:
L = −[ y · log p + (1 − y) · log(1 − p) ]
When y = 1: L = − log p (punishes underconfident positives)
When y = 0: L = − log(1 − p) (punishes confident false positives)
Two curves, one for each value of the true label. Slide p and watch the loss climb asymptotically toward the wrong answer. Flip the true label and the mirror-image curve takes over. Both are the same quiz rubric, just looking at the problem from the two sides.
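A minimal sketch of the two-term formula; the eps clamp is a guard I've added so log(0) can't fire, and its value is an arbitrary choice:

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    # y is 0 or 1; p is the predicted P(y = 1). Clamp keeps log away from 0.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(f"{binary_cross_entropy(1, 0.90):.4f}")  # right and confident: 0.1054
print(f"{binary_cross_entropy(0, 0.90):.4f}")  # wrong and confident: 2.3026
print(f"{binary_cross_entropy(1, 0.50):.4f}")  # shrugging: 0.6931 = log 2
```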
Here's the part that makes libraries fuse softmax and cross-entropy into a single op. In practice your network doesn't output p directly — it outputs raw logits z, and softmax turns those into p. The loss depends on p, which depends on z, so the gradient the gradient-descent step actually sees is ∂L/∂z. You'd expect an awful mess: softmax has exponentials, cross-entropy has a log, and they're stacked. They cancel.
∂L/∂z_i = p_i − y_i (yes, it really is just p − y)
The gradient on each logit is the predicted probability minus the target probability. Every exponential and every log in the forward pass has annihilated itself in the backward pass. This isn't luck — softmax and cross-entropy were designed as partners, and this cancellation is the reason the pair is called the canonical classification head.
Drag any logit. Every row's gradient updates live. The true class (green) always gets a negative gradient — “push this logit up.” Every other class gets a positive gradient — “push this logit down, you stole probability mass that wasn't yours to claim.” The size of the push on each wrong class is proportional to how much mass it's currently hoarding. The update rule balances itself.
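You can verify the p − y identity without any autograd: compare a central finite difference on each logit against the closed form. A sketch:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

def loss(z, t):
    return -math.log(softmax(z)[t])

logits, t = [2.0, 1.2, 0.3], 0
p = softmax(logits)

eps = 1e-6
for i in range(len(logits)):
    zp = logits[:]; zp[i] += eps
    zm = logits[:]; zm[i] -= eps
    numeric = (loss(zp, t) - loss(zm, t)) / (2 * eps)
    closed = p[i] - (1.0 if i == t else 0.0)   # p - y
    assert abs(numeric - closed) < 1e-5
    print(f"dL/dz_{i}: numeric {numeric:+.6f}  p - y {closed:+.6f}")
```

The true class gets a negative gradient, every other class a positive one, exactly as the visualisation shows.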
log(0) is negative infinity, and floats know it. The quiz rubric's “infinity punishment for confident-wrong” is a beautiful property in math and a disaster in code. If your model ever outputs a probability of exactly 0 for the true class, −log(0) returns inf, the gradient becomes nan, and every parameter downstream is corrupted forever. The fix: never compute softmax then log separately. Use the fused log-softmax trick (subtract the max logit, sum the exps, log that) so the intermediate never touches 0.
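Here's the failure and the fix side by side; the logit value 1000 is just an arbitrary number big enough to overflow exp:

```python
import math

logits = [1000.0, 0.0]   # exp(1000) overflows a 64-bit float

# Naive: softmax first, then log. Pure Python raises; NumPy would give inf/nan.
try:
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    print(-math.log(probs[0]))
except OverflowError:
    print("naive softmax-then-log overflowed")

# Stable: log-sum-exp with the max subtracted first. Nothing overflows.
m = max(logits)
lse = m + math.log(sum(math.exp(z - m) for z in logits))
print(f"stable loss: {lse - logits[0]:.6f}")   # ~0: all-in on class 0 and right
```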
Never hand-compute softmax then log then cross-entropy as three calls. Use your library's fused op — nn.CrossEntropyLoss in PyTorch, sparse_categorical_crossentropy(from_logits=True) in Keras. More numerically stable, uses the clean gradient formula.
CrossEntropyLoss expects logits, not probabilities. Feed it raw output — not the result of applying softmax yourself. If you softmax-then-crossentropy, you're double-softmaxing and your gradients are wrong. This is the single most common bug in beginner PyTorch code.
Label smoothing is a cheap regulariser that borrows straight from the quiz metaphor: force the teacher to stop accepting 100%-confident answers even when they're right. Replace the hard one-hot target with 0.9 on the correct class and 0.1 / (K−1) spread across the rest. Cross-entropy is now graded against a smoothed target, which prevents the model from becoming the annoying student who bets everything on one answer.
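A sketch of the smoothed loss with smoothing factor 0.1, matching the targets described above; the logits are reused from the example below:

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    # Hard one-hot replaced by (1 - smoothing) on the true class and
    # smoothing / (K - 1) on each of the others.
    K = len(logits)
    lp = log_softmax(logits)
    weights = [(1 - smoothing) if i == target else smoothing / (K - 1)
               for i in range(K)]
    return -sum(w * l for w, l in zip(weights, lp))

logits = [2.0, 1.2, 0.3, -0.8, -2.0]
print(f"hard target:     {-log_softmax(logits)[0]:.4f}")   # 0.5372
print(f"smoothed target: {smoothed_cross_entropy(logits, 0):.4f}")  # 0.7697
```

The smoothed loss is strictly higher and never reaches 0, so the model stops being rewarded for driving the true-class probability all the way to 1.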
Three layers as always. Read top to bottom and watch the boundary between softmax and the loss dissolve as you move up the stack.
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target_idx):
    # H(y, p) = -log p[target]
    return -math.log(max(probs[target_idx], 1e-12))

logits = [2.0, 1.2, 0.3, -0.8, -2.0]
probs = softmax(logits)
loss = cross_entropy(probs, target_idx=0)
print(f"probs={[round(p, 4) for p in probs]}")
print(f"loss={loss:.4f}")

probs=[0.5844, 0.2626, 0.1068, 0.0355, 0.0107]
loss=0.5372
import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, targets):
    # logits: (N, K)  targets: (N,) integer indices
    # Combined log-softmax + NLL for numerical stability.
    logits = logits - np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = np.log(np.exp(logits).sum(axis=-1))
    log_probs = logits - log_sum_exp[:, None]
    return -log_probs[np.arange(len(targets)), targets].mean()

# Batch of 3 examples, 5 classes
logits = np.array([
    [2.0, 1.2, 0.3, -0.8, -2.0],
    [-1.0, -0.5, 2.0, 0.1, -0.3],
    [0.5, 0.5, 0.5, 0.5, 0.5],
])
targets = np.array([0, 2, 1])
print(f"loss = {cross_entropy(logits, targets):.4f}")
# -> 0.8233

import torch
import torch.nn.functional as F

logits = torch.tensor([
    [2.0, 1.2, 0.3, -0.8, -2.0],
    [-1.0, -0.5, 2.0, 0.1, -0.3],
    [0.5, 0.5, 0.5, 0.5, 0.5],
], requires_grad=True)
targets = torch.tensor([0, 2, 1])

# F.cross_entropy takes raw logits and integer targets.
# It fuses log_softmax + NLL internally. Never softmax beforehand.
loss = F.cross_entropy(logits, targets)
print(f"loss = {loss.item():.4f}")

loss.backward()
# First row's gradient is (softmax(logits[0]) - onehot(target[0])) / batch_size
print("grads first row:", logits.grad[0])

loss = 0.8233
grads first row: tensor([-0.1385, 0.0875, 0.0356, 0.0118, 0.0036])
- probs = softmax(z); loss = -log(probs[t]) ←→ loss = -(logits - logsumexp(logits))[t] — fuses the two ops; avoids computing probs[t] then logging it
- one example at a time ←→ logits[arange(N), targets] (pick per-row) — fancy indexing; no Python loop over the batch
- log_probs = logits - log_sum_exp; loss = -log_probs[i, t] ←→ F.cross_entropy(logits, targets) — one call, numerically stable, GPU-aware, autograd-ready
- grad = probs - onehot(target) ←→ loss.backward() — autograd computes exactly p − y, the identity from above
A sneaky one. Two versions call cross_entropy. Version A sends logits. Version B sends probabilities — which is wrong, because cross_entropy already applies log-softmax internally. Version B still trains; it just learns much more slowly, because its gradients are squashed. The scariest kind of bug: the one that doesn't crash.
The starter runs both on identical inputs and prints the loss plus the gradient norm. Compare. B's gradient is visibly smaller.
import numpy as np
# A single example, 3 classes, label = class 0.
logits = np.array([2.0, 1.0, 0.1])
y_true = 0
def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())
# Version A: correct. cross_entropy(logits, y) = -log_softmax(logits)[y]
# Gradient w.r.t. logits: softmax(logits) - onehot(y)
loss_A = -log_softmax(logits)[y_true]
grad_A = softmax(logits) - np.eye(3)[y_true]
# Version B: bug. We already applied softmax, then shove it back through log_softmax.
probs = softmax(logits)
loss_B = -log_softmax(probs)[y_true]
# Chain rule: dL/dlogits = (softmax(probs) - onehot) @ J_softmax(logits)
# Easier to just get the end-to-end grad numerically.
eps = 1e-5
grad_B = np.zeros_like(logits)
f = lambda v: -log_softmax(softmax(v))[y_true]
for i in range(len(logits)):
    lp = logits.copy(); lp[i] += eps
    lm = logits.copy(); lm[i] -= eps
    grad_B[i] = (f(lp) - f(lm)) / (2 * eps)
print(f"A loss = {loss_A:.5f} grad norm = {np.linalg.norm(grad_A):.5f}")
print(f"B loss = {loss_B:.5f} grad norm = {np.linalg.norm(grad_B):.5f}")
print(f"B gradient is {np.linalg.norm(grad_B)/np.linalg.norm(grad_A):.2%} the size of A's.")

What to carry forward. Cross-entropy is the confidence-scored quiz: the loss scales with how surprised the truth was to hear your answer. That's why confident-wrong diverges and confident-right is nearly free — the log was built for exactly that shape. Softmax and cross-entropy fit together so cleanly that the gradient on the raw logits collapses to p − y, which is the entire reason every library fuses them into one op. Never call softmax before cross-entropy yourself — use F.cross_entropy and hand it logits.
Next up — Linear Regression (Forward). We've been grading classifiers. Time to step back and look at the simplest model that outputs a number instead of a class: a linear predictor. No softmax, no probabilities, just y = Wx + b. You'll wire it up as a matrix multiply, visualise the whole forward pass, and set up the training lesson where you'll fit it two ways — closed form and gradient descent — and watch them disagree about which one should have won.
- [01] Claude Shannon · Bell System Technical Journal, 1948 — the paper that invented entropy
- [02] Zhang, Lipton, Li, Smola · d2l.ai
- [03] Müller, Kornblith, Hinton · NeurIPS 2019