Softmax

Turn raw logits into calibrated probabilities.

Easy · ~15 min read · lesson 3 of 6

Your network just spat out a vector of numbers — one per class — and now you have to answer two questions the user actually cares about: which class, and how sure are you? The raw numbers can be anything. Positive, negative, tiny, huge. They don't sum to anything. They aren't probabilities. They're opinions with no units.

Think of it like an exit poll before anyone has normalized the counts. One candidate got +2.1, another -0.8, a third +4.3. The numbers rank the field but mean nothing as percentages. You need a pollster — someone who takes raw opinion scores and turns them into “candidate A: 87%, candidate B: 11%, candidate C: 2%”. A clean distribution that adds up to one, with no negatives and no dishonest rounding.

Softmax is that pollster. It takes a vector of logits, keeps their ranking, and returns a genuine probability distribution — non-negative numbers that sum to exactly one. It is the last operation in practically every classifier you'll ever build.

The formula is small. The behavior is rich. And the one implementation detail that keeps it from detonating in production is your first real taste of numerical stability in this series.

So the first job is to turn arbitrary real numbers into positive ones without disturbing their order. The exponential function e^x does exactly that. It sends every real number to a positive number — big positives become huge, big negatives become tiny, and zero becomes one. Exponentiate first, then divide by the sum. Every output is now positive, and the whole vector sums to one. That's softmax.
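In code the recipe is one exponentiation and one division. A quick sketch in pure Python, using the exit-poll scores from above (illustrative numbers; the numerical-stability refinement comes later in the lesson):

```python
import math

def naive_softmax(z):
    """Exponentiate, then normalise. No stability trick yet."""
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

poll = naive_softmax([2.1, -0.8, 4.3])   # the exit-poll scores
print([round(p, 3) for p in poll])       # [0.099, 0.005, 0.895]
```

Every output is positive and they sum to one; the +4.3 candidate takes almost 90% of the poll.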

softmax — the whole thing
               e^(zᵢ / T)
softmax(z)ᵢ = ─────────────
              Σⱼ e^(zⱼ / T)

Three moving parts. The exponent makes every output positive (because e^x is always positive). The sum in the denominator normalises the whole vector to one — this is the pollster writing “of 100 simulated voters” at the top of the chart. The temperature T controls how peaky the result is. That's it.

Drag the sliders below. Left column is logits, right column is the resulting probabilities. Push one class's logit above the others and watch its bar dominate — but notice the others never quite go to zero. Softmax is smooth, not greedy. It never declares a landslide when the poll hasn't earned one.

softmax explorer
softmax(z)ᵢ = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)

class   logit z   p = softmax(z)
cat       2.20        50.5%
dog       1.70        30.6%
bird      0.80        12.4%
fish     -0.10         5.1%
fox      -1.40         1.4%

drag rows to change logits · sums to 1 by construction · temperature warps the sharpness
argmax = cat · entropy = 1.698 bits · Σp = 1.000
Softmax (personified)
I turn your opinions into a distribution. I'm smooth, differentiable, and order-preserving. Give me a big logit and I'll give that class most of the mass — but I always save a crumb for the losers so gradients can flow back to them. It's an inclusion policy, not a reward for being kind.

One subtle thing worth pointing out before we move on. Exponentials amplify. Small gaps in the logits become big gaps in the probabilities. A candidate ahead by two points in raw opinion can end up with 90% of the poll after softmax runs. That's not a bug — it's the whole reason we use exp instead of some gentler positive function. Confidence in the logits gets translated into confidence in the distribution, loudly.

Now the temperature knob. T divides every logit before the exponential, so T < 1 makes differences bigger (sharper distribution) and T > 1 makes them smaller (flatter). In the limit T → 0 softmax becomes argmax — all mass on the top class. In the limit T → ∞ it becomes uniform — every class equal.

Back to the pollster: T is the conviction dial. Low temperature, the pollster is calling a landslide — 98% for the leader, scraps for the rest. High temperature, the pollster is hedging — “the race is wide open, anyone could win” — and the percentages spread out toward uniform. Same logits. Same ranking. Totally different story about how confident you should be.
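The two limits are easy to check numerically. A minimal sketch (assuming NumPy; the logits are the explorer's values from above):

```python
import numpy as np

def softmax_T(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()               # stability shift (covered later in this lesson)
    e = np.exp(z)
    return e / e.sum()

z = [2.2, 1.7, 0.8, -0.1, -1.4]   # the explorer's logits
for T in (0.05, 1.0, 100.0):
    print(f"T={T:>6}  {np.round(softmax_T(z, T), 3)}")
# low T: nearly all mass on the top class (argmax-like)
# high T: nearly uniform, roughly 0.2 per class
```

Same logits in every row; only the conviction dial moves.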

The plot below tracks entropy as a function of T for a fixed set of logits. Entropy is exactly a measure of uncertainty — it's zero when the distribution is one-hot and log₂(K) when the distribution is uniform over K classes. Slide T; watch the bar chart reshape and the entropy dot trace the curve.

temperature — from one-hot to uniform
fixed logits · T sweeps

class   p
A       54.3%
B       24.4%
C       13.4%
D        5.4%
E        2.4%

regime: typical · entropy = 1.723 bits · H / Hmax = 0.74
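Entropy itself is a three-line computation. A small sketch of the two extremes the plot anchors to: zero bits for a one-hot distribution, log₂(K) bits for a uniform one.

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                           # treat 0·log(0) as 0
    return float(0.0 - (nz * np.log2(nz)).sum())

K = 5
print(entropy_bits(np.eye(K)[0]))           # one-hot: 0.0 bits
print(entropy_bits(np.full(K, 1 / K)))      # uniform: log2(5) ≈ 2.322 bits
```

Every softmax output for K classes lands somewhere between those two values.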

Now the implementation catch. The textbook formula above is numerically unstable. Here's why: in a real language model the final logit vector can contain values like z = 842.3. Compute exp(842.3) in IEEE-754 double precision and you get Infinity. The denominator becomes Infinity. The division is NaN. Your model's prediction is now… nothing.

The fix is a one-line algebraic identity with enormous consequences. Subtract the max of the logit vector from every element before exponentiating. Mathematically you're multiplying top and bottom by exp(-max z), which cancels — the output is identical. Numerically, the largest exponent is now exactly 0, so exp never blows up.

the shift-by-max trick
softmax(z)ᵢ   =   e^(zᵢ)       /   Σⱼ e^(zⱼ)

              =   e^(zᵢ − m)  /   Σⱼ e^(zⱼ − m)      where m = max(z)

All exponents are now ≤ 0. Largest is exp(0) = 1. No overflow, ever.
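The failure and the fix are both a few lines to reproduce. A sketch assuming NumPy (np.errstate only silences the overflow warning the naive version triggers):

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                    # overflows past z ≈ 709 in float64
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - z.max())          # largest exponent is exactly 0
    return e / e.sum()

z = np.array([842.3, 841.1, 839.9])  # LLM-scale logits
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(z))          # [nan nan nan]: inf / inf
print(stable_softmax(z))             # a clean distribution, sums to 1
```

Both functions are the same algebra; only one of them survives contact with float64.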

See it fail and then stop failing. The left column below runs the naive formula, the right column runs the shift-by-max version. Crank the offset slider past 700 and the naive column collapses into NaNs; the stable column stays serene.

numerical stability — naive vs stable softmax
the 'subtract max' trick

naive: softmax(z) = exp(z) / Σ exp(z)
   z      exp(z)    p
   2.0    7.389    54.31%
   1.2    3.320    24.40%
   0.6    1.822    13.39%
  -0.3    0.741     5.45%
  -1.1    0.333     2.45%

stable: exp(z − max(z)) / Σ exp(z − max(z))
   z − max   exp     p
   0.0      1.000   54.31%
  -0.8      0.449   24.40%
  -1.4      0.247   13.39%
  -2.3      0.100    5.45%
  -3.1      0.045    2.45%

max shift is algebraically free · all exponents ≤ 0 · numerically bulletproof
status: naive ok · stable ok
Gotchas

Never compute softmax by exp-then-divide in production code. Always subtract the max first. Every ML library already does this internally (torch.softmax, scipy.special.softmax, tf.nn.softmax all ship the stable version). But if you ever hand-roll it — you'll write the bug.

Softmax + cross-entropy are fused in PyTorch for even better stability. Use nn.CrossEntropyLoss which takes raw logits, not nn.Softmax followed by nn.NLLLoss. Next lesson unpacks why.

Softmax over one class is constant. A single logit gives e^z / e^z = 1 no matter the input, so the output carries no information. If you find yourself applying softmax to a single-logit output (binary classification with one output unit), stop — you want sigmoid instead. Softmax at K=2 is a reparameterization of sigmoid with one redundant parameter: only the difference between the two logits matters.
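The K=2 claim is quick to verify: the first component of softmax over the pair (z, 0) is e^z / (e^z + 1), which is exactly sigmoid(z). A small sketch in pure Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax_pair(z):
    """First component of softmax([z, 0]): only the logit gap matters."""
    m = max(z, 0.0)                          # shift-by-max, as always
    ez, e0 = math.exp(z - m), math.exp(0.0 - m)
    return ez / (ez + e0)

for z in (-3.0, 0.0, 1.5, 30.0):
    assert abs(sigmoid(z) - softmax_pair(z)) < 1e-12
print("softmax([z, 0]) matches sigmoid(z)")
```

Pinning one logit to 0 removes the redundant parameter, and sigmoid is what remains.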

Three layers, three implementations of the same function. You've seen the pollster work by hand; now write it in pure Python, then NumPy, then PyTorch. Same function, progressively less of it visible.

softmax_scratch.py
import math

def softmax(z, temperature=1.0):
    z = [v / temperature for v in z]
    m = max(z)                                  # the stability trick
    exps = [math.exp(v - m) for v in z]         # all exponents ≤ 0
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.2, 0.3, -0.8, -2.0])
print("probs=", [round(p, 4) for p in probs])
print("sum=", round(sum(probs), 4))
stdout
probs= [0.5844, 0.2626, 0.1068, 0.0355, 0.0107]
sum= 1.0
pure python → numpy

m = max(z); exps = [math.exp(v - m) for v in z]
    ←→  np.exp(z - np.max(z, axis=-1, keepdims=True))
broadcasting along the class axis — works for batches for free

s = sum(exps); return [e / s for e in exps]
    ←→  exps / exps.sum(axis=-1, keepdims=True)
vectorised normalisation, no loops
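Put together, the NumPy version is a few lines and handles a whole batch at once. A sketch, assuming the class axis is the last one:

```python
import numpy as np

def softmax(z, axis=-1):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=axis, keepdims=True)   # per-row stability shift
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

batch = np.array([[2.0, 1.2, 0.3],
                  [0.0, 0.0, 0.0],            # uniform row
                  [900.0, 899.0, 898.0]])     # would overflow naively
p = softmax(batch)
print(np.round(p, 3))
print(p.sum(axis=-1))                         # every row sums to 1
```

The keepdims=True arguments are what make the max and sum broadcast back against the batch, so one function covers a single vector and a stack of them.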

Build GPT-style temperature sampling

Take a vector of logits, apply temperature-softmax, and sample tokens. Tweak T below and re-run: low temperature collapses onto the argmax, high temperature flattens the distribution toward uniform. This is the sampling loop at the heart of every LLM deployment in existence.

Bonus: uncomment the top-k block — zero out every probability except the top k, renormalise, then sample. Watch the diversity collapse.

starter · temperature_sampling.py
import numpy as np

# A tiny vocabulary so the output is readable.
vocab   = ["the", "a", "cat", "dog", "sat", "ran", "slept", "<eos>"]
logits  = np.array([3.0, 2.5, 1.8, 1.2, 0.9, 0.6, 0.2, -0.4])

def softmax_T(logits, T):
    z = logits / T
    z = z - z.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample(probs, n, rng):
    return rng.choice(len(probs), size=n, p=probs)

rng = np.random.default_rng(0)

for T in (0.3, 1.0, 2.0):
    probs  = softmax_T(logits, T)
    draws  = sample(probs, n=200, rng=rng)
    counts = np.bincount(draws, minlength=len(vocab))
    print(f"T = {T:>3}    {dict(zip(vocab, counts))}")

# Uncomment to try top-k sampling:
# k = 3
# probs = softmax_T(logits, T=1.0)
# top   = np.argsort(probs)[-k:]               # indices of the k largest probs
# mask  = np.zeros_like(probs); mask[top] = probs[top]
# probs = mask / mask.sum()                    # renormalise over the survivors
# draws = sample(probs, n=200, rng=rng)
# print("top-k:", dict(zip(vocab, np.bincount(draws, minlength=len(vocab)))))

What to carry forward. Softmax is the pollster — exponentiate to force positivity, normalise to make the percentages sum to one, and you have a real probability distribution. Temperature is the pollster's conviction dial, from landslide to wide-open race. Never implement softmax without the shift-by-max trick unless you enjoy NaN. And in PyTorch the three names you'll reach for are F.softmax (get probabilities), F.log_softmax (better for losses), and nn.CrossEntropyLoss (the fused, production-safe combo).

Next up — Cross-Entropy Loss. The pollster hands you a distribution. Fine. But how wrong is it? You need a single number that's small when the model puts most of its mass on the correct class and large when it confidently picks the wrong one — a loss you can actually minimize. That's cross-entropy, and it's the piece that lets you grade the pollster's work and send gradients back to every parameter in the network.
