Sigmoid & ReLU
Activation functions, their derivatives, and why ReLU won.
A neural network is a stack of matrix multiplies. Multiply two matrices, you get a matrix. Stack a thousand of them, you get — a matrix. All that depth, all that compute, and mathematically you have built y = Wx + b. The model, to use a technical term, is cooked.
Something has to bend the line between the layers. A tiny scalar function applied elementwise — nothing fancy, just a kink — and suddenly the network can carve out shapes no single layer could. Those kinks are activation functions, and the whole discipline rests on them.
Two of them run the world. Sigmoid is a dimmer — input goes in, output gets smoothly squashed between 0 and 1, never the rails. ReLU is a light switch — off for anything negative, fully on and proportional for anything positive. One is elegant. The other is crude. Guess which one ate the other's lunch.
I'm the bend in the plane. Without me, your thousand-layer network is a hundred-line Excel formula. With me, it can learn anything. I ask very little: just that my derivative stays alive where it matters.
Five activations under one roof. Drag x, read off the output and the derivative. Watch the derivative curves — they are not decoration. They are the single number that decides whether your network trains or sits there doing nothing.
Start with sigmoid — the dimmer. For a decade it was the default activation, and it is still the right choice at the output of a binary classifier, because its range is exactly (0, 1). Which is convenient when you want a probability.
σ(x) = 1 / (1 + e⁻ˣ)

Picture what the formula does. Feed it +10 — the e⁻ˣ collapses to almost zero, output snaps to 1. Feed it −10 — the e⁻ˣ explodes, output snaps to 0. Feed it anything in the middle and you get a smooth ride between the two. That's the dimmer — the knob never quite reaches the rails, but it leans hard toward them in the tails.
The derivative has an unusually pretty form. You can write it as a function of itself:
σ'(x) = σ(x) · (1 − σ(x))
That one-liner is why sigmoid is cheap: if you already computed σ(x) on the forward pass, the derivative is one multiply away. Here is the proof — one line of calculus, because if we don't derive it ourselves we'll keep being surprised by it.
σ(x)  = (1 + e⁻ˣ)⁻¹
σ'(x) = −(1 + e⁻ˣ)⁻² · (−e⁻ˣ)            (chain rule)
      = e⁻ˣ / (1 + e⁻ˣ)²
      = 1/(1 + e⁻ˣ) · e⁻ˣ/(1 + e⁻ˣ)
      = σ(x) · (1 − σ(x))

Plug in x = 0: σ(0) = 0.5, so σ'(0) = 0.25. That is the maximum of the derivative. Every other x gives something smaller. At x = ±5 you're down to 0.0066. At x = ±10, 4.5 × 10⁻⁵. The dimmer has bottomed out — turn it further and nothing moves.
Now stack. In a deep network the effective gradient at layer k is the product of derivatives through layers 1 to k. Every layer contributes a factor of at most 0.25. Twenty layers deep, you are multiplying twenty numbers each ≤ 0.25 together. That is the vanishing gradient problem, and it is not hand-wavy worry — it is arithmetic.
Move the depth slider. By layer 15 the sigmoid curve has dropped below 10⁻⁹. By layer 25, below 10⁻¹⁴ — smaller than the relative precision of a 32-bit float. Those deep layers stop learning because no usable gradient signal reaches them. ReLU's curve stays flat. That is the whole chart. That is why sigmoid-as-hidden-activation fell out of fashion circa 2011.
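The arithmetic takes one loop to check. A minimal sketch (0.25 is sigmoid's best-case factor; pre-activations out in the tails contribute far less):

```python
# Optimistic best case: every sigmoid layer contributes its maximum
# derivative, 0.25. An active ReLU path contributes exactly 1 per layer.
for depth in (5, 15, 20, 25):
    sigmoid_factor = 0.25 ** depth
    relu_factor = 1.0 ** depth
    print(f"depth {depth:2d}: sigmoid ≤ {sigmoid_factor:.1e}, relu = {relu_factor}")
# depth 15: sigmoid ≤ 9.3e-10 — already below 1e-9
# depth 25: sigmoid ≤ 8.9e-16 — while relu is still exactly 1.0
```

Through an active ReLU path the product stays at exactly 1 no matter the depth, which is the flat curve in the chart.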
I was the dominant activation for twenty years. Then Krizhevsky, Sutskever, and Hinton trained an 8-layer ReLU network on ImageNet in 2012, crushed the entire field, and I was effectively retired from hidden layers within 18 months. I am still a fine output layer for binary classification. Everywhere else, please don't call me.
Meet the successor. Mathematically, it is a joke — and that is the point. No e. No chain rule. A light switch.
ReLU(x) = max(0, x)
ReLU'(x) = { 1 if x > 0
           { 0 if x ≤ 0

That is the spec. For x > 0 the switch is on — signal passes through unchanged, gradient passes through unchanged. For x ≤ 0 the switch is off — signal is zero, gradient is zero. No in-between. The dimmer had a smooth ride between 0 and 1; the switch has two states.
Three things fall out of the spec.
- Speed. max(0, x) is a comparison and maybe a zero — one instruction. Compare to 1 / (1 + exp(−x)) — a divide and an exp. On a GPU shoving billions of activations per step, the difference pays for the whole paper.
- No saturation on the positive side. When the switch is on, the derivative is a flat 1. Gradients propagate backward through active neurons without any arithmetic decay. Stack ReLU layers arbitrarily deep — the sigmoid death spiral doesn't happen.
- Sparsity. Roughly half of ReLU's outputs are zero on typical input. Those neurons contribute nothing to that particular forward pass. Think of it as the network deciding, per-input, which neurons get to participate.
Here is the catch, and here is the payoff for the cliffhanger at the end of the last lesson. The question was: one of the two activations has a bad habit of murdering gradients entirely. Sigmoid saturates — its gradient shrinks toward zero in the tails, the dimmer bottoms out. That is bad. ReLU does something worse.
If a ReLU neuron's pre-activation is always negative across the whole training set, it outputs zero on every example, and its derivative is zero on every example. Which means the gradient on its incoming parameters is zero on every example. Which means those parameters never update. Which means it will be stuck in the off position forever. That is a dead ReLU — a light switch broken in the always-off position, unrecoverable, dead weight in the network.
How often does this happen? Depends on initialization. A small, correctly-scaled init barely produces any. A cold, overly-negative bias can kill a quarter of your neurons on the first step and leave them dead for the rest of training. Play with it.
Drag bias μ toward −3. The grid goes dark — most neurons never fire on any example in the batch, their gradient is zero, they will sit at zero forever. Now click Leaky ReLU. Nothing is fully dead anymore — the dim cells still have a small gradient (the 0.1 slope on the negative side), so an unlucky neuron can still claw its way back into usefulness. That is the fix. It is a one-line change in code.
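The widget's experiment also fits in a few lines of NumPy. A toy sketch, with made-up sizes and seed; a neuron counts as alive if it fires on at least one example in the batch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))              # batch of 64 examples, 32 features
W = rng.normal(scale=0.1, size=(32, 128))  # small, sanely scaled weights

for bias in (0.0, -3.0):
    pre = X @ W + bias                     # pre-activations, shape (64, 128)
    alive = (pre > 0).any(axis=0)          # fired on at least one example?
    print(f"bias {bias:+.1f}: {alive.sum()}/128 neurons alive")
# bias +0.0 keeps all 128 alive; bias -3.0 kills essentially every neuron
```

The cold bias pushes every pre-activation's mean to −3 while their spread stays small, so almost nothing ever crosses zero — the dark grid, in three lines of math.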
In practice modern networks use careful initialization (He init, coming in a later lesson) to keep most ReLUs alive, and sometimes swap in Leaky ReLU or GELU when dying is a real concern. But the failure mode is real, and the mitigation is worth knowing.
Sigmoid in hidden layers: almost always wrong today. Use ReLU / Leaky ReLU / GELU. Sigmoid is fine at the output of a binary classifier; it is not fine between layers 3 and 4 of a ResNet.
ReLU at the output: only makes sense if the target is non-negative (e.g. predicting a count). Otherwise the network can never predict a negative value, which is usually not what you want.
Zero-centered inputs: sigmoid outputs live in (0, 1) — the mean of activations is positive. This pushes gradients in one direction and slows training. Tanh fixes that (outputs are mean-zero) but still saturates in the tails. ReLU doesn't care.
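That mean claim is quick to verify. A minimal sketch with standard-normal inputs (the seed and sample size are arbitrary):

```python
import numpy as np

x = np.random.default_rng(1).normal(size=100_000)
sig_mean = (1.0 / (1.0 + np.exp(-x))).mean()  # all outputs live in (0, 1)
tanh_mean = np.tanh(x).mean()                 # outputs live in (-1, 1)
print(f"sigmoid mean ≈ {sig_mean:.3f}")       # ≈ 0.5, never zero-centered
print(f"tanh mean    ≈ {tanh_mean:.3f}")      # ≈ 0.0, zero-centered
```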
You've seen the switch and the dimmer. You've seen the math. Now write both three times — each shorter than the last, and the third one ships with autograd. Pure Python, NumPy, PyTorch. Nobody implements these by hand in production. Knowing what's underneath is the point.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # f'(x) = f(x)(1 - f(x))

def relu(x):
    return x if x > 0 else 0.0  # max(0, x)

def relu_deriv(x):
    return 1.0 if x > 0 else 0.0

print(f"σ(0.5)={sigmoid(0.5):.4f} σ'(0.5)={sigmoid_deriv(0.5):.4f}")
print(f"ReLU(-1.2)={relu(-1.2)} ReLU'(-1.2)={relu_deriv(-1.2)}")

Output:

σ(0.5)=0.6225 σ'(0.5)=0.2350
ReLU(-1.2)=0.0 ReLU'(-1.2)=0.0
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # elementwise on any array

def sigmoid_deriv_from_output(s):
    return s * (1.0 - s)  # pass in the already-computed σ(x)

def relu(x):
    return np.maximum(0.0, x)  # branchless — the trick on CPUs + GPUs

def relu_deriv(x):
    return (x > 0).astype(x.dtype)  # 1 where x>0, 0 elsewhere

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("σ(x)    =", np.round(sigmoid(x), 4))
print("ReLU(x) =", relu(x))

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)
sig = torch.sigmoid(x)  # same as F.sigmoid — elementwise σ(x)
rel = F.relu(x)         # same as x.clamp_min(0)

print("after sigmoid:", torch.round(sig.detach(), decimals=4))
print("after relu:   ", rel.detach())

# Autograd handles the derivatives — no hand-rolled backward required.
loss = rel.sum()
loss.backward()  # gradient of sum(relu(x)) w.r.t. x is relu_deriv(x)
print("grad on relu input:", x.grad)

Output:

after sigmoid: tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])
after relu:    tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
grad on relu input: tensor([0., 0., 0., 1., 1.])
- for each scalar x: math.exp(-x) ←→ np.exp(-x) — vector-in, vector-out; one call replaces the Python loop
- 1.0 if x > 0 else 0.0 ←→ (x > 0).astype(float) — an elementwise comparison yields a boolean mask
- max(0, x) ←→ np.maximum(0, x) — broadcasted max, no branches
- sigmoid(x) = 1/(1+np.exp(-x)) ←→ torch.sigmoid(x) — same math, tracked for autograd, runs on GPU
- np.maximum(0, x) ←→ F.relu(x) or x.clamp_min(0) — canonical PyTorch call; both compile identically
- relu_deriv(x) = (x > 0).astype(float) ←→ loss.backward() — you never write this; autograd traces it from F.relu
Build a tiny network in PyTorch with a single hidden layer of 128 ReLU neurons. Set the bias init to -3.0 (very negative) and the input standard deviation to 0.1 (tiny). Run one forward pass on a batch of 64.
Count how many of the 128 neurons produced a non-zero output on any example. That is your “alive” count. With these settings most will be dead — every one of those switches jammed in the off position for the rest of training. Now swap in nn.LeakyReLU(0.1) and re-run. Every neuron will have at least a small gradient.
Bonus: put the network into a training loop for 200 steps. Plot the alive-count against step. Watch the ReLU neurons stay dead and the Leaky ReLU neurons recover.
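If you want a starting point, here is a sketch of the single-forward-pass version — layer sizes come from the prompt, everything else is a reasonable default, and a neuron counts as alive when its output is nonzero on at least one example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = 0.1 * torch.randn(64, 32)            # batch of 64, tiny input std

layer = nn.Linear(32, 128)               # single hidden layer, 128 neurons
nn.init.constant_(layer.bias, -3.0)      # the deliberately cold bias

for act in (nn.ReLU(), nn.LeakyReLU(0.1)):
    h = act(layer(X))
    alive = (h != 0).any(dim=0).sum().item()  # nonzero on any example?
    print(f"{act.__class__.__name__}: {alive}/128 alive")
```

With these settings ReLU reports 0/128 and LeakyReLU 128/128, because the leaky negative slope leaves every neuron with a nonzero output and gradient. The bonus training loop builds on exactly this counting line.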
What to carry forward. An activation's derivative is a multiplier that backprop applies at every layer — if it's too small or zero, gradients die. Sigmoid is the dimmer: it saturates in the tails, its derivative caps at 0.25, and multiplying twenty numbers ≤ 0.25 together is lethal. ReLU is the light switch: its derivative is a flat 1 when on, which is why it scaled. Its failure mode — the switch stuck off — is real but mostly solved by careful init and occasional Leaky / GELU swaps.
Next up — Softmax. Activations inside the network, done. The last thing a classifier does is take a vector of raw scores (“logits”) and turn them into a probability distribution over classes. That is softmax — the sigmoid's older sibling, generalised to k outputs. It is also the function cross-entropy loss is defined against, which makes it the most load-bearing piece of math you'll meet this section. And it hides a numerical trap that blows up models in production. We'll find it.
- [01] Krizhevsky, Sutskever, Hinton · NeurIPS 2012 — the AlexNet paper
- [02] Nair, Hinton · ICML 2010 — the ReLU paper
- [03] Hendrycks, Gimpel · 2016 — the GELU paper
- [04] Glorot, Bengio · AISTATS 2010 — the Xavier init paper