Sigmoid & ReLU

Activation functions, their derivatives, and why ReLU won.

Easy
~15 min read
lesson 2 of 6

A neural network is a stack of matrix multiplies. Multiply two matrices, you get a matrix. Stack a thousand of them, you get — a matrix. All that depth, all that compute, and mathematically you have built y = Wx + b. The model, to use a technical term, is cooked.

Something has to bend the line between the layers. A tiny scalar function applied elementwise — nothing fancy, just a kink — and suddenly the network can carve out shapes no single layer could. Those kinks are activation functions, and the whole discipline rests on them.

Two of them run the world. Sigmoid is a dimmer — input goes in, output gets smoothly squashed between 0 and 1, never the rails. ReLU is a light switch — off for anything negative, fully on and proportional for anything positive. One is elegant. The other is crude. Guess which one ate the other's lunch.

Activation function (personified)
I'm the bend in the plane. Without me, your thousand-layer network is a hundred-line Excel formula. With me, it can learn anything. I ask very little: just that my derivative stays alive where it matters.

Five activations under one roof. Drag x, read off the output and the derivative. Watch the derivative curves — they are not decoration. They are the single number that decides whether your network trains or sits there doing nothing.

activation playground
σ(x) = 1 / (1 + e⁻ˣ)   ·   σ'(x) = σ(x)·(1 − σ(x))
Smooth. Squashed. Saturates — derivative vanishes in the tails.

Start with sigmoid — the dimmer. For a decade it was the default activation, and it is still the right choice at the output of a binary classifier, because its range is exactly (0, 1). Which is convenient when you want a probability.

sigmoid — the classic squash
σ(x)   =    1
         ───────────
          1 + e⁻ˣ

Picture what the formula does. Feed it +10 — the e⁻ˣ collapses to almost zero, output snaps to 1. Feed it −10 — the e⁻ˣ explodes, output snaps to 0. Feed it anything in the middle and you get a smooth ride between the two. That's the dimmer — the knob never quite reaches the rails, but it leans hard toward them in the tails.

The derivative has an unusually pretty form. You can write it as a function of itself:

derivative, in closed form
σ'(x)  =  σ(x) · (1 − σ(x))

That one-liner is why sigmoid is cheap: if you already computed σ(x) on the forward pass, the derivative is one multiply away. Here is the proof — one line of calculus, because if we don't derive it ourselves we'll keep being surprised by it.

a very short proof
σ(x)   =  (1 + e⁻ˣ)⁻¹

σ'(x)  =  −(1 + e⁻ˣ)⁻² · (−e⁻ˣ)                    chain rule

       =   e⁻ˣ / (1 + e⁻ˣ)²

       =   1/(1 + e⁻ˣ)  ·  e⁻ˣ/(1 + e⁻ˣ)

       =   σ(x)   ·   (1 − σ(x))

Plug x = 0: σ(0) = 0.5, so σ'(0) = 0.25. That is the maximum of the derivative. Every other x gives something smaller. At x = ±5 you're down to 0.0066. At x = ±10, 4.5 × 10⁻⁵. The dimmer has bottomed out — turn it further and nothing moves.
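Those readouts take only a few lines to reproduce (plain Python; the formula comes straight from the proof above):

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (0.0, 5.0, 10.0):
    d = sig(x) * (1.0 - sig(x))   # σ'(x) = σ(x)·(1 − σ(x))
    print(f"derivative at x={x:g}: {d:.2e}")
```

Output: 2.50e-01, 6.65e-03, 4.54e-05 — the 0.25 maximum, then the collapsing tails.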

Now stack. In a deep network the gradient reaching layer k is a product of the activation derivatives of every layer between it and the loss. Every sigmoid layer contributes a factor of at most 0.25. Twenty layers deep, you are multiplying twenty numbers each ≤ 0.25 together. That is the vanishing gradient problem, and it is not hand-wavy worry — it is arithmetic.
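The arithmetic fits in two lines. Note that 0.25 is the best case; for roughly unit-normal pre-activations the average sigmoid derivative is closer to 0.2:

```python
print(0.25 ** 20)   # best case, every neuron at x = 0: ≈ 9.1e-13
print(0.2 ** 20)    # more typical: ≈ 1.0e-14
```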

saturation — expected gradient magnitude through k layers
log₁₀ scale · signal ~ N(0, 1) · readout at depth k = 20:
sigmoid ≈ 2.0e-14 · tanh ≈ 4.4e-5 · ReLU ≈ 9.5e-7

Move the depth slider. By layer 15 the sigmoid curve has dropped below 10⁻⁹. By layer 25, below 10⁻¹⁴ — far smaller than float32's machine epsilon, so adding it to a weight of order 1 changes nothing. The early layers stop learning because no usable gradient ever reaches them. ReLU's curve stays orders of magnitude higher. That is the whole chart. That is why sigmoid-as-hidden-activation fell out of fashion circa 2011.

Sigmoid (personified)
I was the dominant activation for twenty years. Then Krizhevsky, Sutskever, and Hinton trained an 8-layer ReLU network on ImageNet in 2012, crushed the entire field, and I was effectively retired from hidden layers within 18 months. I am still a fine output layer for binary classification. Everywhere else, please don't call me.

Meet the successor. Mathematically, it is a joke — and that is the point. No e. No chain rule. A light switch.

ReLU — rectified linear unit
ReLU(x)   =   max(0, x)

ReLU'(x)  =   { 1  if x > 0
               { 0  if x ≤ 0

That is the spec. For x > 0 the switch is on — signal passes through unchanged, gradient passes through unchanged. For x ≤ 0 the switch is off — signal is zero, gradient is zero. No in-between. The dimmer had a smooth ride between 0 and 1; the switch has two states.

Three things fall out of the spec.

  • Speed. max(0, x) is a comparison and maybe a zero — one instruction. Compare to 1 / (1 + exp(−x)) — a divide and an exp. On a GPU shoving billions of activations per step, the difference pays for the whole paper.
  • No saturation on the positive side. When the switch is on, the derivative is a flat 1. Gradients propagate backward through active neurons without any arithmetic decay. Stack ReLU layers arbitrarily deep — the sigmoid death spiral doesn't happen.
  • Sparsity. Roughly half of ReLU's outputs are zero on typical input. Those neurons contribute nothing to that particular forward pass. Think of it as the network deciding, per-input, which neurons get to participate.
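The sparsity figure is easy to verify. A quick NumPy check, assuming roughly zero-mean pre-activations (which a symmetric weight init gives you):

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.standard_normal(1_000_000)      # pre-activations ~ N(0, 1)
out = np.maximum(0, pre)                  # ReLU

print(f"zeros: {np.mean(out == 0):.1%}")  # ≈ 50%
```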

Here is the catch, and here is the payoff for the cliffhanger at the end of the last lesson. The teaser was: one of the two activations has a bad habit of murdering gradients entirely. Sigmoid saturates — its gradient shrinks toward zero in the tails, the dimmer bottoms out. That is bad. ReLU does something worse.

If a ReLU neuron's pre-activation is always negative across the whole training set, it outputs zero on every example, and its derivative is zero on every example. Which means the gradient on its incoming parameters is zero on every example. Which means those parameters never update. Which means it will be stuck in the off position forever. That is a dead ReLU — a light switch broken in the always-off position, unrecoverable, dead weight in the network.

How often does this happen? Depends on initialization. A small, correctly-scaled init barely produces any. A cold, overly-negative bias can kill a quarter of your neurons on the first step and leave them dead for the rest of training. Play with it.

dead neurons — the ReLU failure mode
12×12 grid of neurons · batch of 64 · alive: 126, dead: 18
legend: active · dead (ReLU — unrecoverable) · asleep (Leaky ReLU — still learning)

Drag bias μ toward −3. The grid goes dark — most neurons never fire on any example in the batch, their gradient is zero, they will sit at zero forever. Now click Leaky ReLU. Nothing is fully dead anymore — the dim cells still have a small gradient (the 0.1 slope on the negative side), so an unlucky neuron can still claw its way back into usefulness. That is the fix. It is a one-line change in code.

In practice modern networks use careful initialization (He init, coming in a later lesson) to keep most ReLUs alive, and sometimes swap in Leaky ReLU or GELU when dying is a real concern. But the failure mode is real, and the mitigation is worth knowing.
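A minimal sketch of that swap in NumPy (using the 0.1 slope from the demo above; PyTorch's nn.LeakyReLU defaults to 0.01):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)   # small slope instead of a hard zero

def leaky_relu_deriv(x, alpha=0.1):
    return np.where(x > 0, 1.0, alpha)     # the gradient never fully dies

x = np.array([-2.0, 0.5])
print(leaky_relu(x), leaky_relu_deriv(x))
```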

Gotchas

Sigmoid in hidden layers: almost always wrong today. Use ReLU / Leaky ReLU / GELU. Sigmoid is fine at the output of a binary classifier; it is not fine between layers 3 and 4 of a ResNet.

ReLU at the output: only makes sense if the target is non-negative (e.g. predicting a count). Otherwise the network can never predict a negative value, which is usually not what you want.

Zero-centered inputs: sigmoid outputs live in (0, 1) — the mean of activations is positive. This pushes gradients in one direction and slows training. Tanh fixes that (outputs are mean-zero) but still saturates in the tails. ReLU doesn't care.

You've seen the switch and the dimmer. You've seen the math. Now write both three times — each shorter than the last, and the third one ships with autograd. Pure Python, NumPy, PyTorch. Nobody implements these by hand in production. Knowing what's underneath is the point.

activations_scratch.py
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # f'(x) = f(x)(1 - f(x))

def relu(x):
    return x if x > 0 else 0.0       # max(0, x)

def relu_deriv(x):
    return 1.0 if x > 0 else 0.0

print(f"σ(0.5)={sigmoid(0.5):.4f}  σ'(0.5)={sigmoid_deriv(0.5):.4f}")
print(f"ReLU(-1.2)={relu(-1.2)}  ReLU'(-1.2)={relu_deriv(-1.2)}")
stdout
σ(0.5)=0.6225  σ'(0.5)=0.2350
ReLU(-1.2)=0.0  ReLU'(-1.2)=0.0
pure python → numpy

math.exp(-x), once per scalar   ←→   np.exp(-x)
vector-in, vector-out — one call replaces the Python loop

1.0 if x > 0 else 0.0   ←→   (x > 0).astype(float)
elementwise comparison — yields a boolean mask

max(0, x)   ←→   np.maximum(0, x)
broadcasted max, no branches
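Applying those three substitutions, the scalar versions above collapse into vectorized one-liners — a sketch, same function names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # f'(x) = f(x)(1 - f(x)), elementwise

def relu(x):
    return np.maximum(0, x)        # broadcasted, no Python branch

def relu_deriv(x):
    return (x > 0).astype(float)   # boolean mask → 0.0 / 1.0

x = np.array([-1.2, 0.0, 0.5])
print(relu(x), relu_deriv(x))
```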

numpy → pytorch

1 / (1 + np.exp(-x))   ←→   torch.sigmoid(x)
same math, tracked for autograd, runs on GPU

np.maximum(0, x)   ←→   F.relu(x) or x.clamp_min(0)
canonical PyTorch call — both compile identically

(x > 0).astype(float)   ←→   loss.backward()
you never write the derivative — autograd traces it from F.relu

Kill every ReLU on purpose

Build a tiny network in PyTorch with a single hidden layer of 128 ReLU neurons. Set the bias init to -3.0 (very negative) and the input standard deviation to 0.1 (tiny). Run one forward pass on a batch of 64.

Count how many of the 128 neurons produced a non-zero output on any example. That is your “alive” count. With these settings most will be dead — every one of those switches jammed in the off position for the rest of training. Now swap in nn.LeakyReLU(0.1) and re-run. Every neuron will have at least a small gradient.

Bonus: put the network into a training loop for 200 steps. Plot the alive-count against step. Watch the ReLU neurons stay dead and the Leaky ReLU neurons recover.
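Before reaching for PyTorch, the core measurement can be sanity-checked with a NumPy stand-in for the single forward pass. Layer width, batch size, bias, and input scale are taken from the exercise; the input width of 32 is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, batch = 32, 128, 64

W = rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in)  # modest init
b = np.full(n_hidden, -3.0)                    # the deliberately cold bias
x = 0.1 * rng.standard_normal((batch, n_in))   # tiny inputs

pre = x @ W + b                                # pre-activations, (64, 128)
out = np.maximum(0, pre)                       # ReLU

alive = int(np.any(out > 0, axis=0).sum())     # fired on at least one example
print(f"alive: {alive} / {n_hidden}")          # with these settings: 0
```

Swapping np.maximum(0, pre) for np.where(pre > 0, pre, 0.1 * pre) makes every neuron "asleep" rather than dead: outputs are tiny but gradients are nonzero, which is exactly what the training-loop bonus will let you watch recover.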

What to carry forward. An activation's derivative is a multiplier that backprop applies at every layer — if it's too small or zero, gradients die. Sigmoid is the dimmer: it saturates in the tails, its derivative caps at 0.25, and multiplying twenty numbers ≤ 0.25 together is lethal. ReLU is the light switch: its derivative is a flat 1 when on, which is why it scaled. Its failure mode — the switch stuck off — is real but mostly solved by careful init and occasional Leaky / GELU swaps.

Next up — Softmax. Activations inside the network, done. The last thing a classifier does is take a vector of raw scores (“logits”) and turn them into a probability distribution over classes. That is softmax — the sigmoid's older sibling, generalised to k outputs. It is also the function cross-entropy loss is defined against, which makes it the most load-bearing piece of math you'll meet this section. And it hides a numerical trap that blows up models in production. We'll find it.
