Single Neuron

A weighted sum, a nonlinearity, a prediction.

Easy · ~15 min read · lesson 1 of 6

Picture a doorman outside a club. A line of people walks up, each one wearing a number — their height, their outfit, whether they're on the list, how loudly they're arguing with their friend. The doorman cares about some of those things a lot, some of them a little, and some of them not at all. He adds it all up in his head, factors in how grumpy he happens to be tonight, and lands on one decision: in, or not.

That's a neuron. The whole lesson is that sentence.

You've already built every part of this doorman piece by piece. The weighted sum is linear regression. The squash at the end is the activation. All this lesson does is staple them together, name the result, and show you the one thing a single doorman absolutely cannot do.

Here's the doorman, written down. One line. Every neural network you've ever heard of is this line, a trillion times, wired up.

the neuron — one equation
       ┌──── linear combo ────┐    ┌── nonlinear ──┐

y  =   f (  w₁·x₁  +  w₂·x₂  +  ...  +  w_D·x_D   +   b )

  • xᵢ  — input from the previous layer (or the raw data)
  • wᵢ  — weight on input i — the doorman's opinion of feature i
  • b   — bias — his baseline mood before anyone walks up
  • f   — activation — the yes/no decision rule at the door
  • y   — the single scalar output

Two things live in that equation that the neuron will eventually learn: the weights and the bias. Everything else is fixed.

The weights are opinions. One per feature. A large positive wᵢ means the doorman cares a lot about feature i and likes what high values do. A large negative wᵢ means he cares a lot and dislikes high values. A weight near zero means he doesn't particularly care; that feature could double and he'd shrug. Training is the process of him revising those opinions until his decisions line up with what the labels say they should be.

The bias is his mood. A single scalar that shifts the whole decision up or down regardless of who's in line. Grumpy today? Bias goes down — the bar for getting in rises, and marginal people get turned away. Generous? Bias up — everyone looks a little better than they are. Without a bias, the doorman is forced to say “in” whenever all the inputs happen to be zero, which is not a policy any real establishment would endorse.

Here's the weighted sum worked through with concrete numbers. Each row is one term — input xᵢ times its weight wᵢ — the pre-activation line sums the products plus the bias, and the activation turns z into the final decision (ReLU here).

a neuron in motion
y = f(Σᵢ wᵢxᵢ + b), with f(z) = max(0, z)

  x₁ =  0.80  ·  w₁ =  0.60   →   x₁·w₁ = 0.480
  x₂ = -0.50  ·  w₂ = -0.40   →   x₂·w₂ = 0.200
  x₃ =  1.20  ·  w₃ =  0.90   →   x₃·w₃ = 1.080

pre-activation   z = 0.480 + 0.200 + 1.080 + 0.10 = 1.860   (bias b = 0.10)
activation       y = ReLU(1.860) = max(0, 1.860) = 1.8600
Neuron (personified)
I am a scalar. Not an array, a tensor, a model, or a service. I take a weighted combination of whatever walks up, factor in my mood, and squash the result into a decision. When I'm part of a layer I have coworkers — same inputs, different opinions — and together we are a crew. Individually I am one doorman with one set of preferences and one bad day of the week.

Here's the same doorman drawn the way textbooks draw him. Edges carry the inputs in on the left. The body in the middle does the weighted sum plus the bias, then applies the activation. One output comes out on the right. The edges get thicker as their weighted contribution grows — the features he's paying the most attention to right now.

This is the picture you should have in your head every time you see nn.Linear in PyTorch. Each output unit of that layer is one of these diagrams, running in parallel with the others.
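A quick check of that claim — a snippet of my own, not one of the lesson's three layers. Each row of an nn.Linear weight matrix, paired with one bias entry, is exactly one of these diagrams:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 4)                    # 4 neurons listening to the same 3 inputs
print(layer.weight.shape)                  # torch.Size([4, 3]) — one row per neuron
print(layer.bias.shape)                    # torch.Size([4])   — one mood per neuron

x = torch.randn(3)
full = layer(x)                            # all four doormen decide at once

# Neuron 0 by hand: dot its weight row with the input, add its bias entry.
manual = layer.weight[0] @ x + layer.bias[0]
print(torch.allclose(full[0], manual, atol=1e-6))   # True
```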

the signal flow — why 'synapse' is the right metaphor
edge thickness ∝ |wᵢ · xᵢ|

  price   = 0.80  ·  w = 0.50   →   x·w = 0.40
  reviews = 0.60  ·  w = 0.30   →   x·w = 0.18
  rating  = 0.90  ·  w = 0.80   →   x·w = 0.72

  Σ + b = 1.30 + (−0.40)  →  pre-activation z = 0.900
  sigmoid(0.900) = 0.7109  →  output y = 0.7109

Shrink the doorman down to two inputs and wrap him in a sigmoid, and you get a binary classifier. He says class 1 when w₁x₁ + w₂x₂ + b > 0, class 0 otherwise. Geometrically that condition is a straight line cutting the input plane in two: the decision boundary. Points on one side get waved in. Points on the other side get sent home.
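That condition is small enough to verify in a few lines. As an illustration — weights hand-picked from the AND separator x₁ + x₂ = 1.5, not learned — here is a doorman drawing that line:

```python
w1, w2, b = 1.0, 1.0, -1.5         # the separator x1 + x2 = 1.5, rewritten as weights

def doorman(x1, x2):
    z = w1 * x1 + w2 * x2 + b      # which side of the line is this point on?
    return 1 if z > 0 else 0       # sigmoid(z) > 0.5 exactly when z > 0

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, "->", doorman(*point))
# only (1, 1) clears the line — this single neuron implements AND
```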

Try the three classic datasets: AND, OR, XOR. The game is to find w₁, w₂, and b that put all four dots on the right side of the line. AND is easy — the separator is x₁ + x₂ = 1.5. OR is easy too. Then try XOR.

decision boundary — one neuron draws one line
w₁x₁ + w₂x₂ + b = 0

You will not make XOR work. You can try forever. No combination of w₁, w₂, b exists that puts all four XOR points on the correct side of a line, because the correct regions — (0,0) and (1,1) together, (0,1) and (1,0) together — are not linearly separable. No single straight cut can put two diagonally opposite corners on one side and the other two on the other side. The doorman, to use a technical term, is cooked.
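You don't have to take that on faith. A brute-force sweep — a sketch of my own, outside the lesson's demos — over thousands of (w₁, w₂, b) settings never scores better than 3 out of 4 on XOR:

```python
import itertools

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

vals = [v / 2 for v in range(-8, 9)]       # sweep each knob over -4.0 … 4.0
best = 0
for w1, w2, b in itertools.product(vals, repeat=3):
    correct = sum((w1 * x1 + w2 * x2 + b > 0) == bool(y) for (x1, x2), y in XOR)
    best = max(best, correct)

print(f"best XOR accuracy over {len(vals)**3} doormen: {best}/4")   # 3/4, never 4/4
```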

This is the most famous limitation in the history of neural networks. Marvin Minsky and Seymour Papert pointed it out in their 1969 book Perceptrons, and the entire field went quiet for almost two decades afterward. The fix turns out to be almost insultingly simple: hire a second doorman with different opinions, let them talk to each other through a non-linearity, and have a third one read their notes. A crew of bouncers with different opinions can carve up regions a single line can't. That's the whole reason every modern network has more than one neuron in it, and it is quite literally the point of the next several lessons.
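As a preview of why the crew works, here's one hand-built fix in pure Python — these particular weights are chosen by hand for illustration; a trained network would find its own:

```python
def relu(z):
    return max(0.0, z)

def crew(x1, x2):
    # Two hidden doormen with different opinions about the same inputs...
    h1 = relu(x1 + x2)          # fires when at least one input is high
    h2 = relu(x1 + x2 - 1.0)    # fires only when both are high
    # ...and a third doorman who reads their notes.
    return h1 - 2.0 * h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", crew(x1, x2))
# 0.0, 1.0, 1.0, 0.0 — exactly XOR, from two hidden neurons and one output
```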

XOR (personified)
I separate the odd-parity corners from the even-parity corners. No single line works on me, and one doorman is one line. If you want to classify me, you need a crew. The first layer will learn its own intermediate features — something like “is exactly one input high” — and the output doorman will classify those. I'm the dataset that justified every “deep” network ever built.
Gotchas

Don't forget the bias. A neuron without a bias is a doorman who can't have a bad day — his decision boundary is forced to pass through the origin. That's a weirdly specific constraint, and models that silently lose their bias (via a buggy initialization or a shape mismatch) can train for hours before anyone notices the loss floor is suspiciously high.
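The origin constraint is easy to see numerically. With no bias, the all-zeros input lands exactly on the decision boundary no matter what the weights are — a two-line check, using the same recipe as the pure-Python neuron in this lesson:

```python
import math

def sigmoid_neuron(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-z))

zeros = [0.0, 0.0, 0.0]
w = [0.6, -0.4, 0.9]                       # any weights at all — they multiply zero

print(sigmoid_neuron(zeros, w, 0.0))       # 0.5 — pinned to the boundary, always
print(sigmoid_neuron(zeros, w, -2.0))      # ~0.1192 — the bias moves the default
```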

The activation is a choice, not a detail. Sigmoid, ReLU, tanh, GELU — they all squash, but they squash differently, and the derivative each one hands back to training is different too. Picking the wrong one doesn't crash the network; it quietly kneecaps it. More on this when we build the training loop.
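A small numeric taste of that difference, ahead of the training-loop lesson — the derivative each activation hands back at the same pre-activation z = 3 (a sketch of my own):

```python
import math

z = 3.0                                    # a comfortably positive pre-activation
s = 1.0 / (1.0 + math.exp(-z))             # sigmoid(z) ≈ 0.9526
t = math.tanh(z)
relu_grad = 1.0 if z > 0 else 0.0

print(f"sigmoid'(z) = {s * (1 - s):.4f}")  # 0.0452 — nearly saturated
print(f"tanh'(z)    = {1 - t * t:.4f}")    # 0.0099 — flatter still
print(f"relu'(z)    = {relu_grad:.4f}")    # 1.0000 — full gradient through
```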

“Neuron” vocabulary is slippery. Some papers call the scalar output of an activation a neuron. Others call the whole weight vector (one row of W) a neuron. Others call an entire layer a neuron. “Unit” is a synonym for “neuron” and inherits the same ambiguity. They're all describing the same diagram at different scopes. Check scope before arguing.

Three implementations of the same doorman. By now you know the drill: pure Python so nothing hides, NumPy so it scales, PyTorch so it trains.

layer 1 — pure python · neuron_scratch.py
python
import math

def neuron(x, w, b, activation='sigmoid'):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b   # weighted sum + bias — the doorman's tally
    if activation == 'relu':
        return max(0.0, z)
    if activation == 'sigmoid':
        return 1.0 / (1.0 + math.exp(-z))
    if activation == 'tanh':
        return math.tanh(z)
    return z                                       # linear — no decision rule, just a number

x = [0.8, -0.5, 1.2]
w = [0.6, -0.4, 0.9]
b = 0.1

z = sum(xi * wi for xi, wi in zip(x, w)) + b
print(f"pre-activation z = {z:.4f}")
print(f"output y = {neuron(x, w, b, 'sigmoid'):.4f}")
stdout
pre-activation z = 1.8600
output y = 0.8653

The sum(...) line is the weighted combo from the equation, one multiply at a time. One doorman, one person at the door, one decision. Fine for a sketch. Useless at scale.

Now picture a whole night's line. NumPy lets the doorman score every arrival in one pass — same arithmetic, vectorised, running in compiled C instead of a Python loop.

layer 2 — numpy · neuron_numpy.py
python
import numpy as np

def neuron(x, w, b, activation='sigmoid'):
    z = x @ w + b                                  # one dot product — whole batch, one line
    if activation == 'relu':
        return np.maximum(0, z)
    if activation == 'sigmoid':
        return 1.0 / (1.0 + np.exp(-z))
    if activation == 'tanh':
        return np.tanh(z)
    return z

# Same single neuron — but now we can score a whole batch at once.
X = np.array([
    [0.8, -0.5, 1.2],
    [1.0, 0.5, -0.3],
    [-0.2, 0.7, 0.4],
])
w = np.array([0.6, -0.4, 0.9])
b = 0.1

out = neuron(X, w, b, 'sigmoid')
print("outputs:", np.round(out, 4))   # one scalar per example in the batch
pure python → numpy

  sum(xi * wi for xi, wi in zip(x, w)) + b   ←→   X @ w + b
    one matmul — entire batch, one line

  one example at a time                      ←→   N examples at once
    broadcasting, not a loop

NumPy gives you the forward pass for free. PyTorch gives you the forward pass and a ready-made container that remembers the weights and bias, keeps them on the right device, and — once we get to training — hands back the gradients you need to update them. Same doorman. Fewer moving parts on the page.

layer 3 — pytorch · neuron_pytorch.py
python
import torch
import torch.nn as nn

# nn.Linear(in_features=3, out_features=1) IS one neuron.
# Stacking many outputs (out_features > 1) gives a layer of parallel neurons
# — a crew of doormen with different opinions, listening to the same line.
class OneNeuron(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)           # 3 weights + 1 bias, all learnable

    def forward(self, x):
        z = self.fc(x)                       # w·x + b
        return torch.sigmoid(z)              # one-liner activation

# Copy our hand-picked weights in so the demo matches layers 1 and 2.
model = OneNeuron()
with torch.no_grad():
    model.fc.weight.copy_(torch.tensor([[0.6, -0.4, 0.9]]))
    model.fc.bias.copy_(torch.tensor([0.1]))

X = torch.tensor([[0.8, -0.5, 1.2], [1.0, 0.5, -0.3], [-0.2, 0.7, 0.4]])
print(model(X))
stdout
tensor([[0.8653],
        [0.5572],
        [0.5150]], grad_fn=<SigmoidBackward0>)
numpy → pytorch

  z = X @ w + b           ←→   nn.Linear(3, 1)
    packaged with learnable weights + gradients

  1 / (1 + np.exp(-z))    ←→   torch.sigmoid(z)
    autograd-aware, GPU-aware

  one neuron              ←→   nn.Linear(3, K) — K neurons in parallel
    a layer is a crew — next lesson

Teach one doorman to do AND

In NumPy, initialise w = [0, 0], b = 0. Train the neuron on AND [(0,0,0), (0,1,0), (1,0,0), (1,1,1)] using binary cross-entropy loss and gradient descent at lr=0.5 for 2000 steps. It will converge to something like w ≈ [3, 3], b ≈ -4.5 — a doorman who only says yes when both inputs are high.

Bonus: do the same thing for XOR. Plot the loss curve. It will plateau well above zero — around ln 2 ≈ 0.69 per example — and accuracy will never reach 100%; 75% is the best any single line can score on that dataset. That's the empirical proof of Minsky and Papert's theoretical result, in about twenty lines of code.
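One way the AND exercise can come out — a sketch assuming full-batch gradient descent and the sigmoid-plus-cross-entropy gradient, which conveniently collapses to (p − y):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)       # AND labels

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    p = sigmoid(X @ w + b)                    # forward pass over all four points
    err = p - y                               # dL/dz for sigmoid + binary cross-entropy
    w -= lr * (X.T @ err) / len(y)            # gradient step on the weights
    b -= lr * err.mean()                      # ...and on the bias

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print("w =", np.round(w, 2), " b =", round(b, 2), " accuracy =", acc)
```

The learned weights end up both positive and the bias solidly negative — a doorman who only says yes when both inputs are high.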

What to carry forward. A neuron is a doorman: weighted sum of the features (opinions × inputs), plus a bias (his mood), run through an activation (his decision rule). One scalar in, one scalar out. Geometrically he draws a hyperplane and labels one side “in”. That's why a single neuron cannot solve anything that isn't linearly separable — XOR being the canonical counterexample. The cure is a crew: more doormen, in more layers, with non-linearities between them. Stack the diagram you just saw and you have a neural network. Nothing else is added.

Next up — Backpropagation. You have a doorman who can predict. But he's new on the job, his opinions are random, and his mood is untuned. How does he learn? Gradient descent gives you the update rule — w ← w − α · ∂L/∂w — but nothing so far has told you how to compute that partial derivative for every weight in a network of tens of thousands of them. Backpropagation is the mechanical procedure that does exactly this: it walks backwards through the network and assigns every single weight its share of the blame (or credit) for the final loss. It is the algorithm whose correctness is the single most important fact in modern ML.
