Swap NumPy for autograd and GPUs
Tensors, autograd, modules — the mental model.
Per-sample normalization for transformers.
The CNN-era normalization trick, fully unpacked.
LayerNorm without mean subtraction — why it works.