Sparse activation — the next axis of scale
Why sparse activation lets you scale parameter count without scaling per-token FLOPs.
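A back-of-the-envelope sketch of that decoupling (all sizes here are illustrative, not taken from any particular model): replacing one dense FFN with E experts multiplies parameters by E, but if each token is routed to only k experts, per-token compute grows only by k.

```python
# Illustrative sizes (assumed), not from any real model.
d_model, d_ff = 1024, 4096          # hidden width and FFN width
num_experts, top_k = 8, 2           # MoE configuration

# One FFN expert: an up-projection and a down-projection.
ffn_params = 2 * d_model * d_ff
ffn_flops_per_token = 2 * ffn_params   # roughly 2 FLOPs per weight (mul + add)

dense_params = ffn_params
dense_flops = ffn_flops_per_token

# MoE: parameters grow with the number of experts...
moe_params = num_experts * ffn_params
# ...but each token only passes through top_k of them.
moe_flops = top_k * ffn_flops_per_token

print(moe_params // dense_params)   # 8  -> 8x the parameters
print(moe_flops // dense_flops)     # 2  -> only 2x the per-token FLOPs
```

The ratio of the two numbers is exactly the sparsity: parameters scale with `num_experts`, compute with `top_k`.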
The gating network that picks which experts see each token.
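A minimal top-k gating sketch (shapes and names are assumptions for illustration): a linear router scores every token against every expert, keeps the top_k scores per token, and renormalizes them with a softmax into mixture weights.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, top_k = 4, 8, 4, 2

tokens = rng.normal(size=(num_tokens, d_model))
w_gate = rng.normal(size=(d_model, num_experts))     # router weights (assumed)

logits = tokens @ w_gate                             # (tokens, experts)
top_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of chosen experts
top_logits = np.take_along_axis(logits, top_idx, axis=-1)

# Softmax over only the selected logits -> weights sum to 1 per token.
exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

for t in range(num_tokens):
    print(t, top_idx[t], np.round(weights[t], 3))
```

Each token's output is then the weighted sum of its chosen experts' outputs; experts a token is not routed to never see it, which is what keeps the computation sparse.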
Preventing expert collapse — keep every expert busy.
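One common remedy, in the style popularized by the Switch Transformer, is an auxiliary load-balancing loss: penalize the product of (a) the fraction of tokens dispatched to each expert and (b) the mean router probability that expert receives. The loss bottoms out at 1.0 when both are uniform and grows as routing concentrates. A small sketch under those assumptions:

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, num_experts):
    # router_probs: (tokens, experts) softmax outputs of the router.
    # expert_assignment: (tokens,) chosen expert per token (top-1 routing).
    frac_tokens = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    mean_probs = router_probs.mean(axis=0)
    return num_experts * float(np.dot(frac_tokens, mean_probs))

num_experts = 4
uniform = np.full((8, num_experts), 1 / num_experts)   # perfectly flat router
skewed = np.tile([0.7, 0.1, 0.1, 0.1], (8, 1))         # router favors expert 0
balanced = np.arange(8) % num_experts                  # tokens spread evenly
collapsed = np.zeros(8, dtype=int)                     # every token -> expert 0

bal_loss = load_balance_loss(uniform, balanced, num_experts)   # 1.0, the minimum
col_loss = load_balance_loss(skewed, collapsed, num_experts)   # larger: collapse penalized
print(bal_loss, col_loss)
```

Because the loss is differentiable through `router_probs`, adding it (scaled by a small coefficient) to the task loss pushes the router back toward using every expert.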
Distributing experts across GPUs at training time.
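A toy, single-process simulation of expert parallelism (all names are illustrative; a real system would use collective ops such as all-to-all over actual devices): each "device" owns one expert, tokens are grouped by their routed expert (the dispatch step), processed by the owning expert, and written back to their original positions (the combine step).

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices = 4                       # one expert per device (assumed)
d_model, num_tokens = 8, 16

# Each expert here is just a weight matrix, "resident" on its device.
experts = {d: rng.normal(size=(d_model, d_model)) for d in range(num_devices)}

tokens = rng.normal(size=(num_tokens, d_model))
assignment = rng.integers(0, num_devices, size=num_tokens)  # router output

output = np.empty_like(tokens)
for d, w in experts.items():
    idx = np.nonzero(assignment == d)[0]   # dispatch: tokens bound for device d
    if idx.size:
        output[idx] = tokens[idx] @ w      # compute on the owning "device"
# 'output' now holds every token's expert result, back in original order.
```

The loop stands in for what the network does in parallel: the per-device work is `tokens[idx] @ w`, and the two gathers around it are the dispatch and combine traffic that expert parallelism has to pay for.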