Sparse activation — the next axis of scale
Why sparse activation lets you scale parameter count without scaling per-token FLOPs.
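A back-of-the-envelope sketch of that decoupling (all sizes here are illustrative, not taken from any particular model): replacing one dense FFN with E experts multiplies parameters by E, but if each token is routed to only k experts, per-token compute grows only by k.

```python
# Illustrative sizes (assumed), not from any real model.
d_model, d_ff = 1024, 4096          # hidden width and FFN width
num_experts, top_k = 8, 2           # MoE configuration

# One FFN expert: an up-projection and a down-projection.
ffn_params = 2 * d_model * d_ff
ffn_flops_per_token = 2 * ffn_params   # roughly 2 FLOPs per weight (mul + add)

dense_params = ffn_params
dense_flops = ffn_flops_per_token

# MoE: parameters grow with the number of experts...
moe_params = num_experts * ffn_params
# ...but each token only passes through top_k of them.
moe_flops = top_k * ffn_flops_per_token

print(moe_params // dense_params)   # 8  -> 8x the parameters
print(moe_flops // dense_flops)     # 2  -> only 2x the per-token FLOPs
```

The ratio of the two numbers is exactly the sparsity: parameters scale with `num_experts`, compute with `top_k`.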
The gating network that picks which experts see each token.
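A minimal top-k gating sketch (shapes and names are assumptions for illustration): a linear router scores every token against every expert, keeps the top_k scores per token, and renormalizes them with a softmax into mixture weights.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, top_k = 4, 8, 4, 2

tokens = rng.normal(size=(num_tokens, d_model))
w_gate = rng.normal(size=(d_model, num_experts))     # router weights (assumed)

logits = tokens @ w_gate                             # (tokens, experts)
top_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of chosen experts
top_logits = np.take_along_axis(logits, top_idx, axis=-1)

# Softmax over only the selected logits -> weights sum to 1 per token.
exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

for t in range(num_tokens):
    print(t, top_idx[t], np.round(weights[t], 3))
```

Each token's output is then the weighted sum of its chosen experts' outputs; experts a token is not routed to never see it, which is what keeps the computation sparse.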
Preventing expert collapse — keep every expert busy.
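One common remedy, in the style popularized by the Switch Transformer, is an auxiliary load-balancing loss: penalize the product of (a) the fraction of tokens dispatched to each expert and (b) the mean router probability that expert receives. The loss bottoms out at 1.0 when both are uniform and grows as routing concentrates. A small sketch under those assumptions:

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, num_experts):
    # router_probs: (tokens, experts) softmax outputs of the router.
    # expert_assignment: (tokens,) chosen expert per token (top-1 routing).
    frac_tokens = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    mean_probs = router_probs.mean(axis=0)
    return num_experts * float(np.dot(frac_tokens, mean_probs))

num_experts = 4
uniform = np.full((8, num_experts), 1 / num_experts)   # perfectly flat router
skewed = np.tile([0.7, 0.1, 0.1, 0.1], (8, 1))         # router favors expert 0
balanced = np.arange(8) % num_experts                  # tokens spread evenly
collapsed = np.zeros(8, dtype=int)                     # every token -> expert 0

bal_loss = load_balance_loss(uniform, balanced, num_experts)   # 1.0, the minimum
col_loss = load_balance_loss(skewed, collapsed, num_experts)   # larger: collapse penalized
print(bal_loss, col_loss)
```

Because the loss is differentiable through `router_probs`, adding it (scaled by a small coefficient) to the task loss pushes the router back toward using every expert.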
Distributing experts across GPUs at training time.
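A toy, single-process simulation of expert parallelism (all names are illustrative; a real system would use collective ops such as all-to-all over actual devices): each "device" owns one expert, tokens are grouped by their routed expert (the dispatch step), processed by the owning expert, and written back to their original positions (the combine step).

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices = 4                       # one expert per device (assumed)
d_model, num_tokens = 8, 16

# Each expert here is just a weight matrix, "resident" on its device.
experts = {d: rng.normal(size=(d_model, d_model)) for d in range(num_devices)}

tokens = rng.normal(size=(num_tokens, d_model))
assignment = rng.integers(0, num_devices, size=num_tokens)  # router output

output = np.empty_like(tokens)
for d, w in experts.items():
    idx = np.nonzero(assignment == d)[0]   # dispatch: tokens bound for device d
    if idx.size:
        output[idx] = tokens[idx] @ w      # compute on the owning "device"
# 'output' now holds every token's expert result, back in original order.
```

The loop stands in for what the network does in parallel: the per-device work is `tokens[idx] @ w`, and the two gathers around it are the dispatch and combine traffic that expert parallelism has to pay for.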