What is “Mixture of Experts” (MoE)?
Mixture of Experts (MoE) is a neural-network architecture in which some layers are built from many specialised sub-networks (“experts”) plus a routing network that decides which experts process each input token. Only a fraction of the total parameters is active for any given token, so the model can have much larger overall capacity at a similar per-token compute cost. MoE is the architecture behind Mixtral 8x7B, GPT-4 (rumoured), DeepSeek-V3 and many 2024-2025 frontier models.
MoE mechanics
- Experts: separate feed-forward sub-networks within each MoE layer, typically from 8 (Mixtral 8x7B) up to a few hundred (DeepSeek-V3 uses 256 routed experts).
- Router (gating network): a small network, often a single linear layer followed by a softmax, that scores the experts for each token and routes the token to the top-K (2 in Mixtral 8x7B; some models use more).
- Sparse activation: only K of N experts compute per token; total params high, compute per token bounded.
- Load balancing: auxiliary losses encourage roughly equal expert usage during training; they do not guarantee it.
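The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the linear router, the ReLU experts, the shapes, and every function name (top_k_route, moe_layer, load_balance_loss) are assumptions made for the example. The balancing term follows the common "fraction of assignments times mean router probability" form; real models differ in the details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def top_k_route(tokens, router_w, k=2):
    """Score experts per token and keep the k best.

    tokens: (T, d) activations; router_w: (d, N) weights of an assumed linear router.
    Returns expert indices (T, k), renormalised gate weights (T, k), probs (T, N).
    """
    probs = softmax(tokens @ router_w)
    top_idx = np.argsort(-probs, axis=-1)[:, :k]
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)
    return top_idx, gates, probs


def moe_layer(tokens, router_w, w_in, w_out, k=2):
    """Sparse MoE feed-forward: each token runs through only its k experts.

    w_in: (N, d, h) and w_out: (N, h, d) hold one ReLU FFN per expert.
    """
    top_idx, gates, probs = top_k_route(tokens, router_w, k)
    out = np.zeros_like(tokens)
    for e in range(router_w.shape[1]):
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # no token picked this expert, so it does no work
        hidden = np.maximum(tokens[token_ids] @ w_in[e], 0.0)
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ w_out[e])
    return out, top_idx, probs


def load_balance_loss(top_idx, probs):
    """N * sum_i f_i * p_i, where f_i is expert i's share of routing
    assignments and p_i is its mean router probability over the batch."""
    n_experts = probs.shape[1]
    f = np.bincount(top_idx.ravel(), minlength=n_experts) / top_idx.size
    p = probs.mean(axis=0)
    return n_experts * float(np.sum(f * p))


# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
T, d, h, N = 16, 8, 32, 4
tokens = rng.normal(size=(T, d))
out, top_idx, probs = moe_layer(
    tokens,
    router_w=rng.normal(size=(d, N)),
    w_in=rng.normal(size=(N, d, h)),
    w_out=rng.normal(size=(N, h, d)),
)
print(out.shape, load_balance_loss(top_idx, probs))
```

In this formulation the loss is 1.0 when routing is perfectly uniform and rises to N when every token is sent to a single expert with full confidence, which is why adding it to the training objective pushes the router toward even usage.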
MoE vs. dense models
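- Dense (e.g., Llama 3 70B): every parameter is used for every token, so per-token compute scales with the full parameter count.
- MoE (e.g., Mixtral 8x7B, with ~47B total parameters but only ~13B active per token): a larger total model at roughly the per-token compute of a ~13B dense model; the price is memory, since all ~47B parameters must still be held.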
- Quality: well-tuned MoE matches or beats dense models of comparable active-parameter count.
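The total-versus-active arithmetic can be made concrete with a back-of-envelope sketch. It assumes a simple two-bucket model in which parameters are either shared by all tokens (attention, embeddings, router) or belong to exactly one expert, and it uses only the rounded ~47B and ~13B figures quoted above with 8 experts and top-2 routing. The function name and the resulting split are illustrative estimates, not published numbers.

```python
def split_shared_and_expert(total_params, active_params, n_experts, top_k):
    """Solve the two-equation model
        total  = shared + n_experts * per_expert
        active = shared + top_k     * per_expert
    for the shared and per-expert parameter counts."""
    per_expert = (total_params - active_params) / (n_experts - top_k)
    shared = active_params - top_k * per_expert
    return shared, per_expert


# Rounded figures from the text; 8 experts with top-2 routing.
shared, per_expert = split_shared_and_expert(47e9, 13e9, n_experts=8, top_k=2)

print(f"shared     ~{shared / 1e9:.1f}B")
print(f"per expert ~{per_expert / 1e9:.1f}B")
print(f"active fraction ~{13e9 / 47e9:.0%}")
```

Under these assumptions each expert accounts for somewhat under 6B parameters and the shared part for under 2B, which is also why an "8x7B" model totals roughly 47B rather than 56B: only the feed-forward experts are replicated, while the rest of the network is shared.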
MoE deployment considerations
- Memory footprint: all experts must be in memory even if only a few activate per token; raises hosting cost.
- Routing instability: early training can produce degenerate routing where all tokens go to one expert.
- Inference throughput: harder to batch efficiently than dense models due to routing variability.
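- Compliance documentation: the EU AI Act's technical-documentation requirements for general-purpose models cover the architecture and number of parameters, and an ISO/IEC 42001 management system expects AI systems to be documented; for an MoE, it is prudent to record the topology (expert count, top-K, total and active parameters).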
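The memory and batching points above can be estimated with a short sketch. It counts weights only (no KV cache, activations, or framework overhead), uses the ~47B total and ~13B active figures quoted earlier, and simulates routing with random expert choices rather than a trained router. The helper names and the byte-per-parameter table are assumptions made for illustration.

```python
import numpy as np

# Storage cost per parameter for a few common formats (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, dtype="bf16"):
    """Weight footprint in GB (1e9 bytes); excludes KV cache and activations."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def expert_load(top_idx, n_experts):
    """Tokens per expert in a batch, plus max/mean skew (1.0 means perfectly even)."""
    counts = np.bincount(np.asarray(top_idx).ravel(), minlength=n_experts)
    return counts, float(counts.max() / counts.mean())


resident = weight_memory_gb(47e9)  # every expert has to be loaded
touched = weight_memory_gb(13e9)   # weights actually read for one token
print(f"resident ~{resident:.0f} GB, read per token ~{touched:.0f} GB (bf16)")

# Simulated top-2 routing for a batch of 64 tokens over 8 experts.
# Random choices stand in for a real router here.
rng = np.random.default_rng(0)
fake_routing = np.array([rng.permutation(8)[:2] for _ in range(64)])
counts, skew = expert_load(fake_routing, n_experts=8)
print(counts, round(skew, 2))
```

At bf16 the example model needs about 94 GB of weights resident even though only about 26 GB are read for any one token. The skew figure shows how unevenly a batch lands on the experts: the busiest expert sets the pace for the layer, which is one reason MoE batches are harder to schedule than dense ones.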
Do: consider MoE when training compute budget is constrained but inference budget can absorb the memory cost; benchmark against equivalent dense models.
Don’t: assume an MoE with high total parameters is “more capable” than a dense model; per-token compute follows the active-parameter count, and quality depends heavily on training data and tuning.