What is “Mixture of Experts” (MoE)?

Mixture of Experts (MoE) is a neural-network architecture in which the model is composed of many specialised sub-networks (“experts”) plus a routing network that decides which experts handle each input token. Only a fraction of the total parameters is active for any given token, which allows much larger overall model capacity at a similar per-token compute cost. MoE is the architecture behind Mixtral 8x7B, GPT-4 (rumoured), DeepSeek-V3 and many 2024-2025 frontier models.

MoE mechanics

  • Experts: typically from 8 up to a few hundred separate feed-forward sub-networks per MoE layer.
  • Router (gating network): a small neural network that scores experts for each token and routes to the top-K (often 2).
  • Sparse activation: only K of N experts compute per token; total params high, compute per token bounded.
  • Load balancing: auxiliary losses encourage (but do not guarantee) roughly equal use of the experts during training.
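
The mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names (route_token, moe_layer, load_balance_loss) are hypothetical, the router logits are supplied by hand rather than computed by a learned network, and the balancing term follows the general shape of the Switch Transformer auxiliary loss (number of experts times the sum over experts of routed fraction times mean router probability), which is one of several variants in use.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (selected, probs): selected is a list of (expert_index, weight)
    with weights renormalised to sum to 1; probs is the full softmax.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top], probs

def moe_layer(token, router_logits, experts, k=2):
    """Sparse activation: only the k selected experts are evaluated."""
    selected, _ = route_token(router_logits, k)
    return sum(w * experts[i](token) for i, w in selected)

def load_balance_loss(all_logits, k):
    """Auxiliary loss sketch: N * sum_i(f_i * P_i).

    f_i is the fraction of routing slots assigned to expert i and P_i is
    the mean router probability for expert i (assumed formulation).
    """
    n_experts = len(all_logits[0])
    n_tokens = len(all_logits)
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in all_logits:
        selected, probs = route_token(logits, k)
        for i, _ in selected:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (n_tokens * k) for c in counts]
    p = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy experts: each one just scales its input (stand-ins for feed-forward nets).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output = moe_layer(1.0, [2.0, 1.0, 0.0, -1.0], experts, k=2)
```

With this formulation the loss is about 1.0 when tokens spread evenly across experts and approaches the number of experts when every token is routed to the same one, so minimising it pushes the router toward balanced use.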

MoE vs. dense models

  • Dense (e.g., Llama 3 70B): every parameter is active for every token, so compute per token scales with total model size.
  • MoE (e.g., Mixtral 8x7B has ~47B params but only ~13B active per token): larger total model, with per-token compute closer to that of a ~13B dense model; trades memory for compute.
  • Quality: a well-tuned MoE typically matches or beats a dense model with a comparable active-parameter count.
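
The gap between total and active parameters can be checked with rough arithmetic. The sketch below is an estimate under stated assumptions: the helper moe_param_counts is hypothetical, it ignores small terms such as normalisation weights, and the dimensions passed in are assumed from Mixtral 8x7B's publicly released configuration (they are not given in this article and should be verified against the model card).

```python
def moe_param_counts(hidden, ffn, layers, n_experts, top_k, vocab, kv_dim):
    """Approximate (total, active-per-token) parameter counts for an MoE transformer."""
    expert = 3 * hidden * ffn                         # gate, up and down projections
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus smaller K and V
    router = hidden * n_experts                       # one linear gate per layer
    embed = 2 * vocab * hidden                        # input embedding + output head
    shared = layers * (attn + router) + embed         # used by every token
    total = shared + layers * n_experts * expert      # all experts held in memory
    active = shared + layers * top_k * expert         # only top_k experts run per token
    return total, active

# Assumed Mixtral 8x7B dimensions (verify before relying on them).
total, active = moe_param_counts(
    hidden=4096, ffn=14336, layers=32,
    n_experts=8, top_k=2, vocab=32000, kv_dim=1024,
)
print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Under these assumptions the estimate lands close to the ~47B total and ~13B active figures quoted above. It also shows why the total is not 8 x 7B = 56B: the attention and embedding weights are shared by all experts, and only the feed-forward blocks are replicated.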

MoE deployment considerations

  • Memory footprint: all experts must be in memory even if only a few activate per token; raises hosting cost.
  • Routing instability: early training can produce degenerate routing where all tokens go to one expert.
  • Inference throughput: harder to batch efficiently than dense models, because tokens in the same batch are routed to different experts and per-expert workloads end up uneven.
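  • Compliance documentation: technical documentation prepared under the EU AI Act or an ISO/IEC 42001 management system is generally expected to describe the model architecture; for an MoE, that means recording the topology (number of experts, top-K, total vs. active parameters).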
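
The memory-footprint point can be made concrete with a back-of-the-envelope estimate. This is a sketch for weights only: the helper weight_memory_gb is hypothetical, the parameter counts are the approximate Mixtral 8x7B figures quoted earlier, and the bytes-per-parameter values are common precisions chosen for illustration. KV cache, activations and runtime overhead are not included.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed to hold model weights, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 47e9    # all experts must be resident in memory
ACTIVE_PARAMS = 13e9   # weights actually used for one token

# Illustrative precisions (assumed, not from the article).
for name, nbytes in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    resident = weight_memory_gb(TOTAL_PARAMS, nbytes)
    touched = weight_memory_gb(ACTIVE_PARAMS, nbytes)
    print(f"{name}: {resident:.1f} GB resident, {touched:.1f} GB used per token")
```

At 16-bit precision this gives roughly 94 GB of resident weights against about 26 GB touched per token, which is the hosting-cost asymmetry described in the first bullet.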

Do: consider MoE when training compute budget is constrained but inference budget can absorb the memory cost; benchmark against equivalent dense models.
Don’t: assume an MoE with high total parameters is “more capable” than dense models; active-parameter count and training data quality matter more than the headline total.