Glossary

mixture of experts

A model architecture where each token activates only a fraction of total parameters by routing through learned expert subnetworks, decoupling capacity from compute.

Category: Weights (also: Training, Runtime). Also known as: MoE, sparse MoE

A transformer variant where the feed-forward block is replaced by a set of expert subnetworks plus a small router that picks the top-k experts to run for each token. The model has a high total parameter count but activates only a slice of it per forward pass, so capacity and inference FLOPs become separate dimensions. Mixtral 8x7B activates roughly 13B of its 47B parameters per token; DeepSeek-V3 activates 37B of 671B.
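As a quick check on the figures quoted above, the short Python sketch below computes the share of parameters each model activates per token. The dictionary layout and the `active_fraction` helper are illustrative names, not part of any library.

```python
# Parameter counts in billions, copied from the figures quoted in this entry.
MODELS = {
    "Mixtral 8x7B": {"active_b": 13, "total_b": 47},
    "DeepSeek-V3": {"active_b": 37, "total_b": 671},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters that run for a single token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["active_b"], params["total_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

The ratio is only a rough proxy for per-token compute, since it ignores attention cost and routing overhead.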

Mechanically: every token’s hidden state goes through a learned gating function that outputs a softmax over experts; the top-k (often 1 or 2) experts process the token, and their outputs are combined in a sum weighted by the gate values. Training is harder than for dense models because the router has to learn which experts specialize in what, and experts that receive no tokens get no gradient. Auxiliary load-balancing losses prevent collapse into one or two dominant experts.
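The routing step described above can be sketched in plain Python. This is a toy illustration, not any particular model's implementation. The names `route_token`, `load_balance_loss`, and `make_expert` are hypothetical. Renormalising the gate values over the selected experts is an assumption, and so is the loss, which follows one common formulation (number of experts times the sum of dispatch fraction times mean router probability) with top-1 assignments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, router_weights, experts, k=2):
    """Run one token through its top-k experts and mix the outputs.

    hidden: list of floats (the token's hidden state)
    router_weights: one weight vector per expert (a linear gate)
    experts: list of callables mapping a hidden state to a new one
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    # Indices of the k experts with the highest gate probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalise gate values over the selected experts only.
    norm = sum(probs[i] for i in top)
    output = [0.0] * len(hidden)
    for i in top:
        gate = probs[i] / norm
        expert_out = experts[i](hidden)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top


def load_balance_loss(all_probs, assignments, num_experts):
    """Auxiliary loss: minimal (1.0) when routing is uniform, larger when it collapses.

    all_probs: per-token router probability lists
    assignments: per-token index of the chosen (top-1) expert
    """
    n_tokens = len(all_probs)
    dispatch_frac = [assignments.count(i) / n_tokens for i in range(num_experts)]
    mean_prob = [sum(p[i] for p in all_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(dispatch_frac, mean_prob))


def make_expert(scale):
    """Toy expert that just scales its input (stands in for a feed-forward block)."""
    return lambda h: [scale * x for x in h]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
output, chosen = route_token([1.0, 0.0], router_weights, experts, k=2)
print("experts used:", chosen, "output:", output)
```

In a real model the experts are feed-forward networks, routing is batched across tokens, and the auxiliary loss is scaled by a small coefficient before being added to the training loss.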

In open-source AI the architecture matters because it offers a way to push capacity without per-token inference cost growing linearly with parameter count. Mixtral, DeepSeek, Qwen3-MoE, and Llama 4’s preview all use it. Runtimes serving MoE models typically need expert parallelism (different GPUs hold different experts), which makes these models harder to serve at home but more efficient at fleet scale.
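A minimal sketch of the expert-parallelism idea, assuming a simple round-robin placement of experts on devices. The function names and the placement policy are hypothetical, chosen only for illustration. Real runtimes move the grouped tokens with collective communication rather than building Python lists.

```python
from collections import defaultdict


def place_experts(num_experts, num_devices):
    """Assumed policy: expert i lives on device i % num_devices."""
    return {expert: expert % num_devices for expert in range(num_experts)}


def dispatch(token_routes, placement):
    """Group (token_id, expert_id) pairs by the device that holds the expert.

    token_routes: for each token, the list of expert ids the router picked
    placement: mapping from expert id to device id
    """
    per_device = defaultdict(list)
    for token_id, expert_ids in enumerate(token_routes):
        for expert_id in expert_ids:
            per_device[placement[expert_id]].append((token_id, expert_id))
    return dict(per_device)


placement = place_experts(num_experts=8, num_devices=4)
routes = [[0, 5], [1, 4], [5, 6]]  # top-2 expert ids for three tokens
batches = dispatch(routes, placement)
for device, work in sorted(batches.items()):
    print(f"device {device}: {work}")
```

In this toy batch one device receives no tokens at all, which hints at why the scheme pays off mainly with many concurrent requests: a large fleet-scale batch keeps every device's experts busy, while a single home user leaves most of them idle.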

Distinct from dense models, where every parameter activates on every forward pass, and from model parallelism, which splits a dense model across devices rather than routing through learned subnetworks.
