Glossary

expert parallelism

A parallelism strategy for mixture-of-experts models where different GPUs hold different experts; requires all-to-all communication on every token routing step.

Category: Runtime. Also: Training, Infrastructure. Aka: expert-parallelism, ep, moe parallelism.

The parallelism strategy that mixture-of-experts (MoE) models use to spread their experts across multiple GPUs. Each GPU holds a subset of the experts; on every forward pass, a router assigns each token to the experts that should process it, and the GPUs perform an all-to-all collective to send each token to the GPU that holds its assigned experts.
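The dispatch step described above can be sketched in plain Python. This is a toy simulation under stated assumptions, not how any particular framework implements it: the GPU count, the expert count, the contiguous expert-to-GPU mapping, and the `toy_router` function (a stand-in for a learned router, picking one expert per token) are all hypothetical, and the all-to-all is simulated by transposing a nested list rather than calling a real collective.

```python
NUM_GPUS = 4      # hypothetical expert-parallel group size
NUM_EXPERTS = 8   # hypothetical expert count, so 2 experts per GPU


def owner_gpu(expert_id: int) -> int:
    """Map an expert to the GPU that holds it (contiguous blocks)."""
    experts_per_gpu = NUM_EXPERTS // NUM_GPUS
    return expert_id // experts_per_gpu


def toy_router(token: int) -> int:
    """Stand-in for a learned router: deterministically picks one expert."""
    return (token * 7 + 3) % NUM_EXPERTS


def build_send_buffers(local_tokens):
    """On one GPU, bucket its local tokens by destination GPU."""
    buffers = [[] for _ in range(NUM_GPUS)]
    for token in local_tokens:
        expert = toy_router(token)
        buffers[owner_gpu(expert)].append((token, expert))
    return buffers


def all_to_all(send):
    """Simulated all-to-all: recv[dst][src] is what src sent to dst."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


# Each GPU starts with its own slice of the batch (tokens are just ints here).
tokens_per_gpu = [list(range(g * 4, g * 4 + 4)) for g in range(NUM_GPUS)]
send = [build_send_buffers(tokens) for tokens in tokens_per_gpu]
recv = all_to_all(send)

for gpu, inbox in enumerate(recv):
    received = [pair for chunk in inbox for pair in chunk]
    print(f"GPU {gpu} runs its experts on (token, expert) pairs: {received}")
```

After the exchange, every GPU holds only tokens destined for experts it owns. A real MoE layer then runs those experts and performs a second all-to-all in the opposite direction to return the outputs to the GPUs the tokens came from.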

Expert parallelism’s strength is that it makes large MoE models servable on a fleet. DeepSeek V3 has 671B total parameters, with 256 routed experts in each MoE layer, and cannot fit on a single GPU. Spread across an 8-GPU node with expert parallelism, each GPU holds 32 of the 256 experts in each layer (roughly 84B parameters if the total is split evenly), and the all-to-all shuffles tokens to the right experts at each step. Even at one byte per parameter, 84B parameters occupy about 84 GB, which is more than an 80 GB H100 holds, so in practice a full DeepSeek V3 deployment needs higher-memory GPUs or more than one 8xH100 node. Smaller MoE models such as Mixtral 8x7B benefit similarly.
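The per-GPU figures above come from simple division, which the following sketch makes explicit. The parameter count, expert count, and GPU count are taken from the text; splitting the total parameter count evenly across GPUs is a simplification (it ignores that attention and other non-expert weights are not sharded by expert parallelism), and the bytes-per-parameter values are the usual sizes for FP8 and BF16.

```python
TOTAL_PARAMS = 671e9      # DeepSeek V3 total parameters (from the text)
EXPERTS_PER_LAYER = 256   # experts to distribute (from the text)
NUM_GPUS = 8              # one 8-GPU node (from the text)

experts_per_gpu = EXPERTS_PER_LAYER // NUM_GPUS

# Simplifying assumption: every parameter is split evenly across the GPUs.
params_per_gpu = TOTAL_PARAMS / NUM_GPUS


def weight_gigabytes(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"experts per GPU: {experts_per_gpu}")
print(f"parameters per GPU: {params_per_gpu / 1e9:.1f}B")
for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    gb = weight_gigabytes(params_per_gpu, bytes_per_param)
    print(f"{name} weights per GPU: {gb:.1f} GB")
```

The weight footprint is only a lower bound on memory use: the KV cache and activations need room as well, so it is worth running this arithmetic against the target GPU's memory before choosing a node size.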

The weakness is interconnect: all-to-all traffic is denser than tensor parallelism’s all-reduce because every GPU sends a different set of tokens to every other GPU on every routing step. On nodes without NVLink or NVSwitch, the all-to-all can become the bottleneck, and expert parallelism can be slower than other strategies. DeepSeek V3’s published recipe explicitly trades off expert parallelism, tensor parallelism, and pipeline parallelism depending on the interconnect topology. Production serving engines (vLLM, SGLang, TensorRT-LLM) implement expert parallelism as one of several parallelism modes that the operator picks at deployment.
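A back-of-the-envelope model shows why link bandwidth dominates. The sketch below is a rough estimate under loud assumptions, not a benchmark: it assumes routing is uniform across GPUs, counts one dispatch and one combine all-to-all per MoE layer, ignores latency and any overlap with compute, and uses two hypothetical bandwidth figures that differ by the order of magnitude the text attributes to NVLink versus PCIe. The batch shape is illustrative as well.

```python
def all_to_all_bytes_per_gpu(tokens, top_k, hidden, bytes_per_elem, num_gpus):
    """Bytes one GPU sends per MoE layer for dispatch plus combine.

    Assumes uniform routing, so (num_gpus - 1) / num_gpus of the
    token copies are destined for experts on other GPUs.
    """
    remote_fraction = (num_gpus - 1) / num_gpus
    one_way = tokens * top_k * hidden * bytes_per_elem * remote_fraction
    return 2 * one_way  # dispatch to the experts, then combine back


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Pure bandwidth term; latency and compute overlap are ignored."""
    return num_bytes / bandwidth_bytes_per_s


# Illustrative batch shape (assumptions, not measured values).
traffic = all_to_all_bytes_per_gpu(
    tokens=4096, top_k=8, hidden=7168, bytes_per_elem=2, num_gpus=8
)

FAST_LINK = 450e9  # hypothetical per-GPU bandwidth, bytes/s
SLOW_LINK = 45e9   # hypothetical link that is 10x slower

fast = transfer_seconds(traffic, FAST_LINK)
slow = transfer_seconds(traffic, SLOW_LINK)
print(f"traffic per GPU per MoE layer: {traffic / 1e6:.0f} MB")
print(f"fast link: {fast * 1e3:.2f} ms, slow link: {slow * 1e3:.2f} ms")
```

Because the cost recurs in every MoE layer of every forward pass, a tenfold difference in link bandwidth translates directly into a tenfold difference in time spent on the exchange under this model.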
