Glossary

sharding

A distributed training pattern where parameters, gradients, and optimizer states are split across GPUs (and sometimes hosts) so the total memory footprint scales with the cluster, not with each GPU.

Category: Training (also: Compute). Also known as: parameter sharding, ZeRO sharding.

A distributed training pattern that splits model state across GPUs. The constraint it addresses: a 70B-parameter model needs roughly 1.4 TB of memory for parameters, gradients, and optimizer state in mixed precision, and no single GPU has that much. Sharding partitions the state so that each of N GPUs holds 1/N of it, and the cluster as a whole can train models much larger than any individual node could hold.
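The 1/N partitioning can be illustrated with a toy sketch in plain Python. It is not how ZeRO or FSDP are implemented: the names `shard` and `all_gather` are hypothetical helpers, the "state" is an ordinary list standing in for a flattened tensor, and each list entry stands in for one GPU's memory.

```python
def shard(flat_state, num_gpus):
    """Split a flat list into num_gpus contiguous shards of near-equal size."""
    base, extra = divmod(len(flat_state), num_gpus)
    shards, start = [], 0
    for rank in range(num_gpus):
        # The first `extra` ranks take one additional element each.
        size = base + (1 if rank < extra else 0)
        shards.append(flat_state[start:start + size])
        start += size
    return shards


def all_gather(shards):
    """Rebuild the full state by concatenating every rank's shard in rank order."""
    return [value for rank_shard in shards for value in rank_shard]


params = list(range(10))      # stand-in for a flattened parameter tensor
shards = shard(params, 4)     # one shard per (simulated) GPU
restored = all_gather(shards)
```

Each simulated rank stores only its own shard. The full state exists only transiently, when the shards are gathered for a computation that needs it.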

ZeRO (from DeepSpeed) and FSDP (PyTorch’s native variant) are the dominant implementations in 2026. ZeRO defines three stages, and FSDP offers comparable sharding strategies: stage 1 shards optimizer states only, stage 2 adds gradient sharding, stage 3 adds parameter sharding. Each successive stage saves more memory, and stage 3 adds the most communication overhead because parameters must be gathered before they are used.
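A rough way to see what each stage buys is to estimate per-GPU memory for model state. The sketch below assumes the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for half-precision weights, 2 for half-precision gradients, and 12 for optimizer state (fp32 master weights, momentum, and variance). These byte counts are assumptions and can differ by setup, which is one reason totals such as the 1.4 TB figure vary between sources. The function name `per_gpu_state_bytes` is hypothetical, and activations and buffers are ignored.

```python
def per_gpu_state_bytes(num_params, num_gpus, stage,
                        param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate per-GPU bytes of model state at a given sharding stage.

    stage 0: nothing sharded (plain data parallelism)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = num_params * param_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative numbers only: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = per_gpu_state_bytes(7_500_000_000, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Only stage 3 makes per-GPU model state shrink fully in proportion to the number of GPUs. Stages 1 and 2 leave a full copy of the parameters on every device.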

Composability with other distribution strategies matters. Sharding works alongside tensor parallelism, pipeline parallelism, data parallelism, and expert parallelism (for mixture-of-experts models). Megatron-LM and the various open distributed-training frameworks all combine these in different “3D parallelism” or “4D parallelism” recipes appropriate to cluster topology and model architecture.
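One way to picture a 3D recipe is as a grid of ranks whose size is the product of the data-, pipeline-, and tensor-parallel degrees. The sketch below is illustrative only: the rank ordering (tensor-parallel index varying fastest) is an assumption and not the layout of any particular framework, and `rank_to_coords` and `dp_group` are hypothetical helper names. ZeRO/FSDP-style sharding is typically applied across the data-parallel group, which is the set of ranks sharing the same pipeline and tensor coordinates.

```python
def rank_to_coords(rank, dp, pp, tp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with tp varying fastest."""
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def dp_group(rank, dp, pp, tp):
    """Ranks that hold the same model slice as `rank`.

    These ranks differ only in their data-parallel index, so model
    state could be sharded across them.
    """
    _, pp_idx, tp_idx = rank_to_coords(rank, dp, pp, tp)
    return [d * (pp * tp) + pp_idx * tp + tp_idx for d in range(dp)]


# Example: 16 GPUs arranged as 2-way data, 2-way pipeline, 4-way tensor parallel.
coords = [rank_to_coords(r, dp=2, pp=2, tp=4) for r in range(16)]
group = dp_group(5, dp=2, pp=2, tp=4)
```

Placing the tensor-parallel index innermost keeps tensor-parallel peers on adjacent ranks, which is one common way to match the most communication-heavy dimension to the fastest interconnect. Real recipes depend on cluster topology.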
