Glossary
DeepSpeed
Microsoft's open-source training optimization library, originator of the ZeRO sharding technique and a peer to Megatron for distributed transformer training at scale.
Microsoft’s training optimization library. DeepSpeed introduced ZeRO,
the sharding technique that lets data-parallel training cross the
memory boundary of any single GPU by splitting optimizer states,
gradients, and parameters across the data-parallel group.
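As a rough illustration of why this splitting matters, the sketch below estimates per-GPU memory for model states at each ZeRO stage. It assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states, and it ignores activations, buffers, and fragmentation. The function name `zero_bytes_per_gpu` is hypothetical and is not part of DeepSpeed.

```python
def zero_bytes_per_gpu(num_params: float, dp_degree: int, stage: int) -> float:
    """Approximate model-state bytes per GPU under mixed-precision Adam.

    Assumed accounting (not measured): 2 B/param fp16 weights,
    2 B/param fp16 gradients, 12 B/param fp32 optimizer states
    (master weights, momentum, variance).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params

    # Each stage shards one more category across the data-parallel group.
    if stage >= 1:
        optim /= dp_degree   # stage 1: optimizer states
    if stage >= 2:
        grads /= dp_degree   # stage 2: plus gradients
    if stage >= 3:
        params /= dp_degree  # stage 3: plus parameters
    return params + grads + optim


if __name__ == "__main__":
    # Example: a 7.5B-parameter model on a 64-way data-parallel group.
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5B-parameter model drops from about 120 GB of model state per GPU with no sharding to under 2 GB with all three categories sharded across 64 GPUs.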
The library combines easily with PyTorch and with Megatron’s parallelism abstractions; a typical production training stack runs Megatron’s tensor-parallel + pipeline-parallel layout combined with DeepSpeed’s sharding. DeepSpeed-Inference adds a separate inference path with tensor-parallel kernels, though for production inference vLLM and SGLang have largely displaced it.
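On the training side, the PyTorch integration is driven largely by a JSON-style configuration. The sketch below builds a minimal one as a plain Python dict. The helper `make_ds_config` and the specific values are illustrative assumptions, not recommended settings; the keys shown are a small subset of what the configuration accepts.

```python
import json


def make_ds_config(zero_stage: int = 2, micro_batch_per_gpu: int = 4) -> dict:
    """Build a minimal DeepSpeed-style config dict (illustrative values only)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    return {
        # Per-GPU batch size for one forward/backward pass.
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": 1,
        # Mixed precision; assumes hardware with bf16 support.
        "bf16": {"enabled": True},
        # ZeRO stage selects how much model state is sharded.
        "zero_optimization": {"stage": zero_stage},
    }


if __name__ == "__main__":
    print(json.dumps(make_ds_config(), indent=2))
```

A dict like this would typically be handed to `deepspeed.initialize` through its `config` argument together with a PyTorch model and its parameters; how that combines with a Megatron tensor-parallel and pipeline-parallel layout depends on the surrounding stack and is not shown here.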