Glossary
RDMA
A networking technique that lets a remote machine read or write local memory without involving the CPU, foundational for high-throughput distributed training over InfiniBand or RoCE.
A data-transfer mechanism where the network adapter writes directly into the remote machine’s memory through DMA, bypassing the kernel and the receiving CPU. latencycomputeThe time from request submission to response completion, broken down for LLMs into time-to-first-token and time-per-output-token, the user-facing speed metric. Open full entry drops by an order of magnitude compared to TCP/IP; throughputcomputeThe rate at which a model produces output tokens, usually quoted as tokens-per-second per GPU or aggregate, the headline number for serving-cost economics. Open full entry approaches the wire speed.
For AI training, RDMA is what makes cross-node collective operations (all-reduce, all-gather) viable. The NCCL library that PyTorch, DeepSpeedtrainingMicrosoft's open-source training optimization library, originator of the ZeRO sharding technique and a peer to Megatron for distributed transformer training at scale. Open full entry , and MegatrontrainingNVIDIA's distributed-training framework for large transformer models, providing the reference implementation of tensor parallelism, pipeline parallelism, and 3D parallelism used in many open and closed training runs. Open full entry use to synchronize gradients depends on RDMA under the hood. Without it, the network would dominate iteration time.
Two implementations matter in practice. InfiniBandcomputeA high-throughput, low-latency network fabric (Mellanox, now NVIDIA) used for inter-node communication in AI training clusters, supporting RDMA for direct GPU-to-GPU transfer across machines. Open full entry -native RDMA is the default on most large training clusters. RoCE (RDMA over Converged Ethernet) runs the same semantics over standard Ethernet hardware, trading some performance for lower cost and easier operations. Most hyperscaler in-house clusters use RoCE; most research-lab clusters use InfiniBandcomputeA high-throughput, low-latency network fabric (Mellanox, now NVIDIA) used for inter-node communication in AI training clusters, supporting RDMA for direct GPU-to-GPU transfer across machines. Open full entry ; the lines are blurring.