Glossary

CUDA

NVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s.

Category: Silicon (also: Runtime, Training). Also known as: Compute Unified Device Architecture.

The C++ extension, runtime, and library stack that lets developers program NVIDIA GPUs. CUDA has been NVIDIA’s strategic moat for 15+ years: PyTorch, TensorFlow, JAX, and the major inference runtimes all target it as a first-class backend. Tens of thousands of optimized kernels, in libraries such as cuBLAS, cuDNN, FlashAttention, and Transformer Engine, are tuned for it.
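To make the programming model concrete: a CUDA kernel is a function executed by many threads at once, and each thread derives the element it works on from its block and thread coordinates. The sketch below emulates that indexing in plain Python rather than CUDA C++, so it runs without a GPU. The names `vector_add_kernel`, `launch`, and `vector_add` are hypothetical, chosen for illustration, and the sequential loops stand in for work the hardware would run in parallel.

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    # Same index arithmetic a CUDA kernel would write as
    # blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    # Guard: the last block may contain threads past the end of the data.
    if i < len(out):
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    # Emulates a 1-D kernel launch by visiting every (block, thread) pair
    # in turn. On a GPU these invocations would run concurrently.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(a, b, block_dim=4):
    n = len(a)
    out = [0] * n
    # Ceiling division: enough blocks to cover all n elements.
    grid_dim = (n + block_dim - 1) // block_dim
    launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
    return out
```

With five elements and a block size of four, the launch uses two blocks, and the bounds check keeps the three surplus threads in the second block from writing out of range.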

CUDA’s lock-in is one of the most-discussed structural issues in open AI. Training and inference code written for CUDA does not run on AMD, Intel, Apple, or RISC-V silicon without a translation layer or a port to another stack (ROCm’s HIP compatibility, OpenCL, oneAPI, MLX). Each route loses performance or coverage.
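One way to picture the coverage loss is source-to-source translation, the approach HIP takes by mapping CUDA runtime names to HIP equivalents. The toy translator below is a sketch of that idea only, not AMD's actual tooling: the mapping table is a small hand-picked subset, `translate` is a hypothetical helper, and `cudaSomeNewerCall` in the usage example is an invented name standing in for any API the table does not cover.

```python
import re

# Small, illustrative subset of CUDA-to-HIP runtime name mappings.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Matches whole identifiers that begin with "cuda".
CUDA_IDENT = re.compile(r"\bcuda\w+\b")


def translate(source):
    """Rename known CUDA identifiers; report the ones with no mapping."""
    unmapped = set()

    def swap(match):
        name = match.group(0)
        if name in CUDA_TO_HIP:
            return CUDA_TO_HIP[name]
        unmapped.add(name)  # left untouched in the output
        return name

    return CUDA_IDENT.sub(swap, source), sorted(unmapped)


src = "cudaMalloc(&p, n); cudaMemcpy(p, h, n, cudaMemcpyHostToDevice); cudaSomeNewerCall(p);"
translated, missing = translate(src)
```

Here `missing` holds the one identifier the table could not map. That is the coverage gap in miniature: anything outside the translator's table still needs a manual port.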

The open-ecosystem response has two threads. AMD ROCm targets CUDA source compatibility directly. Triton (OpenAI, 2021) provides a higher-level kernel-authoring language that compiles to CUDA, ROCm, or custom backends. Whether either credibly displaces direct CUDA usage remains an open question.

Sources

NVIDIA CUDA Toolkit documentation
