Glossary

ROCm

AMD's open-source GPU compute stack, the main credible alternative to CUDA, with growing coverage in PyTorch and vLLM but still trailing on kernel maturity and tooling.

Category: Silicon (also: Runtime). Also known as: Radeon Open Compute.

AMD’s GPU compute stack, intended as the open counterpart to CUDA. The HIP compatibility layer lets CUDA source be ported to ROCm with largely mechanical edits; the rocBLAS and MIOpen libraries cover the dense linear-algebra and convolution kernels; and PyTorch ships a maintained ROCm backend.
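To make "mechanical edits" concrete, here is a minimal sketch of the kind of identifier renaming involved in porting CUDA source to HIP. It is a toy, not AMD's hipify tooling: the function name `toy_hipify` and the small rename table are assumptions chosen for illustration, and real ports also have to deal with library calls, build flags, and hardware-specific tuning.

```python
import re

# Illustrative subset of CUDA -> HIP runtime renames (assumption: a real port
# needs a far larger table plus build-system and library changes).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Match whole identifiers only, longest names first, so that
# cudaMemcpyHostToDevice is never rewritten as a partial cudaMemcpy match.
_names = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in _names) + r")\b")


def toy_hipify(source: str) -> str:
    """Rename known CUDA runtime identifiers to their HIP equivalents."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaFree(d_x);
"""

hip_src = toy_hipify(cuda_src)
print(hip_src)
```

The rename is purely textual, which is why simple host code tends to port cleanly while hand-tuned kernels still need separate attention.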

The 2024 wave of MI300X deployments at major cloud providers materially advanced ROCm’s production-readiness. vLLM, SGLang, and llama.cpp all have working ROCm support in 2026; key kernels (FlashAttention, PagedAttention) have AMD-tuned variants. Inference on MI300X reaches a meaningful fraction of H100 throughput on many models.
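Before running a PyTorch-based engine on an AMD host, it can help to confirm that the installed PyTorch build targets ROCm. The sketch below relies on `torch.version.hip` and `torch.version.cuda`, which PyTorch sets to a version string or `None` depending on the build. The helper names `classify_torch_build` and `detect_backend` are hypothetical, and the check says nothing about whether a given engine's kernels are tuned for the device.

```python
from typing import Optional


def classify_torch_build(hip_version: Optional[str],
                         cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from its reported HIP and CUDA version strings."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


def detect_backend() -> str:
    """Inspect the installed PyTorch, if any, and report which backend it targets."""
    try:
        import torch  # optional dependency; absent on machines without PyTorch
    except ImportError:
        return "torch-not-installed"
    return classify_torch_build(
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )


print(detect_backend())
```

On ROCm builds, PyTorch still exposes devices through the `torch.cuda` namespace, so most model code runs unchanged once this check reports `rocm`.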

Training remains the harder gap. Many distributed-training stacks have CUDA assumptions baked into their networking, scheduling, and kernel code; ROCm coverage is real but uneven. AMD’s bet is that as the open-source AI community grows tired of NVIDIA pricing, the cost of helping mature ROCm becomes worth paying.
