Glossary

GPU

A massively parallel processor originally designed for graphics, repurposed since the 2010s as the dominant compute substrate for both training and inference of large neural networks.

Category: Silicon (also: Compute, Runtime). Also known as: graphics processing unit.

A parallel processor with thousands of small cores grouped into streaming multiprocessors, paired with high-bandwidth memory (HBM on current datacenter parts). The design is optimized for dense linear algebra: matrix-multiply units (tensor cores on NVIDIA, matrix cores on AMD) execute thousands of fused multiply-adds per cycle.
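To make the "thousands of fused multiply-adds per cycle" figure concrete, the sketch below counts the FMAs in a dense matrix multiply and divides by a per-cycle throughput. The 4096 FMAs per cycle value and the function names are illustrative assumptions, not the specification of any particular part, and the result is an idealized lower bound that ignores memory stalls and scheduling overhead.

```python
def matmul_fma_count(m: int, n: int, k: int) -> int:
    """FMAs needed for an (m x k) @ (k x n) dense matrix multiply."""
    # Each of the m*n outputs accumulates k multiply-add pairs.
    return m * n * k


def ideal_cycles(m: int, n: int, k: int, fma_per_cycle: int) -> int:
    """Lower bound on cycles if the matrix units were kept fully busy."""
    fmas = matmul_fma_count(m, n, k)
    return -(-fmas // fma_per_cycle)  # ceiling division


# Hypothetical throughput, chosen only for illustration.
ASSUMED_FMA_PER_CYCLE = 4096

fmas = matmul_fma_count(4096, 4096, 4096)
cycles = ideal_cycles(4096, 4096, 4096, ASSUMED_FMA_PER_CYCLE)
print(f"{fmas:,} FMAs, at least {cycles:,} cycles")
```

Counting FMAs rather than FLOPs matches how matrix units are described above; the conventional FLOP count is twice the FMA count.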

GPUs trained most modern frontier models and serve nearly every production inference workload. Per-chip memory capacity (80 to 192 GB on 2026 datacenter parts) and per-chip bandwidth (3 to 8 TB/s) are the constraints that shape model architecture choices such as grouped-query attention (GQA), mixture-of-experts expert sizing, and KV cache layouts.
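A rough sketch of why capacity and bandwidth drive these choices: it sizes a KV cache with and without GQA, then estimates a bandwidth-bound ceiling on decode speed. The model shape (80 layers, 64 query heads, 8 KV heads, head dimension 128, 70 billion parameters, 2-byte elements) is hypothetical, and the throughput formula assumes each decode step reads all weights and the whole KV cache once, which is a simplification.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer and token."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def decode_rate_upper_bound(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling on decode steps per second (simplified model)."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, HEAD_DIM, SEQ_LEN, BATCH = 80, 128, 32768, 1
N_QUERY_HEADS = 64   # full multi-head attention keeps one KV head per query head
N_KV_HEADS_GQA = 8   # GQA shares each KV head across 8 query heads

GIB = 2 ** 30
mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN, BATCH)
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS_GQA, HEAD_DIM, SEQ_LEN, BATCH)
print(f"KV cache without GQA: {mha_bytes / GIB:.0f} GiB")  # 80 GiB
print(f"KV cache with GQA:    {gqa_bytes / GIB:.0f} GiB")  # 10 GiB

# Assumed: 70e9 parameters at 2 bytes each, on a part with 8 TB/s of bandwidth.
weight_bytes = 70e9 * 2
rate = decode_rate_upper_bound(weight_bytes, gqa_bytes, 8e12)
print(f"Decode ceiling: about {rate:.0f} steps/s at batch size 1")
```

Under these assumptions the cache without GQA would fill an 80 GB part on its own at a 32K context, while the GQA variant leaves most of the memory for weights.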

NVIDIA dominates the datacenter GPU market in 2026 with the Hopper (H100/H200) and Blackwell (B100/B200) generations. AMD’s MI300X is the main credible competitor for inference; Apple Silicon is the consumer local-first parallel; the rest of the ecosystem is hyperscaler in-house parts (TPU, Trainium) or startups (Cerebras, Groq, Tenstorrent).

Sources

NVIDIA H100 architecture whitepaper
