Glossary

HBM

Stacked DRAM used as the main memory of most modern high-end AI accelerators, with bandwidth in TB/s rather than GB/s and capacity per stack in the tens of GB.

Category: Silicon (also: Compute). Also known as: high bandwidth memory, HBM3, HBM3e

DRAM stacked vertically and connected to the GPU or accelerator die via a silicon interposer, giving wide-bus access to the memory at very high bandwidth. The H100 SXM has 80 GB of HBM3 at 3.35 TB/s; the H200 has 141 GB of HBM3e at 4.8 TB/s; the Blackwell B200 has 192 GB of HBM3e.

HBM is the single component that drives AI accelerator economics more than any other. Model size and KV cache size scale with capacity, throughput scales with bandwidth, and per-GB cost dominates the bill of materials for high-end accelerators. The HBM supply chain (SK hynix, Samsung, Micron) sits in the same critical-path category as advanced lithography.
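The capacity and bandwidth relationships can be made concrete with a rough sizing calculation. The sketch below is illustrative only: it assumes a hypothetical 70B-parameter dense model, a single sequence (batch size 1), and that every weight byte is read from HBM once per generated token. It ignores KV cache reads, compute limits, and batching, so the result is an upper bound rather than a prediction. The function names are invented for this example; the accelerator figures are the ones quoted in this entry.

```python
# Back-of-envelope HBM sizing. Spec figures are the ones quoted in this entry;
# the 70B dense model and the function names are hypothetical.

ACCELERATORS = {
    "H100 SXM": {"capacity_gb": 80, "bandwidth_tb_s": 3.35},
    "H200": {"capacity_gb": 141, "bandwidth_tb_s": 4.8},
}


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes): 1e9 params * bytes / 1e9."""
    return params_billion * bytes_per_param


def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Upper bound on single-sequence decode rate, assuming every weight byte
    is read from HBM once per token (ignores KV cache reads and compute)."""
    return bandwidth_tb_s * 1000.0 / weights_gb


if __name__ == "__main__":
    # Assumption: 1 byte per parameter (an 8-bit weight format).
    weights_gb = weight_footprint_gb(params_billion=70, bytes_per_param=1)
    for name, spec in ACCELERATORS.items():
        fits = weights_gb <= spec["capacity_gb"]
        ceiling = decode_ceiling_tokens_per_s(spec["bandwidth_tb_s"], weights_gb)
        print(f"{name}: fits={fits}, ceiling ~{ceiling:.0f} tokens/s")
```

Under these assumptions the same model at 2 bytes per parameter needs 140 GB, which no longer fits in an 80 GB H100 and leaves almost no room for a KV cache on a 141 GB H200. Real serving systems batch many sequences so that each weight read is shared across requests, which is why aggregate throughput can be far higher than this single-sequence ceiling.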

The architectural consequence: 2026-era models are designed around HBM constraints. Mixture of experts activates only a fraction of the parameters per token, which cuts the weight bytes read from HBM per token even though all experts must still fit in capacity; GQA and MLA shrink the KV cache; FP8 inference halves the bytes per parameter relative to FP16. None of these would exist in the same form without HBM as the gating resource.
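The KV cache savings from sharing key/value heads and from a narrower element format can be estimated with simple arithmetic. The sketch below assumes a hypothetical model shape (80 layers, 64 query heads, head dimension 128, 8192-token context, batch size 1) and a helper named `kv_cache_bytes` invented for this example. It treats an 8-bit KV cache as a separate choice from 8-bit weights, and it does not model MLA, whose low-rank compression does not reduce to a simple head count.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Bytes needed to store keys and values for every cached token."""
    # The factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2**30

# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(layers=80, head_dim=128, seq_len=8192, batch=1)

# Full multi-head attention: one KV head per query head (64), 2-byte elements.
mha_fp16 = kv_cache_bytes(kv_heads=64, bytes_per_elem=2, **SHAPE)
# Grouped-query attention: 8 KV heads shared across the 64 query heads.
gqa_fp16 = kv_cache_bytes(kv_heads=8, bytes_per_elem=2, **SHAPE)
# Same GQA layout with 1-byte cache elements (assumed 8-bit KV storage).
gqa_8bit = kv_cache_bytes(kv_heads=8, bytes_per_elem=1, **SHAPE)

if __name__ == "__main__":
    for label, size in [("MHA, 16-bit", mha_fp16),
                        ("GQA, 16-bit", gqa_fp16),
                        ("GQA, 8-bit", gqa_8bit)]:
        print(f"{label}: {size / GIB:.2f} GiB per sequence")
```

For this shape the cache drops from 20 GiB per sequence with full multi-head attention to 2.5 GiB with eight shared KV heads, and to 1.25 GiB with 1-byte elements. Since the cache grows linearly with both context length and batch size, these factors translate directly into how many concurrent sequences fit in a fixed HBM budget.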
