Glossary

memory bandwidth

The rate (GB/s or TB/s) at which an accelerator reads its memory. It sets the ceiling on decode tokens/sec, since each token streams the active weights once.

Category: Silicon. Also: Runtime, Compute. Also known as: bandwidth.

Memory bandwidth is how fast an accelerator can move data between its memory and its compute units, quoted in GB/s or TB/s. It is distinct from memory capacity: capacity decides whether a model fits, bandwidth decides how fast it runs. People who collapse the two into one number buy boxes that hold a model but serve it slowly.

Bandwidth is the binding constraint for decode because generating each token streams the model’s active weights through the compute units once. A 70B dense model at FP16 moves about 140 GB per token (70 billion parameters at 2 bytes each), so a 1.8 TB/s card caps near 13 tokens per second on memory motion alone, while a 250 GB/s unified-memory box caps below 2 tokens per second on the same work. The KV cache is streamed each step too, so long context raises the per-token bytes and drags speed down further.
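The arithmetic above can be sketched in a few lines of Python. This is a rough upper bound under stated assumptions, not a benchmark: the function name `decode_ceiling_tokens_per_s` and its parameters are hypothetical, it assumes every weight and every KV-cache byte is read exactly once per token, and the 20 GB KV-cache figure is an illustrative placeholder rather than a measured value.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, params_billion,
                                bytes_per_param=2.0, kv_cache_gb=0.0):
    """Upper bound on decode tokens/sec from memory traffic alone.

    Assumes each token reads all active weights plus the KV cache once.
    Ignores compute time, batching, and achievable-vs-peak bandwidth.
    """
    if bandwidth_gb_s <= 0 or params_billion <= 0:
        raise ValueError("bandwidth and parameter count must be positive")
    # 1e9 parameters * bytes per parameter = gigabytes of weights.
    weight_gb = params_billion * bytes_per_param
    gb_per_token = weight_gb + kv_cache_gb
    return bandwidth_gb_s / gb_per_token


# 70B dense model at FP16 (2 bytes per parameter), as in the entry.
fast = decode_ceiling_tokens_per_s(1800, 70)   # 1.8 TB/s card
slow = decode_ceiling_tokens_per_s(250, 70)    # 250 GB/s unified-memory box

# Hypothetical long-context case: 20 GB of KV cache streamed per step.
long_ctx = decode_ceiling_tokens_per_s(1800, 70, kv_cache_gb=20)

print(f"1.8 TB/s card:       {fast:.2f} tok/s ceiling")
print(f"250 GB/s box:        {slow:.2f} tok/s ceiling")
print(f"1.8 TB/s + 20 GB KV: {long_ctx:.2f} tok/s ceiling")
```

Real throughput lands below these ceilings, since sustained bandwidth is lower than the quoted peak; the sketch is only meant to show why the ratio of bandwidth to bytes per token sets the limit.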

The 2026 spectrum spans roughly two orders of magnitude: datacenter HBM parts clear 3 to 8 TB/s, workstation GDDR7 cards sit near 1.8 TB/s, Apple and x86 unified-memory parts run 250 to 820 GB/s, and AI-PC laptops sit around 100 to 250 GB/s. Picking by capacity alone, without checking the bandwidth tier, is the most common self-hosting mistake.
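To make the tiers concrete, the sketch below applies the same per-token arithmetic to each bandwidth range quoted above, using the 140 GB per token figure from the 70B FP16 example. The dictionary name `TIERS_GB_S` and the helper `ceiling_range` are hypothetical, and the calculation deliberately ignores whether a 140 GB model would fit in each tier's capacity, which is a separate check.

```python
import math

# Bandwidth ranges in GB/s, taken from the tiers quoted in this entry.
TIERS_GB_S = {
    "datacenter HBM": (3000, 8000),
    "workstation GDDR7": (1800, 1800),
    "unified memory (Apple, x86)": (250, 820),
    "AI-PC laptop": (100, 250),
}

GB_PER_TOKEN = 140.0  # 70B dense model at FP16, weights only


def ceiling_range(tier):
    """Return (low, high) decode ceilings in tokens/sec for a tier."""
    low_bw, high_bw = TIERS_GB_S[tier]
    return low_bw / GB_PER_TOKEN, high_bw / GB_PER_TOKEN


for name in TIERS_GB_S:
    low, high = ceiling_range(name)
    print(f"{name:30s} {low:6.2f} to {high:6.2f} tok/s ceiling")

# Spread between the slowest and fastest quoted figures.
slowest = min(low_bw for low_bw, _ in TIERS_GB_S.values())
fastest = max(high_bw for _, high_bw in TIERS_GB_S.values())
span = fastest / slowest
orders = math.log10(span)
print(f"span: {span:.0f}x, about {orders:.1f} orders of magnitude")
```

The bottom of the laptop tier works out to under one token per second for this model, while the top of the HBM tier is above fifty, which is the gap a capacity-only comparison hides.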

Sources

Memory Bandwidth for Local AI Hardware (2026 Edition), Ahmad Osman
