Glossary

unified memory

A single physical memory pool shared by CPU and GPU, so nearly all of its capacity can serve as model memory; used by Apple Silicon, Strix Halo, and DGX Spark.

Category: Silicon. Also: Compute, Runtime. Also known as: unified memory architecture, UMA.

Unified memory puts the CPU and GPU on one physical memory pool instead of giving the GPU its own separate VRAM. For local inference this removes the VRAM capacity ceiling that discrete consumer cards hit: a Mac Studio with 512 GB of unified memory can hold a model that no single consumer GPU can, because most of the pool is addressable as model memory (the OS and other processes still take a share).
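The capacity argument can be sketched as simple arithmetic. The snippet below is a rough illustration, not a sizing tool: the function names, the 400B-parameter model, the 4-bit quantization, and the `usable_fraction` reserve for the OS, KV cache, and activations are all assumptions chosen for the example. The 32 GB figure stands in for a discrete consumer card; 512 GB is the Mac Studio figure cited above.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         capacity_gb: float, usable_fraction: float = 0.75) -> bool:
    """Check whether the weights fit in the share of memory left for the model.

    usable_fraction is an assumed reserve, not a measured value: it leaves
    room for the OS, the KV cache, and activations.
    """
    budget_gb = capacity_gb * usable_fraction
    return weight_footprint_gb(params_billion, bits_per_weight) <= budget_gb


# Hypothetical 400B-parameter model quantized to 4 bits per weight.
print(weight_footprint_gb(400, 4))        # weight size in GB
print(fits(400, 4, capacity_gb=512))      # large unified-memory pool
print(fits(400, 4, capacity_gb=32))       # illustrative discrete-card VRAM
```

Under these assumptions the same model that overflows the 32 GB card by a wide margin sits comfortably inside the 512 GB pool, which is the "can I run this at all" side of the comparison.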

The tradeoff is memory bandwidth. Unified-memory parts use LPDDR5X rather than HBM or GDDR7, so their bandwidth runs lower than a discrete GPU’s, which caps decode tokens/sec even when the model fits comfortably. The result is the classic split: unified memory wins on “can I run this at all,” discrete GPUs win on “how fast does it serve.”
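The bandwidth cap can be estimated the same way. Single-stream decode is typically bandwidth-bound because each generated token reads the active weights once, so a rough upper bound on tokens/sec is bandwidth divided by the size of the active weights. The sketch below assumes that simple model; the function name and both bandwidth figures are hypothetical placeholders, not specifications of any particular part.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode tokens/sec for a single stream.

    Assumes every token streams the active weights from memory exactly once
    and ignores KV-cache reads, compute time, and batching.
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Hypothetical figures: 20 GB of active weights on two memory systems.
active_gb = 20.0
print(decode_ceiling_tps(200.0, active_gb))    # lower-bandwidth pool
print(decode_ceiling_tps(1600.0, active_gb))   # higher-bandwidth pool
```

Real throughput lands below this ceiling, but the ratio between two systems tends to track the ratio of their bandwidths, which is why a model that fits easily can still decode slowly.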

The category spans Apple Silicon (M-series), x86 systems built on AMD’s Strix Halo (Ryzen AI Max), and NVIDIA’s DGX Spark. These machines are a practical way to run large models on a single quiet box, and the reason capacity and bandwidth must be read as two separate specs when comparing them.
