Hardware

Will it fit, and how fast?

Two questions decide whether you can run an open model on a given box. Memory capacity decides what fits. Memory bandwidth decides how fast it runs. They are not the same number, and a box that holds a model can still be too slow to serve it.
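The two questions reduce to two small formulas. The sketch below assumes weights dominate memory (it ignores KV cache and runtime overhead), uses 1 GB = 1e9 bytes, and applies the rule that each decoded token streams the active weights once. The 70B model is hypothetical; the capacity and bandwidth figures come from the table on this page.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes): params x bits / 8."""
    return params_billion * bits_per_weight / 8

def fits(total_params_b: float, bits_per_weight: float, capacity_gb: float) -> bool:
    """Capacity question: do all the weights fit in memory? (KV cache ignored here.)"""
    return weights_gb(total_params_b, bits_per_weight) <= capacity_gb

def decode_ceiling_tok_s(active_params_b: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Bandwidth question: each token streams the active weights once."""
    return bandwidth_gb_s / weights_gb(active_params_b, bits_per_weight)

# Hypothetical dense 70B model at 8 bits per weight, i.e. 70 GB of weights.
print(fits(70, 8, 512), round(decode_ceiling_tok_s(70, 8, 819), 1))  # True 11.7 (Mac Studio, M3 Ultra)
print(fits(70, 8, 24))                                              # False (GeForce RTX 4090)
```

The Mac Studio holds the model with room to spare, yet its ceiling is only about 12 tokens/sec: capacity answered "yes" while bandwidth capped the speed. For a mixture-of-experts model, `fits` would take the total parameter count and `decode_ceiling_tok_s` only the active count.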

Pick a model, a quantization, and a context length, then compare specific boxes, find every box a model fits on, or pick a box and list everything it can run. The explorer shows whether the model fits, a theoretical roofline tokens/sec ceiling with the formula, a realistic range, a measured anchor where one exists, and a plain-language reason that traces back to bandwidth, capacity, and KV cache. The method comes straight from the self-host course.
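A roofline ceiling takes the lower of two rates: bandwidth divided by bytes streamed per token, and compute divided by FLOPs per token. The sketch below is one way to compute it at batch size 1, under stated assumptions: the model shape (80 layers, 8 KV heads, head dim 128) is hypothetical, the KV cache is FP16 (2 bytes per element) and read in full every token, which is the worst case at the end of the context, and compute is approximated as 2 FLOPs per active parameter. These names and defaults are illustrative, not the explorer's actual code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelShape:
    active_params_b: float  # parameters touched per token, in billions
    n_layers: int
    n_kv_heads: int         # KV heads (fewer than query heads under GQA)
    head_dim: int

def kv_cache_gb(m: ModelShape, context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Keys + values (the factor 2) for every layer, KV head, head dim and token."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_tokens * bytes_per_elem / 1e9

def roofline_tok_s(m: ModelShape, bits_per_weight: float, context_tokens: int,
                   bandwidth_gb_s: float, dense_tflops: Optional[float] = None) -> float:
    """Batch-1 decode ceiling: min(memory-bound rate, compute-bound rate)."""
    # Memory side: each token streams the active weights plus the whole KV cache once.
    streamed_gb = m.active_params_b * bits_per_weight / 8 + kv_cache_gb(m, context_tokens)
    memory_bound = bandwidth_gb_s / streamed_gb
    if dense_tflops is None:  # compute figure unverified: fall back to the memory bound
        return memory_bound
    # Compute side: roughly 2 FLOPs (one multiply-add) per active parameter per token.
    compute_bound = dense_tflops * 1e12 / (2 * m.active_params_b * 1e9)
    return min(memory_bound, compute_bound)

# Hypothetical 70B-class shape, 4-bit weights, 32k context, on an H100 SXM5 (3.35 TB/s, 989 TF).
shape = ModelShape(active_params_b=70, n_layers=80, n_kv_heads=8, head_dim=128)
print(round(kv_cache_gb(shape, 32_768), 2))  # 10.74
print(round(roofline_tok_s(shape, 4, 32_768, 3350, 989), 1))
```

At batch size 1 the memory side wins by two orders of magnitude (roughly 73 vs. 7,000 tokens/sec here), which is why bandwidth, not FLOPs, is the decode-speed determinant. The `None` branch mirrors boxes whose compute figure is unverified.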

Theoretical numbers are labeled as estimates and show their inputs; measured anchors carry a date and source. Every hardware spec flows through the same verification gate as /models: an unverified value renders as —, never a guess.

The spectrum

Per single unit. Compute figures are dense (never sparse). Bandwidth is the decode-speed determinant; capacity is the fit determinant.


Hardware	Class	Memory	Bandwidth	FP16 dense	Power	Released
AMD Instinct MI355X	Datacenter	288 GB HBM3e	8.00 TB/s	2517 TF	1400 W	2025-06
NVIDIA B200	Datacenter	180 GB HBM3e	7.70 TB/s	2250 TF	1000 W	2025-01
AMD Instinct MI325X	Datacenter	256 GB HBM3e	6.00 TB/s	1307 TF	1000 W	2024-10
AMD Instinct MI300X	Datacenter	192 GB HBM3	5.30 TB/s	1307 TF	750 W	2023-12
NVIDIA H200 SXM	Datacenter	141 GB HBM3e	4.80 TB/s	989 TF	700 W	2024-01
NVIDIA H100 SXM5	Datacenter	80 GB HBM3	3.35 TB/s	989 TF	700 W	2022-09
NVIDIA H100 PCIe	Datacenter	80 GB HBM2e	2.00 TB/s	756 TF	350 W	2022-09
NVIDIA GeForce RTX 5090	Workstation	32 GB GDDR7	1.79 TB/s	209.5 TF	575 W	2025-01
NVIDIA RTX PRO 6000 Blackwell	Workstation	96 GB GDDR7	1.79 TB/s	—	600 W	2025-03
NVIDIA GeForce RTX 4090	Workstation	24 GB GDDR6X	1.01 TB/s	165.2 TF	450 W	2022-10
AMD Radeon AI PRO R9700	Workstation	32 GB GDDR6	640 GB/s	191 TF	300 W	2025-07
Tenstorrent Blackhole p150a	Workstation	32 GB GDDR6	512 GB/s	—	300 W	2025-08
Apple Mac Studio (M3 Ultra)	Apple unified	512 GB Unified LPDDR5X	819 GB/s	—	270 W	2025-03
Apple MacBook Pro (M5 Max)	Apple unified	128 GB Unified LPDDR5X	614 GB/s	—	40 W	2026-03
Apple Mac Studio (M4 Max)	Apple unified	128 GB Unified LPDDR5X	546 GB/s	—	160 W	2025-03
Apple MacBook Pro (M4 Max)	Apple unified	128 GB Unified LPDDR5X	546 GB/s	—	40 W	2024-10
Apple Mac mini (M4 Pro)	Apple unified	64 GB Unified LPDDR5X	273 GB/s	—	65 W	2024-10
Apple MacBook Air (M5)	Apple unified	32 GB Unified LPDDR5X	153 GB/s	—	20 W	2026-03
NVIDIA DGX Spark (GB10)	x86 unified	128 GB Unified LPDDR5X	273 GB/s	—	140 W	2025-10
AMD Ryzen AI Max+ 395 (Strix Halo)	x86 unified	128 GB Unified LPDDR5X	256 GB/s	—	120 W	2025-01
Qualcomm Snapdragon X2 Elite	AI PC	48 GB LPDDR5X	228 GB/s	—	80 W	2025-09
Intel Core Ultra 200V (Lunar Lake)	AI PC	32 GB LPDDR5X	136 GB/s	—	37 W	2024-09
Qualcomm Snapdragon X Elite	AI PC	64 GB LPDDR5X	135 GB/s	—	80 W	2024-06
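"Find every box a model fits on" amounts to a capacity filter followed by a bandwidth sort. The sketch below uses a hand-copied subset of the rows above, single units only, and treats the full listed capacity as usable, which overstates unified-memory machines where the OS keeps a share. The 45.74 GB requirement is an assumption: 35 GB of 4-bit weights for a hypothetical dense 70B model plus about 10.74 GB of FP16 KV cache at 32k context (80 layers, 8 KV heads, head dim 128). `BOXES` and `boxes_that_fit` are illustrative names.

```python
# (name, capacity in GB, bandwidth in GB/s): a subset of the table above, one unit each.
BOXES = [
    ("NVIDIA H200 SXM", 141, 4800),
    ("NVIDIA H100 SXM5", 80, 3350),
    ("NVIDIA GeForce RTX 5090", 32, 1790),
    ("NVIDIA GeForce RTX 4090", 24, 1010),
    ("Apple Mac Studio (M3 Ultra)", 512, 819),
    ("NVIDIA DGX Spark (GB10)", 128, 273),
    ("AMD Ryzen AI Max+ 395 (Strix Halo)", 128, 256),
    ("Intel Core Ultra 200V (Lunar Lake)", 32, 136),
]

def boxes_that_fit(footprint_gb, streamed_gb_per_token, boxes=BOXES):
    """Keep boxes whose capacity holds the footprint, ranked by decode ceiling."""
    ranked = [
        (name, bandwidth / streamed_gb_per_token)  # memory-bound tokens/sec ceiling
        for name, capacity, bandwidth in boxes
        if footprint_gb <= capacity  # assumes the full listed capacity is usable
    ]
    return sorted(ranked, key=lambda row: row[1], reverse=True)

# Hypothetical dense 70B at 4 bits (35 GB) plus ~10.74 GB of KV cache at 32k context.
need_gb = 35 + 10.74
for name, ceiling in boxes_that_fit(need_gb, need_gb):
    print(f"{name:40s} <= {ceiling:5.1f} tok/s")
```

Because the model is dense, the footprint and the bytes streamed per token are the same number here; for a mixture-of-experts model the second argument would shrink to the active weights plus KV cache while the first stays at the total.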

Cross-links: /stack/silicon · the self-host course modules on memory math, quantization, and inference engines.