The Open-Source AI Stack

Hardware

Will it fit, and how fast?

Two questions decide whether you can run an open model on a given box. Memory capacity decides what fits. Memory bandwidth, the rate (GB/s or TB/s) at which an accelerator reads its memory, decides how fast it runs: it sets the ceiling on decode tokens/sec, since each token streams the active weights once. They are not the same number, and a box that holds a model can still be too slow to serve it.
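The bandwidth ceiling can be sketched in a few lines. This is a minimal illustration, not the explorer's implementation: the 70B-parameter, 4-bit model is hypothetical, the function names are made up for this example, and the fit check looks at weights only (no KV cache or runtime overhead). Capacity and bandwidth figures come from the table on this page.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB: parameter count times bytes per parameter."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on decode speed: each token streams the active weights once."""
    return bandwidth_gb_s / active_weight_gb


# Hypothetical model: 70B dense parameters stored at 4 bits per weight.
weights_gb = weight_footprint_gb(70, 4)  # 35.0 GB

# (name, memory in GB, bandwidth in GB/s), taken from the table on this page.
boxes = [
    ("NVIDIA H100 SXM5", 80, 3350),
    ("Apple Mac mini (M4 Pro)", 64, 273),
]

for name, capacity_gb, bandwidth_gb_s in boxes:
    fits = weights_gb <= capacity_gb  # weights only; ignores KV cache and overhead
    ceiling = decode_ceiling_tokens_per_sec(bandwidth_gb_s, weights_gb)
    print(f"{name}: fits={fits}, ceiling ~ {ceiling:.1f} tokens/sec")
```

Both boxes hold the weights, but their ceilings differ by more than a factor of ten, which is the gap between fitting a model and serving it. Real throughput lands below the ceiling.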

Pick a model, a quantization (storing weights in a lower-precision format such as FP8, INT8, or INT4 to cut memory and bandwidth at a small cost in quality), and a context length. Then compare specific boxes, find every box a model fits on, or pick a box and list everything it can run. The explorer shows whether the model fits, a theoretical roofline tokens/sec ceiling with its formula, a realistic range, a measured anchor where one exists, and a plain-language reason that traces back to bandwidth, capacity, and KV cache. The roofline is a performance model that bounds throughput by compute or memory bandwidth, whichever is the limiting resource at an operation's arithmetic intensity. The KV cache stores the key and value vectors of previously processed tokens and reuses them at each generation step, so an autoregressive model does not recompute attention over the entire prefix. The method comes straight from the self-host course.
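To see why context length is one of the three inputs, here is a rough footprint sketch that adds the KV cache to the weights. It assumes a standard transformer with one key and one value vector per token, per layer, per KV head, a single sequence, a 16-bit cache, and decimal gigabytes. The model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128) is hypothetical, and the function names are illustrative rather than the explorer's own.

```python
BYTES_PER_GB = 1e9  # assumption: decimal gigabytes


def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """KV cache size for one sequence: keys plus values for every cached token."""
    values_per_token = 2 * layers * kv_heads * head_dim  # 2 = key + value
    return values_per_token * context_tokens * bytes_per_value / BYTES_PER_GB


def footprint_gb(params_billion, bits_per_weight, layers, kv_heads, head_dim, context_tokens):
    """Weights plus KV cache; ignores activations and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb(layers, kv_heads, head_dim, context_tokens)


# Hypothetical model shape, quantized to 4 bits per weight.
MODEL = dict(params_billion=70, bits_per_weight=4, layers=80, kv_heads=8, head_dim=128)

capacity_gb = 64  # Apple Mac mini (M4 Pro), from the table on this page

for context_tokens in (32_768, 131_072):
    need_gb = footprint_gb(context_tokens=context_tokens, **MODEL)
    verdict = "fits" if need_gb <= capacity_gb else "does not fit"
    print(f"context {context_tokens}: {need_gb:.1f} GB needed, {verdict} in {capacity_gb} GB")
```

Under these assumptions the same model and quantization fit at a 32K context but not at 128K, because the cache grows linearly with the number of cached tokens.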

Theoretical numbers are labeled as estimates and show their inputs; measured anchors carry a date and source. Every hardware spec flows through the same verification gate as /models: an unverified value renders as —, never a guess.

The spectrum

All figures are per single unit. Compute figures are dense (never sparse). Bandwidth determines decode speed; capacity determines fit.


| Hardware | Vendor | Class | Memory | Bandwidth | FP16 dense | Power | Released |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AMD Instinct MI355X | AMD | Datacenter | 288 GB HBM3e | 8.00 TB/s | 2517 TF | 1400 W | 2025-06 |
| NVIDIA B200 | NVIDIA | Datacenter | 180 GB HBM3e | 7.70 TB/s | 2250 TF | 1000 W | 2025-01 |
| AMD Instinct MI325X | AMD | Datacenter | 256 GB HBM3e | 6.00 TB/s | 1307 TF | 1000 W | 2024-10 |
| AMD Instinct MI300X | AMD | Datacenter | 192 GB HBM3 | 5.30 TB/s | 1307 TF | 750 W | 2023-12 |
| NVIDIA H200 SXM | NVIDIA | Datacenter | 141 GB HBM3e | 4.80 TB/s | 989 TF | 700 W | 2024-01 |
| NVIDIA H100 SXM5 | NVIDIA | Datacenter | 80 GB HBM3 | 3.35 TB/s | 989 TF | 700 W | 2022-09 |
| NVIDIA H100 PCIe | NVIDIA | Datacenter | 80 GB HBM2e | 2.00 TB/s | 756 TF | 350 W | 2022-09 |
| NVIDIA GeForce RTX 5090 | NVIDIA | Workstation | 32 GB GDDR7 | 1.79 TB/s | 209.5 TF | 575 W | 2025-01 |
| NVIDIA RTX PRO 6000 Blackwell | NVIDIA | Workstation | 96 GB GDDR7 | 1.79 TB/s | | 600 W | 2025-03 |
| NVIDIA GeForce RTX 4090 | NVIDIA | Workstation | 24 GB GDDR6X | 1.01 TB/s | 165.2 TF | 450 W | 2022-10 |
| AMD Radeon AI PRO R9700 | AMD | Workstation | 32 GB GDDR6 | 640 GB/s | 191 TF | 300 W | 2025-07 |
| Tenstorrent Blackhole p150a | Tenstorrent | Workstation | 32 GB GDDR6 | 512 GB/s | | 300 W | 2025-08 |
| Apple Mac Studio (M3 Ultra) | Apple | Apple unified | 512 GB Unified LPDDR5X | 819 GB/s | | 270 W | 2025-03 |
| Apple MacBook Pro (M5 Max) | Apple | Apple unified | 128 GB Unified LPDDR5X | 614 GB/s | | 40 W | 2026-03 |
| Apple Mac Studio (M4 Max) | Apple | Apple unified | 128 GB Unified LPDDR5X | 546 GB/s | | 160 W | 2025-03 |
| Apple MacBook Pro (M4 Max) | Apple | Apple unified | 128 GB Unified LPDDR5X | 546 GB/s | | 40 W | 2024-10 |
| Apple Mac mini (M4 Pro) | Apple | Apple unified | 64 GB Unified LPDDR5X | 273 GB/s | | 65 W | 2024-10 |
| Apple MacBook Air (M5) | Apple | Apple unified | 32 GB Unified LPDDR5X | 153 GB/s | | 20 W | 2026-03 |
| NVIDIA DGX Spark (GB10) | NVIDIA | x86 unified | 128 GB Unified LPDDR5X | 273 GB/s | | 140 W | 2025-10 |
| AMD Ryzen AI Max+ 395 (Strix Halo) | AMD | x86 unified | 128 GB Unified LPDDR5X | 256 GB/s | | 120 W | 2025-01 |
| Qualcomm Snapdragon X2 Elite | Qualcomm | AI PC | 48 GB LPDDR5X | 228 GB/s | | 80 W | 2025-09 |
| Intel Core Ultra 200V (Lunar Lake) | Intel | AI PC | 32 GB LPDDR5X | 136 GB/s | | 37 W | 2024-09 |
| Qualcomm Snapdragon X Elite | Qualcomm | AI PC | 64 GB LPDDR5X | 135 GB/s | | 80 W | 2024-06 |
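Finding every box a given footprint fits on amounts to filtering the table by capacity and ordering what remains by bandwidth. The sketch below does that for five rows copied from the table above. The `Box` type, the `boxes_that_fit` helper, and the 100 GB requirement are hypothetical, chosen only to illustrate the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    name: str
    memory_gb: float
    bandwidth_gb_s: float


# A subset of rows from the table above (memory in GB, bandwidth in GB/s).
BOXES = [
    Box("AMD Instinct MI300X", 192, 5300),
    Box("NVIDIA H100 SXM5", 80, 3350),
    Box("NVIDIA GeForce RTX 5090", 32, 1790),
    Box("Apple Mac Studio (M3 Ultra)", 512, 819),
    Box("NVIDIA DGX Spark (GB10)", 128, 273),
]


def boxes_that_fit(required_gb, boxes=BOXES):
    """Keep boxes with enough memory, fastest memory first."""
    fitting = [box for box in boxes if box.memory_gb >= required_gb]
    return sorted(fitting, key=lambda box: box.bandwidth_gb_s, reverse=True)


required_gb = 100.0  # hypothetical footprint: weights plus KV cache

for box in boxes_that_fit(required_gb):
    print(f"{box.name}: {box.memory_gb} GB, {box.bandwidth_gb_s} GB/s")
```

In this example the two boxes with the second- and third-highest bandwidth drop out on capacity, while a unified-memory box with far less bandwidth stays in: the capacity filter and the bandwidth ordering answer different questions.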

Cross-links: /stack/silicon · the self-host course modules on memory math, quantization, and inference engines.