The Open-Source AI Stack

02 Silicon

core

Chips and ISAs that execute the math.

Overview

The substrate. Instruction set architectures (ISAs) and the physical accelerators that execute matrix math. Silicon is the slowest-moving layer in the stack and the one everything above inherits constraints from: a chip that tapes out today ships ~18 months later and stays in service for a decade. Decisions made here propagate up through compute, runtime, weights, and agents for years.

Five things to keep in mind as you read:

  • Two independent openness questions live here. Is the instruction set open? Is the physical accelerator open? They don’t have the same answers.
  • RISC-V is the open ISA. ARM and x86 are not. ARM ships under a paid license; x86 is jointly held by Intel and AMD.
  • Almost no production accelerator is open. Tenstorrent is the noteworthy exception; everyone else (NVIDIA, AMD, Cerebras, Groq, Apple) ships closed silicon.
  • The real lock-in is software, not silicon. CUDA plus its library stack (cuDNN, cuBLAS, NCCL, TensorRT-LLM) is what makes NVIDIA hard to leave even when alternative chips exist.
  • This layer moves on a different clock than the rest of the stack. A new runtime ships in weeks; a new ISA takes a decade.

The rest of this page works through each question in turn.

The ISA question

An ISA is the contract between software and hardware. It defines the instructions a chip exposes and the registers and memory model software can rely on. ISAs are intellectual property in their own right, separable from the chips that implement them.
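One way to picture the "contract" idea is a toy ISA. The sketch below is purely illustrative: the three-instruction set, the register names, and the `run` function are invented for this page and do not correspond to RISC-V, ARM, or x86. The only point is that a program written against the instruction definitions does not care how the implementation underneath is built.

```python
from typing import Dict, List, Tuple

# A hypothetical three-instruction ISA (not a real one).
# Each instruction is (opcode, destination, operand_a, operand_b).
Instruction = Tuple[str, str, object, object]


def run(program: List[Instruction]) -> Dict[str, int]:
    """Reference interpreter: one possible implementation of the toy ISA."""
    regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0}  # architectural registers
    for op, dst, a, b in program:
        if op == "li":        # load immediate: dst = a (b is unused)
            regs[dst] = int(a)
        elif op == "add":     # dst = regs[a] + regs[b]
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":     # dst = regs[a] * regs[b]
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"illegal instruction: {op}")
    return regs


# The program relies only on the contract, not on how `run` is built.
program = [
    ("li", "r1", 6, None),
    ("li", "r2", 7, None),
    ("mul", "r3", "r1", "r2"),
    ("add", "r0", "r3", "r1"),
]
result = run(program)
print(result["r3"], result["r0"])
```

Swapping `run` for a faster or differently structured interpreter would leave `program` unchanged, which is the sense in which an ISA is separable from the chips that implement it.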

RISC-V is the open ISA. Originally a 2010 Berkeley research project, the base specification is published under a permissive open license and ratified by RISC-V International (RISC-V specifications). Anyone can design and ship a RISC-V chip without paying a licensing fee. The catch for AI workloads is that RISC-V’s vector and matrix extensions (the parts that matter for ML) are still maturing, and the volume of production AI silicon shipping on RISC-V today is small relative to NVIDIA’s.

ARM is the most widely licensed proprietary ISA. It ships under a paid license; Apple’s M-series, NVIDIA’s Grace, and AWS Graviton are all built under ARM licenses. The implementations are proprietary; the ISA itself is closed but widely licensed.

x86 is jointly held by Intel and AMD via the 1995 cross-license agreement that ended their litigation. In practice, no new entrant can implement x86 without a license from them. For data-center AI inference, x86 still runs the CPU side of most clusters, but the accelerator math happens on something else.

The accelerator question

A separate question. Even with an open ISA, the physical chip design can be open or closed; the two are independent.

The dominant production accelerators remain NVIDIA’s H100 (80GB HBM3) and H200 (141GB HBM3e) data-center GPUs (NVIDIA H100 product page), plus the Blackwell B100/B200/GB200 family shipping into 2025-2026. All of it is closed silicon, closed schematics, closed firmware.

The credible alternatives, each closed but each playing a different game:

  • AMD MI300X (192GB HBM3, 5.3 TB/s) and MI325X (256GB HBM3e, 6 TB/s): more memory per accelerator than the NVIDIA contemporaries, software stack via ROCm and HIP, used by Microsoft and Meta with growing OpenAI inference adoption (AMD Instinct MI300X)
  • Cerebras WSE-3: wafer-scale chip, 900,000 cores and 4 trillion transistors on a single 5nm die, 44 GB on-chip SRAM, optimized for low-latency batch-1 inference (Cerebras WSE-3 press release)
  • Groq LPU: deterministic compiler-scheduled silicon, 230 MB on-chip SRAM (no external memory), very fast token throughput on dense transformer inference (Groq LPU architecture)
  • Apple Silicon (M-series with Neural Engine + GPU + unified memory; M4 Max ships up to 128 GB of unified memory): the open-weights ecosystem’s preferred local-inference platform, via MLX and Metal (Apple M4 Pro / M4 Max announcement)
  • Tenstorrent Wormhole and Blackhole: the noteworthy open-silicon counter-bet, RISC-V-based (Wormhole uses Tensix cores with RISC-V control processors; Blackhole adds 16 “big” RISC-V CPU cores and up to 32GB GDDR6), with open software stack (Tenstorrent Wormhole product page)
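To make the memory figures in this section concrete, here is a rough back-of-the-envelope sketch. It uses only the capacities and bandwidths quoted above; the 70-billion-parameter model is a hypothetical example, and the function names are made up for illustration. It counts weights only (no KV cache, activations, or runtime overhead), and the decode ceiling relies on the common rule of thumb that batch-1 decoding reads every weight once per token. Treat the numbers as an upper bound on intuition, not a benchmark.

```python
# Memory capacities in GB, copied from the figures quoted in this section.
CAPACITY_GB = {
    "NVIDIA H100": 80,
    "NVIDIA H200": 141,
    "AMD MI300X": 192,
    "AMD MI325X": 256,
    "Apple M4 Max": 128,
    "Tenstorrent Blackhole": 32,
}


def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def devices_that_fit(params_billion: float, bits_per_weight: int) -> list:
    """Devices whose memory can hold the weights on a single chip."""
    need = weights_gb(params_billion, bits_per_weight)
    return [name for name, cap in CAPACITY_GB.items() if cap >= need]


def decode_ceiling_tokens_per_s(bandwidth_tb_s: float,
                                params_billion: float,
                                bits_per_weight: int) -> float:
    """Bandwidth-bound ceiling: every weight is read once per token."""
    return bandwidth_tb_s * 1000 / weights_gb(params_billion, bits_per_weight)


# Hypothetical 70B-parameter model at 16 and 4 bits per weight.
print(weights_gb(70, 16), devices_that_fit(70, 16))
print(weights_gb(70, 4), devices_that_fit(70, 4))
# MI300X bandwidth (5.3 TB/s) taken from the list above.
print(round(decode_ceiling_tokens_per_s(5.3, 70, 16), 1))
```

Under these assumptions, memory per accelerator decides how many chips a model needs before any software question comes up, which is why the capacity numbers appear next to every entry in the list.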

The competitive question is not whether these chips exist (they do, and they ship at scale) but whether their software stacks let real workloads move off CUDA. So far the answer is “for defined inference workloads, sometimes; for training, almost never.”

The CUDA moat

The lock-in at this layer is software. NVIDIA’s CUDA platform shipped in 2007. The accumulated kernels, libraries, and tuning work since then (cuDNN for deep learning primitives, cuBLAS for linear algebra, NCCL for multi-GPU communication, TensorRT-LLM for inference) are what make NVIDIA hard to leave even when the underlying chips have credible competitors.

AMD’s response is ROCm (the rough CUDA equivalent) and HIP (a CUDA-like C++ API, with HIPIFY tools that translate CUDA source so it can be built for ROCm). HIP works, but on real workloads the published benchmark gap between CUDA and ROCm remains a few-x and the operator-coverage gap remains a long tail. As of 2026, AMD is closing both; closing isn’t the same as caught up.
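To give a feel for what porting CUDA source toward HIP involves at the most mechanical level, here is a toy renaming pass. It is not HIPIFY and does not reproduce how AMD's tools work; `toy_hipify` and the small mapping table are illustrative assumptions. The handful of API names in the table are real CUDA runtime calls and their HIP counterparts, but a real port also has to deal with libraries, build systems, and performance tuning, which is where the gap described above lives.

```python
import re

# A few CUDA runtime names and their HIP counterparts (tiny, incomplete table).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Match whole identifiers only, trying longer names first so that
# cudaMemcpyHostToDevice is never cut short at cudaMemcpy.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def toy_hipify(source: str) -> str:
    """Rename known CUDA identifiers; leave everything else untouched."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


cuda_src = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&d_x, n);\n"
    "cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);\n"
    "cudaFree(d_x);\n"
)
print(toy_hipify(cuda_src))
```

The renaming is the easy part. Anything not in the table, such as a call into cuDNN or TensorRT-LLM, passes through unchanged, which is a small-scale picture of the operator-coverage problem.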

The underlying point is that “the chip” and “the productive software stack for that chip” are different things to be open or closed about. A learner who only asks “is the silicon open?” misses the question that actually decides which workloads can move.

What’s open and what isn’t

ISA-open and accelerator-open are independent.

  • Licensable ISA, closed accelerator: most modern ARM chips (the ISA is closed but widely licensed; the implementations are proprietary). Doesn’t apply cleanly to AI accelerators today.
  • Open ISA, open accelerator: Tenstorrent. The only entry with both at production scale.
  • Closed ISA, closed accelerator: NVIDIA H100/H200, Blackwell, AMD MI300X, Cerebras, Groq, Apple Silicon. Almost everything else.
  • Open ISA, open accelerator, open software stack: nothing yet at production scale. The asymptote toward which Tenstorrent plus an open compiler stack would point.
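The two-axis framing can be written down as a small data structure. The sketch below simply transcribes this page's own classification of the AI accelerators named above; the `Chip` class and `group_by_openness` function are hypothetical names, and the true/false values are the page's editorial judgment rather than a formal certification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    isa_open: bool          # is the instruction set open?
    accelerator_open: bool  # is the physical accelerator open?


# Classification as given on this page (editorial, not a formal audit).
CHIPS = [
    Chip("Tenstorrent", True, True),
    Chip("NVIDIA H100/H200", False, False),
    Chip("AMD MI300X", False, False),
    Chip("Cerebras", False, False),
    Chip("Groq", False, False),
    Chip("Apple Silicon", False, False),
]


def group_by_openness(chips):
    """Bucket chip names by the (isa_open, accelerator_open) pair."""
    groups = defaultdict(list)
    for chip in chips:
        groups[(chip.isa_open, chip.accelerator_open)].append(chip.name)
    return dict(groups)


groups = group_by_openness(CHIPS)
for key, names in groups.items():
    print(key, names)
```

Keeping the two booleans as separate fields, rather than a single "open" flag, is the design choice that mirrors the argument: the axes are independent even though most entries today land in the same corner.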

The OSI doesn’t define open hardware; the Open Source Hardware Association (OSHWA) does (OSHWA definition), and the RISC-V International specs follow it. Open hardware definitions are still less binding than open software definitions in practice because almost no AI silicon meets them yet.

The editorial tension

Open silicon as a sovereignty bet has the longest payoff horizon of any layer on this site. Even if Tenstorrent ships well and the open ROCm story matures, the production AI capacity won’t shift on the timescale of policy debates. The realistic five-year story is that CUDA stays dominant, AMD captures some share, and open-ISA silicon survives in research and specialty-inference niches rather than dethroning anyone.

The argument against shrugging is that silicon decisions made today set the constraint surface for the next decade. The 2026 capex going into hyperscaler GPUs is the 2036 installed base. If sovereign-state programs (UAE, Saudi Arabia, India) all default to NVIDIA, they’ve quietly chosen who can audit and who can withhold ten years out. The open-silicon bet is slow because silicon is slow, not because it doesn’t matter.

Key terms for this layer

  • BF16: A 16-bit floating-point format with FP32's exponent range and only 7 mantissa bits. Designed for neural-network training; standard across 2026 accelerators alongside FP16.

  • Cerebras: An AI compute company built around wafer-scale chips (the WSE-3 is a single die covering most of a 300mm wafer), offering some of the lowest inference latency on the market.

  • CUDA: NVIDIA's parallel-computing platform and proprietary toolchain, the de facto programming model for GPU-accelerated machine learning since the late 2000s.

  • FP16: A 16-bit floating-point format used as the default precision for deep learning training and inference, halving memory versus FP32 with small quality cost on most workloads.

  • FP4: A 4-bit floating-point format with hardware-native multiplication on Blackwell-generation accelerators. NVFP4 and MXFP4 variants target large-model inference and post-training quantization.
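The 16-bit format with FP32's exponent range and 7 mantissa bits described above (commonly written BF16 or bfloat16) is easy to see in code, because it is the top half of an FP32 value. The sketch below is a minimal illustration using only the Python standard library. It truncates the low 16 bits, which is an assumption made for simplicity; real hardware and frameworks typically round instead. The function names are made up for this example.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 encoding (truncation, no rounding)."""
    (fp32_bits,) = struct.unpack(">I", struct.pack(">f", x))
    return fp32_bits >> 16


def bf16_bits_to_float(bits: int) -> float:
    """Pad the 16 bits with zeros to get back an FP32 value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits << 16))
    return value


def split_bf16(bits: int):
    """Return (sign, exponent, mantissa): 1, 8, and 7 bits respectively."""
    sign = bits >> 15
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    return sign, exponent, mantissa


bits = float_to_bf16_bits(3.14159)
print(hex(bits), split_bf16(bits), bf16_bits_to_float(bits))
```

With only 7 mantissa bits, 3.14159 comes back as 3.140625: the range of FP32 is preserved, but roughly two to three decimal digits of precision remain. That is the trade the format makes.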