The Open-Source AI Stack

01 Infrastructure

core

Data centers, power, cooling, and the grid that runs the rest of the stack.

Overview

The infrastructure layer is where the math has to physically happen: the buildings, the electricity, the cooling, and the grid interconnect that determine whether a multi-gigawatt training run can even start. It sits below silicon (the chips themselves) and below compute (the control-plane scheduling that runs on those chips). The question this layer answers is whether the math gets done at all, and on whose land, under whose regulator, drawing from whose grid.

Five things to keep in mind as you read:

  • The binding constraint moved. Through 2024 it was GPU allocation. Through 2025 and 2026 it became power.
  • A new category of operator showed up. Sovereign-state programs (G42, Humain, IndiaAI) joined hyperscalers and AI-native lessors (CoreWeave, Lambda, Crusoe, Nebius) in the capex race.
  • There are two distinct shapes of “decentralized”. Marketplaces of independent operators (Akash, io.net, Bittensor) versus one company siting many small facilities (Crusoe’s original model). Don’t conflate them.
  • Almost nothing here is OSI-open. “Openness” at this layer is jurisdictional and organizational, not licensing.
  • The editorial tension is concentration versus distribution. Whether 5 GW lives in 4 hyperscaler regions or 40 operators across 20 jurisdictions decides what’s technically possible.

The rest of this page works through each of these in turn.

The constraint moved from chips to electricity

Three events tell the story.

In September 2024, Constellation Energy announced a 20-year power-purchase agreement with Microsoft to restart Three Mile Island Unit 1 (rebranded the Crane Clean Energy Center, 835 MW, target restart 2028) to feed Azure AI (Constellation announcement, Sept 20 2024). It was the first US nuclear restart driven by AI demand. In January 2025, OpenAI, SoftBank, Oracle, and MGX launched the $500B Stargate Project, with $100B of immediate deployment (OpenAI Stargate announcement, Jan 21 2025). Earlier, in March 2024, Amazon Web Services had bought the Cumulus Data campus in Berwick, Pennsylvania for $650M, co-located with Talen Energy’s Susquehanna nuclear plant and with a power-purchase agreement attached (Talen press release, March 4 2024).

The pattern is that the capital is no longer chasing GPUs. It’s chasing electrons.

The lead-time asymmetry is the reason. A chip order with NVIDIA takes months. Building gigawatt-scale generation takes years: nuclear restarts are the fastest path and still measure in multi-year increments; greenfield reactors are five to ten years from interconnect filing to first power; even gas turbines bump into queue constraints in the major markets. Chip procurement is a commercial problem with a clear counterparty. Power procurement is a multi-year political and regulatory problem involving utilities, grid operators, environmental review, and in many cases state legislatures.
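One way to read the asymmetry is as a critical-path problem: a site comes online only when its slowest input arrives. The sketch below is illustrative only. The month figures are rough assumptions picked to match the ranges above (a chip order in months, the Crane restart running roughly from a 2024 announcement to a 2028 target, greenfield reactors at five to ten years), not measured data, and the names are hypothetical.

```python
# Illustrative lead times in months. All values are assumptions chosen to
# match the rough ranges in the text, not sourced measurements.
LEAD_TIME_MONTHS = {
    "gpu_order": 9,             # "months" for a chip order (assumed 9)
    "nuclear_restart": 40,      # roughly Sept 2024 announcement to a 2028 target
    "greenfield_reactor": 90,   # midpoint of the five-to-ten-year range
}

def binding_constraint(inputs):
    """Return (name, months) for the slowest of the required inputs."""
    name = max(inputs, key=lambda k: LEAD_TIME_MONTHS[k])
    return name, LEAD_TIME_MONTHS[name]

restart_site = binding_constraint(["gpu_order", "nuclear_restart"])
greenfield_site = binding_constraint(["gpu_order", "greenfield_reactor"])
print(restart_site)
print(greenfield_site)
```

Under any plausible choice of numbers in these ranges, the power input gates the schedule and the chip order does not.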

The capex race and the new entrants

The hyperscalers (Microsoft Azure, Google, AWS, Meta, Oracle) still own most of the frontier-scale capacity. Two new categories joined them.

Sovereign-state programs put national budgets into AI infrastructure as a strategic asset. The UAE’s G42 took a $1.5B strategic investment from Microsoft in April 2024 and continues to expand under MGX (Microsoft G42 announcement, April 16 2024). Saudi Arabia’s sovereign compute company Humain, launched May 13 2025 under PIF, is chaired by HRH Crown Prince Mohammed bin Salman and is building 1.9 GW of AI-focused data-center capacity by 2030, with a roadmap to 6.6 GW over the following four years, alongside a separate $10B venture-capital fund (Humain launch, PIF press release, May 13 2025). The IndiaAI Mission committed ₹10,300 crore (~$1.14B) in March 2024, with explicit subsidized-GPU language for Indian startups and academia (Cabinet approval, PIB India, March 7 2024).

AI-native infrastructure operators sit between the hyperscalers and the long tail: CoreWeave (IPO’d March 2025), Lambda, Crusoe, and Nebius. They run AI-purpose-built campuses but lease capacity wholesale rather than serving end users. They moved on power-and-siting before the hyperscalers fully woke up to it, which is now translating into bargaining leverage.

Counterweights, categorized properly

The decentralized response to capex concentration comes in two organizationally distinct shapes. They’re often conflated. They shouldn’t be.

Organizational decentralization

A market design where many independent operators contribute capacity and a protocol matches supply to demand. Akash Network bids GPU jobs against node operators with a token-settled rail. io.net aggregates idle GPUs from independent providers behind a single API. Bittensor’s compute subnets (Subnet 27 for the original compute mining, Subnet 51 for the newer P2P GPU rental marketplace) reward operators in TAO for serving workloads. The operator set is permissionless, no single entity owns the capacity, and the worst-case failure is degraded service rather than a single-vendor outage. Bittensor’s decentralized-training work lives on Subnet 3 (Templar) and is covered at the training layer.
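As a rough illustration of what "a protocol matches supply to demand" means, the sketch below runs a toy reverse auction: each job goes to the cheapest independent operator that has enough free GPUs and asks no more than the job's price ceiling. It is not the actual matching logic of Akash, io.net, or Bittensor. The names `Operator`, `Job`, and `match_jobs`, and all prices and capacities, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    free_gpus: int
    price_per_gpu_hour: float  # hypothetical ask price

@dataclass
class Job:
    name: str
    gpus: int
    max_price_per_gpu_hour: float  # hypothetical bid ceiling

def match_jobs(jobs, operators):
    """Greedy reverse auction: the cheapest eligible operator wins each job."""
    assignments = {}
    for job in jobs:
        eligible = [
            op for op in operators
            if op.free_gpus >= job.gpus
            and op.price_per_gpu_hour <= job.max_price_per_gpu_hour
        ]
        if not eligible:
            # No single operator can serve the job: it stays unmatched.
            assignments[job.name] = None
            continue
        winner = min(eligible, key=lambda op: op.price_per_gpu_hour)
        winner.free_gpus -= job.gpus  # reserve the capacity
        assignments[job.name] = winner.name
    return assignments

operators = [
    Operator("op-a", free_gpus=8, price_per_gpu_hour=2.10),
    Operator("op-b", free_gpus=4, price_per_gpu_hour=1.80),
    Operator("op-c", free_gpus=16, price_per_gpu_hour=2.50),
]
jobs = [
    Job("finetune", gpus=4, max_price_per_gpu_hour=2.00),
    Job("batch-infer", gpus=8, max_price_per_gpu_hour=2.20),
    Job("pretrain", gpus=32, max_price_per_gpu_hour=3.00),
]
assignments = match_jobs(jobs, operators)
print(assignments)
```

In this toy run the two small jobs land on different operators and the 32-GPU job finds no single operator large enough, which is the kind of degraded-service outcome described above rather than a single-vendor outage.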

Physical decentralization

One company sites many small facilities instead of one giant campus. Crusoe’s original play was modular data centers co-located with flare gas stacks in North Dakota and Texas oil fields, capturing energy that was otherwise being burned. That’s physically distributed but organizationally centralized: Crusoe is a single proprietary company that owns and operates every site.

The two models share a “not the hyperscaler default” vibe but solve different problems. Marketplace decentralization addresses operator-set diversity. Physical decentralization addresses siting and energy economics.

Companies move on this spectrum

Crusoe itself illustrates the point. Crusoe is the lead developer of Stargate’s flagship campus in Abilene, Texas, a concentrated gigawatt-scale site: a planned 1.2 GW at the Lancium Clean Campus, expanded in May 2025 from two buildings to eight with $11.6B in debt and equity, with the first phase live on Oracle Cloud Infrastructure as of Sept 30 2025 (Crusoe Abilene live announcement, Sept 30 2025). The company that started as the poster child for distributed flare-gas siting is now the contractor for the hyperscaler-default play. The category label doesn’t stick to a company across its history; it sticks to the specific business model.

A third axis cross-cuts both. Behind-the-meter siting (where the data center sits on the same side of the utility meter as the generation, bypassing the grid interconnect queue) is geographically agnostic and can serve either model.

What’s open and what isn’t

Almost nothing at this layer is open-source in the OSI sense. Data centers are buildings. Transformers and switchgear are commercial products. Power purchase agreements are private contracts. There is no equivalent here to vLLM or DeepSeek-V3 that you can fork and run.

The relevant openness question shifts. Not “is the artifact open-source?” but “whose state has signing authority over the gigawatts, and how many independent operators are in the set?” That’s why infrastructure maps tightly to the sovereignty and decentralization primitives layer rather than to governance, which is about licensing further up the stack.

The editorial tension

Concentration versus distribution of gigawatts. That’s the debate.

Whether 5 GW of frontier compute lives in 4 hyperscaler regions or 40 operators across 20 jurisdictions decides whether sovereign-state or individual-scale open-source AI is technically possible at frontier scale at all.
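To put a number on the two scenarios, one option is the Herfindahl-Hirschman index (the sum of squared capacity shares), a standard concentration measure that this page does not itself use. The sketch assumes, purely for illustration, that the 5 GW is split evenly in each scenario; the function and variable names are hypothetical.

```python
def hhi(capacities_gw):
    """Herfindahl-Hirschman index: sum of squared shares, in (0, 1]."""
    total = sum(capacities_gw)
    return sum((c / total) ** 2 for c in capacities_gw)

TOTAL_GW = 5.0

# Assumption: an even split in both scenarios; real fleets are uneven.
concentrated = [TOTAL_GW / 4] * 4    # 4 hyperscaler regions, 1.25 GW each
distributed = [TOTAL_GW / 40] * 40   # 40 operators, 0.125 GW (125 MW) each

hhi_concentrated = hhi(concentrated)
hhi_distributed = hhi(distributed)
print(round(hhi_concentrated, 3), round(hhi_distributed, 3))
```

With an even split the index is simply 1/N, so it falls from 1/4 to 1/40 between the two scenarios. It measures operator concentration only; the jurisdiction count would need a separate grouping of the same capacities.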

The concentration side argues that frontier training requires tightly-coupled, high-bandwidth gigawatt-scale campuses that only hyperscalers and well-financed states can build. The distribution side argues that inference, fine-tuning, and even some training workloads can be served from heterogeneous physically-distributed compute, and that the sovereignty cost of running an open ecosystem on closed infrastructure eventually dominates the efficiency cost of running it on slower distributed alternatives.

Both sides agree the trajectory is uncertain. The disagreement is which side has the lower long-run cost.

Key terms for this layer

  • AI factory

    A purpose-built data center optimized for AI training rather than general cloud workloads, characterized by liquid-cooled high-density GPU racks, gigawatt-scale single-tenant power, and tightly-coupled networking.

  • behind-the-meter

    A power arrangement where generation sits on the same side of the utility meter as the load, letting a data center draw directly from the plant and bypass the grid.

  • decentralized GPU marketplace

    A protocol market matching GPU supply from many independent providers to AI demand, settled on a token rail; Akash, io.net, Bittensor compute, and Hyperbolic are canonical.

  • direct-to-chip cooling

    A cooling architecture that pipes liquid coolant directly to a cold plate on each processor, evacuating the 700+ watts per GPU that air cooling cannot handle.

  • gigawatt-class cluster

    An AI training facility whose power draw is measured in gigawatts rather than megawatts, the scale at which siting decisions become grid-and-permitting problems rather than real-estate ones.