05 Training

core

Tools to pretrain and fine-tune.

Overview

The tools that take open data plus an architecture and produce weights. This is a tool layer, not a model layer; the weights it produces live one layer up at weights.

Five things to keep in mind as you read:

Two distinct ecosystems share the layer. Pretraining infrastructure (frontier labs, thousands of GPUs) and fine-tuning infrastructure (individuals, one to eight GPUs).
Pretraining stacks: Megatron-LM, DeepSpeed, FSDP, JAX on TPU. Used to train base models from scratch.
Fine-tuning stacks: Unsloth, Axolotl, LLaMA-Factory, HuggingFace TRL. Used to adapt existing open-weights models.
Decentralized training is a third cluster. Prime Intellect, Nous DisTrO, Templar, Pluralis. Smaller and slower than centralized frontier training, but shipping.
The 2026 sovereignty story lives in fine-tuning. Fine-tuning a Llama-class model went from “needs an A100” to “runs on a Mac plus a consumer GPU” because of the open fine-tuning stack.

The rest of this page works through each ecosystem and then the research direction that decides where this layer goes.

Pretraining infrastructure

What frontier labs use to train a base model from scratch on a cluster of thousands of GPUs over weeks or months.

The dominant stacks:

Megatron-LM (NVIDIA, open source) — the reference implementation for tensor parallelism, pipeline parallelism, and sequence parallelism on NVIDIA hardware. Most non-Google frontier labs build on Megatron or a descendant (Megatron-LM repository).
DeepSpeed (Microsoft, open source) — ZeRO-style memory partitioning for very large models on relatively modest GPU counts. Pairs with Megatron in the “Megatron-DeepSpeed” combination most labs settle on (DeepSpeed announcement).
FSDP (PyTorch native, Meta) — Fully Sharded Data Parallel, the PyTorch-native answer to ZeRO. Easier integration into PyTorch projects than Megatron-DeepSpeed; less mature for the very largest scales (FSDP documentation).
JAX on TPU (Google) — what Google uses internally. Pallas for custom kernels, Flax for model code. Open source but tightly coupled to TPU, which is closed silicon you can’t buy (JAX repository).
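To make the memory-partitioning idea behind ZeRO and FSDP concrete, here is a minimal sketch of the per-GPU accounting for model states. It assumes the mixed-precision Adam bookkeeping used in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations and buffers. The function name is hypothetical and is not part of DeepSpeed or PyTorch.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, zero_stage):
    """Estimate per-GPU bytes for model states under mixed-precision Adam.

    Assumption: 2 bytes/param for fp16 weights, 2 for fp16 gradients, and
    12 for fp32 optimizer state (master weights, momentum, variance).
    Activation memory is not counted.
    """
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if zero_stage == 0:   # plain data parallelism: everything replicated
        return weights + grads + optim
    if zero_stage == 1:   # shard optimizer state only
        return weights + grads + optim / num_gpus
    if zero_stage == 2:   # shard optimizer state and gradients
        return weights + (grads + optim) / num_gpus
    if zero_stage == 3:   # shard weights, gradients, and optimizer state
        return (weights + grads + optim) / num_gpus
    raise ValueError("zero_stage must be 0, 1, 2, or 3")


if __name__ == "__main__":
    # Illustrative setting: a 7.5B-parameter model on 64 GPUs
    for stage in range(4):
        gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

For the 7.5B-parameter, 64-GPU setting the estimate falls from 120 GB with everything replicated to under 2 GB with all three states sharded, which is why sharding lets very large models fit on relatively modest GPU counts. FSDP's full-sharding mode corresponds roughly to the stage 3 case.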

These stacks are open in the sense that anyone can clone the repos. They are not open in the sense that anyone can usefully run them: a frontier pretraining run needs a thousand-plus GPU cluster on a tight InfiniBand fabric, which exists at fewer than fifty operators worldwide. The tools are democratized; the hardware to run them at scale is not.

Fine-tuning infrastructure

The other half of the layer. Takes an existing open-weights base model and adapts it to a domain, task, or style, usually with a much smaller dataset and far less compute.

The dominant stacks:

HuggingFace TRL (Transformer Reinforcement Learning) — the reference fine-tuning library, ships with SFT, DPO, GRPO, and PPO trainers (TRL repository).
Unsloth — kernel-level optimizations for fine-tuning, ~2-5x faster than baseline HuggingFace for many setups, fits larger models on a single GPU via aggressive memory optimization (Unsloth repository).
Axolotl — a YAML-config wrapper over the lower-level libraries, the most-used fine-tuning recipe layer for serious practitioners (Axolotl repository).
LLaMA-Factory — Chinese-community-favorite alternative to Axolotl, broader model and method coverage including the Chinese open-weights families (LLaMA-Factory repository).
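Of the trainer types listed above, DPO has the most compact objective, so it makes a useful illustration of what these libraries optimize. The sketch below computes the DPO loss for a single preference pair from summed token log-probabilities, following the formula in the Direct Preference Optimization paper. It is not TRL's code; the function name, argument names, and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference agree, the loss is log 2. It drops as the policy shifts probability toward the chosen response relative to the reference, and no separate reward model is involved.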

The technical pivot that made this category accessible was parameter-efficient fine-tuning (PEFT), specifically LoRA and QLoRA. LoRA trains a small set of low-rank “adapter” matrices instead of the full model weights, which removes most of the gradient and optimizer-state memory that full fine-tuning needs. QLoRA combines that with 4-bit quantization of the frozen base model, which is what made the single-GPU fine-tuning stories possible from 2023 forward: the QLoRA paper reports fine-tuning a 33B model on one 24GB consumer GPU and a 65B model on one 48GB GPU. A 70B model still needs about 35GB for its 4-bit weights alone, so it does not fit on a single 24GB card without sharding or more aggressive quantization.
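A minimal sketch of the LoRA idea, in pure Python so it runs without any ML library. It is not the PEFT library's implementation: the class, helper names, and initialization constants are assumptions for illustration. The frozen weight W stays untouched, and the layer adds a scaled low-rank product B·A whose two small factors are the only trainable parameters.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight              # frozen base weights, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter is a no-op at init
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """x is batch x d_in (list of rows); returns batch x d_out."""
        xt = [list(col) for col in zip(*x)]             # d_in x batch
        base = matmul(self.weight, xt)                  # d_out x batch
        low_rank = matmul(self.B, matmul(self.A, xt))   # d_out x batch
        out = [[b + self.scale * l for b, l in zip(b_row, l_row)]
               for b_row, l_row in zip(base, low_rank)]
        return [list(col) for col in zip(*out)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def lora_param_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


def quantized_weight_gb(num_params, bits):
    """Rough memory for the frozen weights alone at a given bit width."""
    return num_params * bits / 8 / 1e9
```

For a 4096 x 4096 projection at rank 16, `lora_param_fraction(4096, 4096, 16)` is under 1% of the full matrix, which is why gradients and optimizer state shrink so much. `quantized_weight_gb(70e9, 4)` gives 35.0, a weights-only figure that excludes activations, adapters, and quantization metadata.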

Decentralized training

The third cluster. Smaller and slower than centralized frontier training, but shipping working models that prove the basic viability.

Prime Intellect ran INTELLECT-1 (10B, November 2024) and INTELLECT-2 (32B, May 11 2025) as fully distributed training runs across volunteer compute, using the OpenDiLoCo algorithm to cut cross-node bandwidth requirements (INTELLECT-2 release). INTELLECT-3 (106B MoE with 12B activated, 131K context) shipped in November 2025 (INTELLECT-3 page).
Nous DisTrO is a training framework that compresses the cross-node gradient updates aggressively (orders of magnitude), letting a training run survive on commodity internet bandwidth instead of InfiniBand (DisTrO project).
Templar (Bittensor Subnet 3) is the live permissionless-and-incentivized decentralized-training network. In March 2026 it completed Covenant-72B (72B parameters, ~1.1T tokens, MMLU 67.1), the largest decentralized-LLM pretraining run to that point (Templar / Subnet 3).
Pluralis is the related research-stage project on the bandwidth-vs-accuracy axis.
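Some back-of-envelope arithmetic shows why the projects above focus on communication. The sketch below estimates how long it takes to ship one full set of gradients over a single link. It is deliberately simplified: it ignores all-reduce topology, latency, and overlap with compute, and the link speeds and compression ratio in the usage note are illustrative assumptions, not measurements from any of these projects.

```python
def sync_seconds(num_params, bytes_per_value=2, link_gbit_per_s=1.0, compression=1.0):
    """Seconds to send one full set of gradients over a single link.

    compression is the factor by which the payload is reduced before sending
    (1.0 means uncompressed).
    """
    payload_bits = num_params * bytes_per_value * 8 / compression
    return payload_bits / (link_gbit_per_s * 1e9)


def amortized_seconds_per_step(sync_time, steps_between_syncs):
    """Communication cost per training step when syncing only every N steps."""
    return sync_time / steps_between_syncs
```

Under these assumptions, a 10B-parameter model with 2-byte gradients needs about 160 seconds per exchange on a 1 Gbit/s link, against about 0.4 seconds on a 400 Gbit/s datacenter link. A 1000x compression factor brings the 1 Gbit/s figure to about 0.16 seconds, and syncing only every 100 steps cuts the uncompressed cost to about 1.6 seconds per step. Those are the two levers the decentralized projects pull: send less per sync, and sync less often.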

The technical bet is that bandwidth-efficient communication algorithms (DiLoCo / OpenDiLoCo / DisTrO) cut communication cost enough that geographically distributed training matches centralized training at the same compute budget. As of 2026, the gap has closed enough that training at the 30B-100B scale is viable on commodity internet links (INTELLECT-3 at 106B, Templar’s Covenant at 72B). The open question is whether the curve continues to frontier scale.
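The structure of a DiLoCo-style run can be sketched on a toy problem. Each worker takes many local optimizer steps without communicating. The workers then exchange only the difference between the shared parameters and their local result (the pseudo-gradient), and an outer optimizer with Nesterov momentum applies the average. This is a simplified illustration, not the OpenDiLoCo implementation: the inner optimizer here is plain SGD rather than AdamW, each worker's objective is a simple quadratic with its own target, and all names and hyperparameters are assumptions.

```python
def diloco(targets, theta0, rounds, inner_steps,
           inner_lr=0.1, outer_lr=0.7, outer_momentum=0.9):
    """Toy DiLoCo loop: worker k minimizes 0.5 * ||theta - targets[k]||^2."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(rounds):
        deltas = []
        for target in targets:               # each worker runs independently
            local = list(theta)
            for _ in range(inner_steps):     # no communication inside this loop
                grad = [p - t for p, t in zip(local, target)]
                local = [p - inner_lr * g for p, g in zip(local, grad)]
            # pseudo-gradient: where the worker started minus where it ended
            deltas.append([g - l for g, l in zip(theta, local)])
        # the only communication in the round: average the pseudo-gradients
        avg = [sum(col) / len(deltas) for col in zip(*deltas)]
        # outer step: SGD with Nesterov momentum on the averaged pseudo-gradient
        velocity = [outer_momentum * v + d for v, d in zip(velocity, avg)]
        theta = [p - outer_lr * (d + outer_momentum * v)
                 for p, d, v in zip(theta, avg, velocity)]
    return theta


if __name__ == "__main__":
    workers = [[1.0, 0.0], [3.0, 4.0]]       # two workers with different data
    result = diloco(workers, [0.0, 0.0], rounds=50, inner_steps=20)
    print(result)                            # approaches the mean target [2.0, 2.0]
```

With 20 inner steps per round, the workers synchronize 50 times instead of the 1,000 times that per-step gradient averaging would need for the same number of optimizer steps. Whether that trade holds up for real models at frontier scale is the open question the paragraph above describes.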

What’s open and what isn’t

The training stacks are mostly open source software. What is not open is the hardware to run pretraining at scale (covered in compute and infrastructure) and the training data (covered in data).

The asymmetry: anyone can clone Megatron-DeepSpeed and read the parallelism implementations. Almost nobody can rent the cluster to run them at frontier scale. The “open training stack” matters most for fine-tuning, where the hardware is accessible, and least for pretraining, where the hardware is the binding constraint.

The reverse-lock-in risk at this layer arises when the training stack itself becomes tied to a single vendor. NVIDIA’s NeMo framework is the example: it’s an integrated training-and-deployment stack that’s easier to use than raw Megatron but binds the training pipeline to NVIDIA’s ecosystem more tightly than the lower-level libraries do.

The editorial tension

The democratization story at this layer is real and recent. Fine-tuning a Llama-class model went from “needs a $30k A100 cluster” in early 2023 to “runs on a $1500 consumer GPU plus a Mac” by late 2025. That happened because of LoRA / QLoRA, the open fine-tuning libraries, and the open-weights model releases that gave practitioners something to fine-tune.

The frontier-pretraining story is the opposite. The compute needed to pretrain a competitive frontier model went up by an order of magnitude over the same period, and the number of labs that can run a frontier pretraining run has stayed roughly constant (or shrunk). The training stacks are open; access to training is concentrating.

The decentralized training research is the wildcard. If it keeps closing the bandwidth-efficiency gap, the hyperscaler-only pretraining era ends and the layer becomes genuinely sovereign-accessible. If it stalls in the 30-100B range, the open fine-tuning ecosystem stays vibrant but the frontier remains a hyperscaler problem indefinitely.

Key terms for this layer

alignment

The training-and-evaluation work of shaping a model's behavior to match human intent, refuse harmful requests, and answer honestly, distinct from raw capability training.

Axolotl

An open YAML-driven fine-tuning framework that orchestrates Hugging Face Transformers, PEFT, TRL, and DeepSpeed for one-shot LoRA, QLoRA, and full fine-tuning workflows.

DeepSpeed

Microsoft's open-source training optimization library, originator of the ZeRO sharding technique and a peer to Megatron for distributed transformer training at scale.

DPO

A preference-tuning method that optimizes a model on pairwise human rankings directly, bypassing the reward-model and reinforcement-learning steps of RLHF.

fine-tuning

Continued training of a pretrained base model on a smaller, task-specific dataset to specialize its behavior without retraining from scratch.
