04 Data
Training corpora, open and closed.
Overview
The training corpora. The text, code, images, audio, and synthetic data that go into pretraining and fine-tuning. This is the layer where the word “open” carries the most weight, because releasing weights without releasing data lets a lab claim openness while still owning the part competitors can’t reproduce.
Five things to keep in mind as you read:
- Open weights and open data are different. Most “open-weights” releases keep training data closed.
- A handful of fully-open datasets exist: OLMo’s Dolma, RedPajama, FineWeb, the Common Pile. Almost all derive from Common Crawl plus filtering.
- The legal posture is unsettled. NYT v. OpenAI is unresolved, and its outcome will shape what training data labs can use.
- Licensed-data deals are non-transferable. When a lab pays a publisher for training data, that license doesn’t pass to downstream users of the resulting model.
- OSAID v1.0 made this the central definitional fight. Whether full data release is required, or whether “enough detail to retrain” suffices, is the open argument in 2026.
The rest of this page works through each.
The fully-open datasets
A small canon. They share two properties: source documents are mostly Common Crawl, and the filtering / deduplication pipeline is published.
- Dolma (AI2, 2024): 3 trillion tokens, used to pretrain the OLMo family. Among the cleanest and best-documented open pretraining corpora (Dolma paper, arXiv 2402.00159).
- RedPajama (Together AI, 2023; v1 in April, v2 in October): an open reproduction of LLaMA’s training data, followed by a 30T-token v2 release (RedPajama-Data-v2 announcement).
- FineWeb / FineWeb-Edu (Hugging Face, 2024): 15T tokens of filtered Common Crawl, with an “edu” variant further filtered for educational content by a quality classifier (FineWeb release).
- Common Pile v0.1 (EleutherAI with Poolside, Hugging Face, the US Library of Congress, and ~14 academic partners, released June 2025): 8 TB of public-domain and permissively licensed text; the successor to The Pile, with cleaner provenance (Common Pile v0.1 announcement).
A realistic frontier training run uses tens of trillions of tokens. The fully-open corpora are now in that range, which means a lab that wants to do open training has the raw material; the binding constraint has moved from data to compute and pipeline engineering.
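To make the “filtering / deduplication pipeline” idea concrete, here is a minimal sketch of the two steps on a toy list of documents. It assumes a single hypothetical quality heuristic (a minimum word count) and exact-match deduplication after whitespace and case normalization. The function names and the threshold are illustrative and are not taken from any of the projects above, whose published pipelines add near-duplicate detection and many more filters.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_and_dedup(docs, min_words=5):
    """Drop very short documents, then drop exact duplicates (after normalization).

    `min_words` is a hypothetical quality heuristic chosen for illustration.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # filtered out: too short
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)  # keep the original, un-normalized text
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",  # same text, different case/spacing
    "too short",
    "Another sufficiently long document about training data.",
]
cleaned = filter_and_dedup(docs)
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which is one reason hash-based deduplication is a common first pass at web scale.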
The legal landscape
Three live threads.
NYT v. OpenAI (S.D.N.Y. 1:23-cv-11195, filed December 2023) alleges that OpenAI’s training on Times articles constitutes copyright infringement, that ChatGPT can regurgitate Times content nearly verbatim, and that the fair-use defense doesn’t apply at the scale of training a commercial model (initial complaint, NYT). The case remains unresolved as of 2026; how it resolves will change what training data labs in US jurisdiction can legally use.
Licensed-data deals have become the dominant compliance story for the major labs. OpenAI signed publisher deals with Axel Springer (Dec 2023), the Financial Times (Apr 2024), News Corp (May 2024), and Vox Media (May 2024); Google has comparable arrangements with Reddit and others. These are typically non-transferable: the lab’s license to train on the corpus does not extend to downstream users of the resulting model.
Regional regimes diverge. The EU AI Act, in force from August 2024, requires training-data summaries for general-purpose AI models placed on the EU market (EU AI Act, Recital 107). The US has no federal training-data disclosure requirement; what exists is a patchwork of state-level rules and the active NYT litigation. China’s rules require training data to comply with the country’s content policies, but don’t require disclosure.
The OSAID definition fight
The Open Source AI Definition v1.0 from OSI, finalized October 2024, is the first formal attempt to say what counts as “open source AI” (OSAID 1.0 announcement). The text requires three things: the weights, the training code, and “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system” — not the data itself.
That clause is the entire fight. The pro-OSAID-v1.0 reading is that requiring full data release would exclude almost every commercially-trained model (because of the publisher deals above) and so kill the definition’s adoption. The critics’ reading is that “describe how to build it” without shipping the actual training data lets labs claim openness while keeping the part competitors actually need.
The labs that ship full data alongside weights (AI2’s OLMo, the EleutherAI work on the Common Pile) make a clean case for the strict reading. The labs that ship weights only with detailed docs (Meta’s Llama, Mistral’s open models) make a clean case for the v1.0 reading. The 2026 OSAID revision cycle is the operational moment when this argument resolves one way or the other.
What’s open and what isn’t
The matrix:
- Open weights AND open data AND open training code: OLMo, Pythia (legacy), a handful of EleutherAI releases. The strict “open source AI” definition.
- Open weights, partial / no data: Llama, Mistral, Gemma, Qwen, DeepSeek, Kimi, GLM. The “open-weights” category, which is what most users mean by “open AI” in practice.
- Open data, no model release: the dataset projects (Dolma, RedPajama, FineWeb, Common Pile) before anyone has trained on them.
- Closed weights, closed data: GPT-5, Claude 4, Gemini 3. The frontier-lab default.
The asymmetry: openness on weights is easy to ship (one file upload); openness on data is hard (terabytes, license review, publisher relationships). Almost no lab makes the second trade unless its funding source explicitly requires it (AI2, funded by the estate of Microsoft co-founder Paul Allen; EleutherAI, funded by grants).
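The matrix above can be read as a lookup on three yes/no properties. The sketch below encodes it that way; the `Release` type, its field names, and the category strings are hypothetical labels invented for illustration, and the example flags simply restate where this page places each release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    name: str
    open_weights: bool
    open_data: bool
    open_training_code: bool


def classify(r: Release) -> str:
    """Map a release onto the four rows of the matrix.

    Combinations the matrix does not list (e.g. open weights and data but
    no training code) fall into the nearest weaker category.
    """
    if r.open_weights and r.open_data and r.open_training_code:
        return "fully open (weights, data, training code)"
    if r.open_weights:
        return "open weights, partial or no data"
    if r.open_data:
        return "open data, no model release"
    return "closed weights, closed data"


examples = [
    Release("OLMo", True, True, True),
    Release("Llama", True, False, False),
    Release("FineWeb", False, True, False),  # a dataset project: no model to release
    Release("GPT-5", False, False, False),
]
labels = {r.name: classify(r) for r in examples}
```

Checking weights before data mirrors the asymmetry: a release with open weights is placed in the open-weights row regardless of how much of its data is disclosed, unless everything else is open too.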
The editorial tension
The training-data question is where the rest of the openness debate ultimately resolves. A model whose weights are open but whose data is unknown can be run, fine-tuned, and inspected; it cannot be re-trained, audited for bias-by-omission, or defended against a “we trained on stolen IP” claim. The sovereignty argument for full data release is that without it, you depend on the lab’s word about what went in.
The argument against requiring full data release is economic: the major labs spent billions on the publisher deals that made their training corpora legally clean, and forcing disclosure of those corpora would invalidate the deals. A definition that nobody at frontier scale meets is a definition that doesn’t shape the market. OSAID v1.0 made the pragmatic call. Whether that call holds through the 2026 revision is what the next year decides.
Key terms for this layer
- BPE
A subword tokenization algorithm that iteratively merges the most-frequent byte pairs in a corpus, producing a vocabulary that balances common-word coverage with arbitrary-text fallback.
- Common Crawl
A nonprofit-run repeated crawl of the public web maintained since 2007, the upstream raw source for nearly every open web-scale pretraining corpus.
- FineWeb
An open large-scale web text dataset from Hugging Face, widely regarded as among the highest-quality openly licensed pretraining corpora from 2024 to 2026, with ~15 trillion tokens after deduplication and filtering.
- RedPajama
An early open reproduction of the Llama 1 pretraining corpus from Together AI (2023), now superseded by FineWeb and Dolma but historically important as the first open frontier-scale dataset.
- The Pile
An 825 GB diverse-source pretraining dataset assembled by EleutherAI in 2020, the open-corpus precedent that the later RedPajama and FineWeb projects expanded on.
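The BPE entry in this list describes an iterative merge loop, which is easier to see in code. This is a minimal sketch of merge learning on a toy corpus. For readability it works on characters within whitespace-separated words rather than raw bytes, and it omits everything a production tokenizer adds (byte fallback, special tokens, pre-tokenization rules). All function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the ordered list of merges learned from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break  # every word is already a single symbol
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


merges = learn_bpe("low low low lower lowest", num_merges=2)
```

On this toy corpus the first two merges build the shared stem: `l` + `o`, then `lo` + `w`. Frequent substrings become single vocabulary entries while rare text still decomposes into smaller symbols.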