EleutherAI released Common Pile v0.1 in June 2025: an 8 TB pretraining corpus assembled exclusively from works whose licenses permit redistribution, modification, and reuse for any purpose. The corpus draws from 30 sources, including roughly 300,000 public-domain books from the Library of Congress and the Internet Archive, the openly licensed subset of Stack v2 (code), arXiv papers, PubMed abstracts, and other curated openly licensed text. The acceptance criterion is the Open Knowledge Foundation's Open Definition, which excludes both proprietary and ambiguously licensed content; works whose license status could not be confirmed were dropped.
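The acceptance rule described above (keep a work only if its license is confirmed and open, drop everything else) can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual pipeline: the Document record, the OPEN_LICENSES allowlist, and the license identifiers in it are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Hypothetical allowlist for illustration only; the real project verified
# licenses source by source rather than matching against a fixed set.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT", "Apache-2.0"}


@dataclass(frozen=True)
class Document:
    """Hypothetical record: one candidate work and its license metadata."""
    doc_id: str
    text: str
    license_id: Optional[str]  # None means the license could not be confirmed


def is_openly_licensed(doc: Document) -> bool:
    # Unconfirmed (None) and non-allowlisted licenses are both rejected.
    return doc.license_id is not None and doc.license_id in OPEN_LICENSES


def filter_open(docs: Iterable[Document]) -> Iterator[Document]:
    """Yield only the documents that pass the license check."""
    for doc in docs:
        if is_openly_licensed(doc):
            yield doc


sample = [
    Document("a", "public-domain style text", "CC0-1.0"),
    Document("b", "all rights reserved text", "Proprietary"),
    Document("c", "text of unknown origin", None),
    Document("d", "permissively licensed code", "MIT"),
]
kept_ids = [d.doc_id for d in filter_open(sample)]
```

The design choice worth noting is that the check fails closed: a missing license is treated the same as a disallowed one, which mirrors the decision to drop works whose status could not be confirmed.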
The team trained two 7B-parameter reference models, Comma v0.1-1T and Comma v0.1-2T, on 1 trillion and 2 trillion tokens respectively. Comma reportedly outperforms models trained on KL3M, OLC, and Common Corpus, matches models trained on The Pile and OSCAR, and trails FineWeb-trained models. The reference runs test whether a license-clean corpus can produce models comparable to those trained on scraped or contested data. The result is that it can, under the same training regime, though a measurable but not catastrophic gap remains to the best web-scraped corpora.
The project was a multi-institution collaboration. EleutherAI led the work in partnership with Poolside, Hugging Face, and the US Library of Congress, with Ai2 among the partnering institutions providing compute and collaboration. The arXiv paper (Kandpal et al., 2506.05209) documents the source-by-source license verification process, which is the part that is hard to replicate and the reason a license-clean corpus at this scale was not previously available.
Common Pile and Ai2's Dolma are the two largest license-clean pretraining corpora available to outside teams. For groups that need to defend the provenance of their training data, whether for EU AI Act compliance, US copyright litigation exposure, or downstream commercial licensing, these two corpora are the practical floor. The alternative is to train on scraped web data and hope discovery does not reach the training set, which is the bet most commercial labs are currently making.
Recipient
EleutherAI
Funder
Allen Institute for AI (Ai2) · foundation · US
Funder-and-builder. Young Investigator Program, AI2 Incubator. Builds OLMo / Tülu / Molmo (the only major fully-open model families).