Common Pile v0.1 is an 8TB pretraining corpus released in June 2025 by EleutherAI and around a dozen partner institutions, drawing on 30 source datasets. Its distinguishing feature is license-cleanness: every text source in the corpus is either in the public domain, explicitly under a permissive license such as Creative Commons, or has documented permission for redistribution and retraining. EleutherAI also trained two reference models on it (the Comma 7B variants) to demonstrate that the corpus is usable.

Common Pile matters because the open-source AI conversation hinges on whether a competitive model can be trained without relying on web-scraped content of ambiguous license status. Common Pile is far smaller than FineWeb (15T tokens, but not license-clean) and roughly ten times the size of The Pile (EleutherAI's original 825GB corpus, 2020); unlike either, it comes with a credible license claim. This directly addresses the contested data clause of OSAID v1.0: under the strictest reading of "open source AI," only models trained on license-clean data can qualify.

Production-readiness: the corpus is shipping and is used by the reference Comma 7B models, but no frontier-class model has been trained on Common Pile at scale yet (as of this entry's update). The honest ceiling: 8TB is a reasonable substrate for 7-30B-scale models, but pushing to 70B+ would require either a much larger v1.0 or mixing in non-license-clean sources. Common Pile is the existence proof; whether it scales to frontier models remains the open question.
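The ceiling argument can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the figure of roughly 4 bytes per token is an assumption (it is not stated in this entry and varies by tokenizer and language), 8TB is read as 8 × 10^12 bytes, and the function name `tokens_per_parameter` is hypothetical. It estimates how many unique tokens a single pass over the corpus offers per model parameter.

```python
# Rough, illustrative estimate only.
# BYTES_PER_TOKEN is an assumed average (about 4 bytes per token for
# English text); real tokenizers and real corpora will differ.
CORPUS_BYTES = 8e12      # "8TB" from the entry, read as 8 * 10^12 bytes
BYTES_PER_TOKEN = 4.0    # assumption, not a figure from the source


def tokens_per_parameter(params: float,
                         corpus_bytes: float = CORPUS_BYTES,
                         bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Unique tokens available per model parameter in one pass over the corpus."""
    corpus_tokens = corpus_bytes / bytes_per_token
    return corpus_tokens / params


if __name__ == "__main__":
    for billions in (7, 30, 70):
        ratio = tokens_per_parameter(billions * 1e9)
        print(f"{billions}B params: ~{ratio:.0f} unique tokens per parameter")
```

Under these assumptions the corpus holds about 2T tokens, and the ratio falls from roughly 286 tokens per parameter at 7B to about 67 at 30B and about 29 at 70B. For reference, the Chinchilla scaling analysis suggests on the order of 20 tokens per parameter as compute-optimal, and many recent models train well beyond that, which is consistent with the entry's view that a single-epoch 70B+ run would be data-constrained.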
Common Pile v0.1
EleutherAI's June 2025 license-clean pretraining corpus; one of few entries doing this work at scale.
Sources
- Common Pile v0.1 Announcement (EleutherAI) https://blog.eleuther.ai/common-pile/
- Common Pile on HuggingFace https://huggingface.co/blog/stellaathena/common-pile
- OSAID v1.0 (data clause) https://opensource.org/ai/open-source-ai-definition
- blog.eleuther.ai (audit-verified) https://blog.eleuther.ai/common-pile/
Other projects at the Data layer
6 siblings · ordered open first
- Common Crawl · Open source
Non-profit, petabyte-scale monthly web crawl running since 2008; the substrate underneath nearly every open AI corpus.
- FineWeb / FineWeb-Edu · Open source
HuggingFace's 15T-token filtered Common Crawl; FineWeb-Edu is the higher-quality educational subset.
- Dolma · Open source
AI2's 3T-token training corpus; the substrate for OLMo; redistributable under the ODC-BY license (updated from ImpACT in April 2024).
- DataComp-LM (DCLM) · Open source
Both a benchmark and a corpus; the 'ship the filter, not just the bytes' approach to open data.
- RedPajama v1 / v2 · Open source
v1 is Together AI's open replication of the Llama 1 training mix; v2 reaches 30T tokens.
- The Pile · Open source
EleutherAI's 825GB open corpus from 2020; influential predecessor to all current open-corpus work.
Grants attributed
- Common Pile v0.1 · 2025-06 · Compute + collaboration
EleutherAI · funded by AI2
License-clean training dataset, with two reference models trained on it. The cleanest open-data exemplar besides Dolma.