Common Pile v0.1

Common Pile v0.1 is an 8TB pretraining corpus assembled by EleutherAI and around a dozen partner institutions from 30 source datasets, released in June 2025. Its distinguishing feature is license-cleanness: every text source in the corpus is either explicitly under a permissive license (Creative Commons, public domain) or has documented permission for redistribution and model training. EleutherAI also trained two reference models on it (Comma 7B variants) to demonstrate that the corpus is usable.

Common Pile matters because the open-source AI conversation hinges on whether a competitive model can be trained without relying on web-scraped content of ambiguous license status. Common Pile is smaller than FineWeb (15T tokens, but not license-clean) but considerably larger than The Pile (EleutherAI's original 825GB corpus, 2020), and it comes with a credible license claim. This directly addresses the contested clause in OSAID v1.0: under the strictest reading of "open source AI," only models trained on license-clean data can qualify.

Production readiness: the corpus is shipping and was used to train the reference Comma 7B models, but no frontier-class model has been trained on Common Pile at scale yet (as of this entry's update). The honest ceiling: 8TB is a reasonable substrate for 7-30B-scale models, but pushing to 70B+ would require either a much larger v1.0 or mixing in non-license-clean sources. Common Pile is the existence proof; whether it scales to the frontier remains the open question.
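The license-cleanness criterion described above (a permissive license, or documented permission) can be illustrated with a small sketch. This is not Common Pile's actual pipeline: the `Source` record, the `PERMISSIVE_LICENSES` allowlist, and the function names are all hypothetical, chosen only to show the either/or rule as code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical allowlist for illustration only; the licenses actually
# accepted by Common Pile are defined by its maintainers, not by this set.
PERMISSIVE_LICENSES = {"cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "public-domain"}


@dataclass(frozen=True)
class Source:
    """Hypothetical metadata record for one candidate text source."""
    name: str
    license: Optional[str] = None        # declared license identifier, if any
    documented_permission: bool = False  # explicit permission on file


def is_license_clean(source: Source) -> bool:
    """True if the source is permissively licensed OR has documented permission."""
    declared = (source.license or "").strip().lower()
    return declared in PERMISSIVE_LICENSES or source.documented_permission


def split_sources(sources: List[Source]) -> Tuple[List[Source], List[Source]]:
    """Partition sources into (kept, rejected), preserving input order."""
    kept = [s for s in sources if is_license_clean(s)]
    rejected = [s for s in sources if not is_license_clean(s)]
    return kept, rejected


if __name__ == "__main__":
    # Made-up candidate sources, not real Common Pile entries.
    candidates = [
        Source("gov-reports", "public-domain"),
        Source("partner-archive", None, documented_permission=True),
        Source("scraped-forum", "unknown"),
    ]
    kept, rejected = split_sources(candidates)
    print("kept:", [s.name for s in kept])
    print("rejected:", [s.name for s in rejected])
```

The point of the sketch is that a source with no recognized license is rejected by default; ambiguity is treated as a reason to exclude, which is the opposite of the usual web-scrape default.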

Other projects at the Data layer

6 siblings · ordered open first

Common Crawl Open source

Non-profit petabyte-scale monthly web crawl since 2008; the substrate underneath nearly every open AI corpus.

FineWeb / FineWeb-Edu Open source

Hugging Face's 15T-token filtered Common Crawl corpus; FineWeb-Edu is the higher-quality educational subset.

Dolma Open source

AI2's 3T-token training corpus; the substrate for OLMo; redistributable under the ODC-BY license (updated from ImpACT in April 2024).

DataComp-LM (DCLM) Open source

Both a benchmark and a corpus; the 'ship the filter, not just the bytes' approach to open data.

RedPajama v1 / v2 Open source

Together AI's open replication of the Llama 1 training mix; v2 reaches 30T tokens.

The Pile Open source

EleutherAI's 2020 825GB open corpus; influential predecessor to all current open-corpus work.
