The Pile

An 825 GB diverse-source pretraining dataset assembled by EleutherAI in 2020, the open-corpus precedent that the later RedPajama and FineWeb projects expanded on.

Category: Data (also: Training). Also known as: Pile.

EleutherAI’s 2020 open pretraining corpus, mixing 22 sources including Common Crawl, PubMed Central, ArXiv, GitHub, StackExchange, Wikipedia, books, and email archives. The Pile was the first widely used open pretraining corpus for language models and the training data for the GPT-Neo and GPT-J releases.

Some constituent sub-datasets in The Pile have since been removed for copyright reasons (Books3 in particular). Pile-CC and the cleaner subsets remain in use. Newer corpora (RedPajama, FineWeb, Dolma) have displaced The Pile for new pretraining runs, but its curation choices and the original paper's documentation of source-mixture proportions set the template.

Sources

The Pile: An 825GB Dataset of Diverse Text for Language Modeling (Gao et al., 2020)
