The Open-Source AI Stack

Common Crawl

A nonprofit-run repeated crawl of the public web maintained since 2007, the upstream raw source for nearly every open web-scale pretraining corpus.

A continuously maintained crawl of the public web, packaged as WARC files in petabyte-scale monthly dumps. Common Crawl is the upstream raw source for FineWeb, RedPajama, C4, The Pile (partly), Dolma, and nearly every open large-scale web corpus.
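To make the WARC packaging concrete, here is a minimal sketch of a record reader for an uncompressed WARC stream, using only the Python standard library. The function names `iter_warc_records` and `make_record` and the sample records are illustrative assumptions, not part of any Common Crawl tooling. Real crawl files are gzip-compressed, and a maintained parser such as the `warcio` library is the usual choice in practice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration: it handles only the basic layout
    (version line, named headers, blank line, Content-Length bytes of payload).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # The payload length is given explicitly, so it may contain newlines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(warc_type, uri, body):
    """Build one synthetic WARC record (test data only, not real crawl output)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("response", "https://example.com/",
                b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>")
    + make_record("metadata", "https://example.com/", b"fetchTimeMs: 12")
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

A downstream pipeline would typically keep only the `response` records and pass their payloads to HTML text extraction.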

The Common Crawl Foundation is a US nonprofit. The data is offered under permissive terms for research and commercial use, and the AWS Open Data program hosts the public bucket. Crawl-quality filtering, language identification, deduplication, and curriculum design are the work each downstream dataset adds on top.
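As a rough illustration of that downstream work, the sketch below combines a crude length filter with exact deduplication by content hash. The function names, the `min_words` threshold, and the sample documents are assumptions made for this example. This is not how any particular dataset does it, and production pipelines generally add fuzzier near-duplicate detection and language identification on top.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_exact(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalization).

    `min_words` is an arbitrary illustrative threshold, not a published value.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept a document with the same content
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Click here",                                      # too short
    "A second page with enough distinct words to keep.",
]

kept = dedupe_exact(docs)
print(len(kept), "of", len(docs), "documents kept")
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which is the main reason to store digests instead of the text itself.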

The dataset is contested. Copyright lawsuits against AI labs increasingly invoke Common Crawl content as evidence; the Foundation’s response has been to clarify that hosting public web content is not training authorization. The legal status of pretraining on Common Crawl remains unsettled through 2026.
