FineWeb is HuggingFace's open pretraining dataset: roughly 15 trillion tokens obtained by filtering Common Crawl through a series of quality filters and deduplication steps. It is released under ODC-By 1.0 (the Open Data Commons Attribution License). FineWeb-Edu is a higher-quality educational subset selected with an LLM-trained classifier; it is smaller (~1.3T tokens) but reliably stronger for training small-to-mid-size models. FineWeb 2 (late 2024) extended the work to multilingual coverage across ~1000 languages.

FineWeb matters because data filtering is where the open-data ecosystem has most visibly caught up with closed-data labs. The released ablations show that careful filtering of Common Crawl can produce a corpus that trains models competitive with what the closed labs publish, at the sizes where a comparison is possible.

Compared to siblings: The Pile (foundational, smaller), Dolma (AI2, OLMo's training corpus), DCLM (open competitive benchmark for filtering methodology), Common Pile (license-clean angle), RedPajama (open Llama 1 replication corpus). FineWeb is the largest of the well-documented open corpora and the one that modern open-weights training runs most often start from.

It is production-ready and used by many open-weights training projects in 2024-2026. The catch: it is not license-clean (ODC-By does not resolve the underlying copyright question for web-scraped content), so it does not satisfy the strictest reading of open-source AI. For that, see Common Pile.
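The filter-then-deduplicate idea described above can be illustrated with a minimal Python sketch. This is not the FineWeb pipeline: the scorer `toy_score`, its vocabulary, the threshold value, and the helper name `filter_and_dedup` are all hypothetical, and exact hash-based deduplication stands in here for the fuzzy deduplication a real web-scale pipeline would need.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def filter_and_dedup(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield documents scoring at or above `threshold`, skipping exact duplicates."""
    seen = set()
    for text in docs:
        # Exact-match dedup on whitespace- and case-normalized text.
        # Assumption: a simplified stand-in for fuzzy (near-duplicate) dedup.
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        # Keep only documents the quality scorer rates highly enough.
        if score_fn(text) >= threshold:
            yield text


def toy_score(text: str) -> float:
    """Hypothetical scorer: fraction of words found in a tiny 'educational' vocabulary."""
    vocab = {"theorem", "proof", "equation", "photosynthesis", "algorithm"}
    words = text.lower().split()
    return sum(w.strip(".,") in vocab for w in words) / max(len(words), 1)


docs = [
    "The proof of the theorem uses an algorithm.",
    "the proof of the  theorem uses an algorithm.",  # duplicate after normalization
    "Buy cheap watches now!!!",
]
kept = list(filter_and_dedup(docs, toy_score, threshold=0.2))
print(kept)
```

In this toy run only the first document survives: the second is dropped as a duplicate and the third falls below the threshold. In practice the scorer would be a trained model rather than a word list, and the threshold would be chosen by ablation.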
FineWeb / FineWeb-Edu
HuggingFace 15T-token filtered Common Crawl; FineWeb-Edu is the higher-quality educational subset.
Sources
- FineWeb on HuggingFace https://huggingface.co/datasets/HuggingFaceFW/fineweb
- FineWeb Blog Post (HuggingFace) https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- FineWeb-Edu Dataset https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
- FineWeb 2 Multilingual https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Other projects at the Data layer
6 siblings · ordered open first
- Common Crawl Open source
Non-profit petabyte-scale monthly web crawl since 2008; the substrate underneath nearly every open AI corpus.
- Dolma Open source
AI2's 3T-token training corpus; the substrate for OLMo; redistributable under ODC-BY license (updated from ImpACT in April 2024).
- DataComp-LM (DCLM) Open source
Both a benchmark and a corpus; the 'ship the filter, not just the bytes' approach to open data.
- RedPajama v1 / v2 Open source
Together AI's open replication of the Llama 1 training mix; v2 reaches 30T tokens.
- The Pile Open source
EleutherAI's 2020 825GB open corpus; influential predecessor to all current open-corpus work.
- Common Pile v0.1 Open source
EleutherAI's June 2025 license-clean pretraining corpus; one of few entries doing this work at scale.