FineWeb / FineWeb-Edu · The Open-Source AI Stack

FineWeb is Hugging Face's open pretraining dataset: roughly 15 trillion tokens obtained by running Common Crawl through a series of quality filters and deduplication steps. It is released under ODC-By 1.0 (the Open Data Commons Attribution License). FineWeb-Edu is a higher-quality educational subset selected with a classifier trained on LLM-generated annotations; it is smaller (~1.3T tokens) but reliably stronger for training small-to-mid-sized models. FineWeb 2 (late 2024) extended the work to multilingual coverage across ~1000 languages.

FineWeb matters because data filtering is where the open-data ecosystem has most visibly caught up with the closed-data labs. The released ablations show that careful filtering of Common Crawl can produce a corpus that trains models competitive with what the closed labs publish, at the sizes where a comparison is possible.

Compared to its siblings: The Pile (foundational, smaller), Dolma (AI2, OLMo's training corpus), DCLM (an open competitive benchmark for filtering methodology), Common Pile (the license-clean angle), and RedPajama (an open replication of the Llama 1 corpus). FineWeb is the largest of the well-documented open corpora and the one that modern open-weights training runs most often start from. It is production-ready and has been used by many open-weights training projects in 2024-2026.

The catch: it is not license-clean. ODC-By covers the dataset as a compilation but does not resolve the underlying copyright question for web-scraped content, so FineWeb does not satisfy the strictest reading of open-source AI. For that, see Common Pile.
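To make the "filter, then deduplicate" idea above concrete, here is a minimal Python sketch. It is not FineWeb's actual pipeline: the heuristic, its thresholds, and the function names are all hypothetical, and exact-hash deduplication is used as a simpler stand-in for whatever near-duplicate method a production pipeline would use.

```python
import hashlib
import re


def looks_reasonable(text, min_words=5, max_symbol_ratio=0.3):
    """Toy quality heuristic (hypothetical thresholds, for illustration only)."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be useful training text
    # Count characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_and_dedup(docs):
    """Keep documents that pass the heuristic, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_reasonable(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized text was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The mitochondria is the organelle that produces most of a cell's energy.",
    "the  mitochondria is the organelle that produces most of a cell's energy.",
    "$$$ !!! ###",
    "Buy now",
    "Common Crawl publishes a new snapshot of the public web roughly every month.",
]
kept = filter_and_dedup(docs)
```

In this toy run, the second document is dropped as a duplicate of the first after normalization, and the third and fourth are dropped for being too short. Real pipelines differ mainly in scale and in how carefully each filter is validated with ablations.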

Sources

FineWeb on HuggingFace https://huggingface.co/datasets/HuggingFaceFW/fineweb

FineWeb Blog Post (HuggingFace) https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1

FineWeb-Edu Dataset https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

FineWeb 2 Multilingual https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
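
As a rough sketch of how the dataset repositories listed above might be read without downloading them in full, the Python below streams records with the Hugging Face `datasets` library. The config name `sample-10BT` and the record fields `text` and `url` are assumptions to check against the dataset card; `stream_sample` and `preview` are hypothetical helper names.

```python
from itertools import islice


def stream_sample(repo_id="HuggingFaceFW/fineweb", config="sample-10BT"):
    """Return a streaming iterator over a dataset sample (needs network access).

    The default config name is an assumption; check the dataset card.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(repo_id, name=config, split="train", streaming=True)


def preview(records, n=3, width=80):
    """Return the first `n` records as (url, truncated text) pairs."""
    return [(r.get("url", ""), r["text"][:width]) for r in islice(records, n)]
```

Calling `preview(stream_sample())` would fetch only the first few records. Passing `repo_id="HuggingFaceFW/fineweb-edu"` should work the same way, assuming that repository exposes a config of the same name.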

Other projects at the Data layer

6 siblings · ordered open first

Common Crawl Open source

Non-profit petabyte-scale monthly web crawl since 2008; the substrate underneath nearly every open AI corpus.

Dolma Open source

AI2's 3T-token training corpus; the substrate for OLMo; redistributable under ODC-By (changed from the AI2 ImpACT license in April 2024).

DataComp-LM (DCLM) Open source

Both a benchmark and a corpus; the 'ship the filter, not just the bytes' approach to open data.

RedPajama v1 / v2 Open source

Together AI's open replication of the Llama 1 training mix; v2 reaches 30T tokens.

The Pile Open source

EleutherAI's 825 GB open corpus from 2020; an influential predecessor to all current open-corpus work.

Common Pile v0.1 Open source

EleutherAI's June 2025 license-clean pretraining corpus; one of the few entries doing this work at scale.