The Open-Source AI Stack


FineWeb / FineWeb-Edu

Hugging Face's 15T-token corpus filtered from Common Crawl; FineWeb-Edu is the higher-quality educational subset.

ODC-By 1.0 · stable

FineWeb is Hugging Face's open pretraining dataset: roughly 15 trillion tokens obtained by filtering Common Crawl through a series of quality classifiers and deduplication steps. It is released under ODC-By 1.0 (the Open Data Commons Attribution License). FineWeb-Edu is a higher-quality educational subset filtered with an LLM classifier; it is smaller (~1.3T tokens) but reliably stronger for training small-to-mid-sized models. FineWeb 2 (late 2024) extended the work to multilingual coverage across ~1000 languages.

FineWeb matters because data filtering is where the open-data ecosystem has most visibly caught up with the closed-data labs. The released ablations show that careful filtering of Common Crawl can produce a corpus that trains models competitive with what the closed labs publish, at the sizes where a comparison is possible.

Compared to siblings: The Pile (foundational, smaller), Dolma (AI2, OLMo's training corpus), DCLM (open competitive benchmark for filtering methodology), Common Pile (license-clean angle), RedPajama (open Llama 1 replication corpus). FineWeb is the largest of the well-documented open corpora and the one that modern open-weights training runs most often start from.

It is production-ready and has been used by many open-weights training projects in 2024-2026. The catch: it is not license-clean. ODC-By covers the dataset as a compilation but does not resolve the underlying copyright question for web-scraped content, so FineWeb does not satisfy the strictest reading of open-source AI. For that, see Common Pile.
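The general idea of deduplicating and then filtering web text by a quality score can be sketched in a few lines. This is a toy illustration only, not FineWeb's actual pipeline: `toy_quality_score` is a hypothetical stand-in for a learned classifier, the exact-hash dedup stands in for whatever deduplication a real pipeline uses, and the threshold and the order of the two steps are arbitrary assumptions.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a learned quality classifier.

    Returns the fraction of words with six or more characters.
    A real pipeline would call a trained model here instead.
    """
    words = text.split()
    if not words:
        return 0.0
    long_words = sum(1 for w in words if len(w) >= 6)
    return long_words / len(words)


def filter_corpus(
    docs: Iterable[str],
    score_fn: Callable[[str], float] = toy_quality_score,
    threshold: float = 0.3,  # arbitrary value chosen for this sketch
) -> Iterator[str]:
    """Yield documents that are not exact duplicates and score at or above threshold."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if score_fn(doc) >= threshold:
            yield doc


docs = [
    "Photosynthesis converts sunlight into chemical energy inside chloroplasts.",
    "photosynthesis converts  sunlight into chemical energy inside chloroplasts.",
    "click here to win a big prize now",
    "",
]
kept = list(filter_corpus(docs))
print(len(kept))
```

Here the second document is dropped as a duplicate of the first, and the last two fall below the score threshold. Swapping `score_fn` for a real classifier is the only change the sketch anticipates; everything else about a production pipeline (near-duplicate detection, language identification, scale) is out of scope.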


