Glossary

FineWeb

An open large-scale web text dataset from Hugging Face, widely regarded as one of the highest-quality permissively licensed pretraining corpora of 2024 to 2026, with ~15 trillion tokens after deduplication and filtering.

Data also: Training

FineWeb is the pretraining dataset that displaced RedPajama and The Pile as the default open-corpus choice in 2024. It is built from Common Crawl with extensive quality filtering (language identification, deduplication, quality classifiers trained on edu-like content), and the FineWeb-Edu subset further filters for educational content.
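To make the three filtering stages concrete, here is a minimal standard-library sketch of a language check, a quality heuristic, and deduplication applied in sequence. It is an illustration under loud assumptions, not the FineWeb pipeline or the datatrove API: the function names (`looks_english`, `passes_quality`, `filter_corpus`) and all thresholds are hypothetical, the ASCII-ratio test stands in for a real language-identification model, the symbol-ratio rule stands in for real quality filters or classifiers, and exact hashing stands in for whatever deduplication method the real pipeline uses.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude stand-in for a language-ID model: share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= min_ascii_ratio


def passes_quality(text: str, min_words: int = 5, max_symbol_ratio: float = 0.1) -> bool:
    """Hypothetical heuristic: enough words, not dominated by punctuation/symbols."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text) <= max_symbol_ratio


def filter_corpus(docs):
    """Apply language check, quality check, then exact deduplication, in that order."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_english(doc) or not passes_quality(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical document (after normalization) was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        "The mitochondria is the organelle that produces energy for the cell.",
        "the mitochondria is  the organelle that produces energy for the cell.",
        "Buy now!!! $$$ >>> click <<< ###",
        "Too short.",
    ]
    for doc in filter_corpus(sample):
        print(doc)
```

Running the filters before hashing keeps the `seen` set small, since rejected documents never need a digest. A production pipeline would replace each stage with a stronger component while keeping the same overall shape.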

The release is permissively licensed and includes the full preprocessing pipeline (Hugging Face's datatrove library), so teams can reproduce or modify the filtering. OLMo 2, several open-research models, and many community fine-tunes train on FineWeb derivatives.

Full coverage at /projects/fineweb.

Sources

FineWeb: decanting the web for the finest text data at scale (Hugging Face)
