The Open-Source AI Stack


RedPajama

An early open reproduction of the Llama 1 pretraining corpus from Together AI (2023), now superseded by FineWeb and Dolma but historically important as the first open frontier-scale dataset.

Together AI’s April 2023 attempt to reproduce the Llama 1 training corpus as an open release. RedPajama v1 (1.2 trillion tokens) combined Common Crawl, C4, Wikipedia, books, ArXiv, GitHub, and StackExchange slices in the proportions documented in the Llama 1 paper. RedPajama v2 (October 2023) scaled to 30 trillion filtered and deduplicated tokens with additional quality scoring.
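As a rough illustration of what mixing slices "in the proportions documented in the Llama 1 paper" can look like, the sketch below draws source names with probability proportional to a mixture table. The percentages are the sampling proportions as reported in Table 1 of the Llama 1 paper; they are not stated in this entry, so check them against the paper before relying on them. The dictionary keys and the `sample_sources` helper are hypothetical names chosen for this example, not part of any RedPajama tooling.

```python
import random

# Sampling proportions (percent) as reported in Table 1 of the Llama 1 paper.
# Assumption: values come from the paper, not from this glossary entry.
MIX = {
    "common_crawl": 67.0,
    "c4": 15.0,
    "github": 4.5,
    "wikipedia": 4.5,
    "books": 4.5,
    "arxiv": 2.5,
    "stackexchange": 2.0,
}


def sample_sources(n, seed=0):
    """Return n source names drawn in proportion to MIX (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same draws
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    draws = sample_sources(10_000)
    for name in MIX:
        share = 100 * draws.count(name) / len(draws)
        print(f"{name:14s} target {MIX[name]:5.1f}%  sampled {share:5.1f}%")
```

Sampling by source like this is only one way to realize a mixture; a real pipeline would more likely weight by token counts per slice and handle multiple epochs over the smaller sources.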

The corpus opened the path for open-recipe pretraining. The data itself is no longer the leading choice (FineWeb has supplanted it for new training runs), but RedPajama proved out the feasibility of community-scale open pretraining data.
