Glossary

pretraining

The first and most compute-expensive training phase, where a base model learns general capabilities by predicting the next token on trillions of words of web and book data.

Category: Training (also: Data, Weights). Also known as: pre-training.

The phase where a transformer learns to predict the next token over a large unlabeled text corpus. No instruction-following, no preferences, no special tasks: just minimizing cross-entropy loss over trillions of tokens of web text, books, code, and scientific papers. The model that falls out is a base model, useful only as a starting point for the post-training work that comes after.
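To make the objective concrete, here is a minimal sketch of next-token cross-entropy in plain Python. The function name `next_token_cross_entropy` and the toy logits are illustrative assumptions, not taken from any particular training codebase; real pretraining computes the same quantity on batched tensors in a framework.

```python
import math

def next_token_cross_entropy(logits_per_position, target_ids):
    """Mean cross-entropy (in nats) of the true next token at each position.

    logits_per_position: one list of unnormalized scores per position,
                         each of length vocab_size (hypothetical toy input).
    target_ids:          the index of the actual next token at each position.
    """
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log softmax(logits)[target] = log_z - logits[target]
        total += log_z - logits[target]
    return total / len(target_ids)

# Toy case: a 4-token vocabulary and a model that has learned nothing,
# so every token gets the same score at every position.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
targets = [1, 2, 3]
uniform_loss = next_token_cross_entropy(uniform_logits, targets)

# A model that strongly favors the correct token gets a much lower loss.
confident_logits = [[0.0, 10.0, 0.0, 0.0],
                    [0.0, 0.0, 10.0, 0.0],
                    [0.0, 0.0, 0.0, 10.0]]
confident_loss = next_token_cross_entropy(confident_logits, targets)
```

An untrained model that spreads probability evenly scores a loss of log(vocab_size); pretraining drives that number down by shifting probability toward the tokens that actually follow.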

Pretraining is where the compute lives. A frontier model spends tens of millions of dollars and weeks on thousands of GPUs in this phase. The Chinchilla scaling law (Hoffmann et al., 2022) found that the compute-optimal ratio is roughly 20 tokens per parameter, which guided the design of Llama, Mistral, and most open-weight families since.
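As a rough worked example of the 20-tokens-per-parameter ratio, the sketch below estimates a compute-optimal token budget for a given model size. The FLOPs estimate uses the common rule of thumb C ≈ 6 · N · D for dense transformers, which is an assumption added here and not something this entry states; the function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla ratio (Hoffmann et al., 2022)

def chinchilla_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param

def approx_training_flops(n_params, n_tokens):
    """Assumed rule of thumb: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

n_params = 70_000_000_000                            # a 70B-parameter model
n_tokens = chinchilla_optimal_tokens(n_params)       # 1.4 trillion tokens
flops = approx_training_flops(n_params, n_tokens)    # about 5.88e23 FLOPs

print(f"{n_tokens:.2e} tokens, {flops:.2e} FLOPs")
```

The ratio is a guideline for spending a fixed compute budget, not a hard limit: a model can be trained on more tokens than this if a smaller model is preferred at inference time.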

In the open-source ecosystem, pretraining is concentrated at a few labs with cluster access: Meta, Mistral, Qwen, DeepSeek, and AI2 (OLMo). Most downstream open work starts from these released checkpoints.
