Glossary

dense

A transformer where every parameter activates on every token; the conventional architecture before mixture of experts became common at frontier scale.

Category: Weights (also: Training, Runtime). Also known as: dense transformer, dense model.

A transformer where every parameter in every layer participates in every forward pass. The feed-forward block is a single MLP, the attention heads see the full hidden state, and no routing or sparsity is involved. This is the architecture introduced in the original 2017 transformer paper, and it was the default for open-weights frontier models from Llama 1 (Feb 2023) through Llama 3.3 (Dec 2024).
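As a rough illustration of "every parameter participates in every forward pass", the sketch below implements a toy dense feed-forward block in plain Python. The dimensions, the ReLU activation, and the helper names (`dense_ffn`, `make_weights`) are assumptions chosen for readability, not taken from any particular model.

```python
import random


def make_weights(n_out, n_in, rng):
    # One row of n_in weights per output unit (hypothetical toy initializer).
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense_ffn(x, w_in, w_out):
    """Single-MLP feed-forward block: ReLU(W_in x), then W_out. No routing."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


rng = random.Random(0)
d_model, d_ff = 8, 32  # toy sizes, assumed for illustration
w_in = make_weights(d_ff, d_model, rng)
w_out = make_weights(d_model, d_ff, rng)

total_params = d_ff * d_model + d_model * d_ff

token = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
out = dense_ffn(token, w_in, w_out)

# dense_ffn iterates over every row of both matrices, so the number of
# weights read for one token equals the total number of weights.
params_used_per_token = sum(len(row) for row in w_in) + sum(len(row) for row in w_out)
```

In a mixture-of-experts layer, by contrast, a router would select a subset of several such blocks per token, so the weights read per token would be smaller than the total.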

The economic property that defines dense models is that capacity and inference cost scale together. A 70B dense model uses 70B parameters per forward pass; a 405B dense model uses 405B. There is no way to add capacity without paying for it at every token. Mixture of experts (MoE) breaks that one-to-one coupling, which is why most 2025 frontier open-weights releases (DeepSeek V3, Qwen 3 235B, Llama 4) moved to MoE for the largest sizes while keeping dense variants at the 8B and 70B class.
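The coupling between capacity and cost can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass (attention cost over the context is ignored). The MoE entry is a hypothetical configuration, not a real release.

```python
B = 10**9  # one billion


def forward_flops_per_token(active_params):
    # Rule of thumb (an approximation): one multiply and one add
    # per active parameter per token.
    return 2 * active_params


def describe(name, total_params, active_params):
    return {
        "name": name,
        "total_params": total_params,
        "active_params": active_params,
        "flops_per_token": forward_flops_per_token(active_params),
        # Ratio of stored capacity to per-token compute; 1.0 for dense models.
        "total_to_active": total_params / active_params,
    }


# Dense: active parameters equal total parameters.
dense_70b = describe("dense-70B", 70 * B, 70 * B)
dense_405b = describe("dense-405B", 405 * B, 405 * B)

# Hypothetical MoE sized only for illustration: 400B stored, 20B active per token.
moe = describe("hypothetical-moe", 400 * B, 20 * B)

for model in (dense_70b, dense_405b, moe):
    print(
        f"{model['name']:>18}: "
        f"{model['flops_per_token'] / 1e9:,.0f} GFLOPs/token, "
        f"total/active = {model['total_to_active']:.1f}x"
    )
```

Under these assumptions, the hypothetical MoE stores about as many parameters as the 405B dense model but spends less compute per token than the 70B dense model, which is the decoupling the paragraph above describes.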

Dense models are still the default at the small end (under ~30B) and for fine-tuning bases, because the routing overhead of MoE only pays off at scale and dense architectures are simpler to quantize, serve on a single GPU, and fine-tune with techniques like LoRA. Llama 3.3 70B, Qwen 2.5 72B, Phi-4, Gemma 2 27B, and OLMo 2 13B are all dense releases from the 2024 to 2025 window.
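To see why quantization and single-GPU serving matter for dense models in this size range, it helps to estimate the memory needed just to hold the weights. The sketch below is a simplification: it counts only parameters times bits per parameter, and ignores the KV cache, activations, and any quantization metadata. The sizes listed are round numbers picked from the classes mentioned above.

```python
B = 10**9  # one billion


def weight_memory_gb(params, bits_per_param):
    # Weights only: parameters * bits / 8 bytes, reported in decimal gigabytes.
    return params * bits_per_param / 8 / 1e9


sizes = {"13B": 13 * B, "27B": 27 * B, "70B": 70 * B}

# Memory for each model size at 16-bit, 8-bit, and 4-bit weights.
table = {
    name: {bits: weight_memory_gb(n, bits) for bits in (16, 8, 4)}
    for name, n in sizes.items()
}

for name, row in table.items():
    cells = ", ".join(f"{bits}-bit: {gb:.1f} GB" for bits, gb in row.items())
    print(f"{name:>4}  {cells}")
```

For a dense model, this weight footprint is also what every token's forward pass reads, so shrinking it through quantization reduces both storage and per-token memory traffic.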

The opposite end of the spectrum is the new generation of MoE releases where total parameters exceed active parameters by 10 to 30 times. Dense models remain the reference architecture against which MoE quality claims are measured.

