Why people cared
Mixtral 8x7B was the first MoE release that the broader open-weights community could actually deploy. Earlier MoE work (Switch Transformer, GLaM) had stayed inside Google; Mixtral arrived as Apache-2.0 weights and a paper with enough detail for production inference engines (vLLM, llama.cpp, TGI) to add MoE support within weeks. The architecture is eight Mistral 7B experts behind a router that activates two per token, giving 47B total parameters but only ~13B active per forward pass. The economic story that mattered was that this configuration matched or beat dense 70B-class models on most benchmarks at substantially lower inference cost. Mixtral established the pattern that DeepSeek V3, Qwen 3, and eventually Llama 4 all followed: when a lab cares about cost-per-token at deployment scale more than parameter-count bragging rights, MoE is the obvious choice. The Apache-2.0 license also mattered because Mistral's later flagship releases moved progressively toward API-only and source-available terms, making 8x7B the last clean reference point for a fully-open MoE from a European lab.
Architecture
data/models.yaml. Every label is auditable
against the model's sources.
Specs
- Architecture
- moe
- Total params
- 46.7B
- Active params
- 12.9B
- Experts
- 8 total · 2 active
- Context window
- 33K tokens
- Attention
- gqa
- Position encoding
- rope
- Post-training
- sft, dpo
- OSI-approved
- yes
- Data released
- no
- Training code
- not released
Benchmarks
Each score carries the date it was published; we never infer or interpolate missing scores.
Recommended use cases
- MoE deployment reference
- dense 70B-class quality at 13B active cost
Available quantizations
Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.
Notable innovations
- · First widely-deployed open-weights MoE
- · Apache-2.0
Known limitations
- · 47B total parameters require enough memory to hold all eight experts even though only 2 are active per token. source ↗
Lineage
Shares Mistral 7B architecture, with eight experts under one router; 2 active per token.