The Open-Source AI Stack
Mixtral

Mistral AI's MoE model line, with Mixtral 8x7B (the first widely-adopted open mixture-of-experts model) and the larger Mixtral 8x22B as its two flagship releases.

The mixture-of-experts variant of Mistral. Mixtral 8x7B (December 2023) was the first practical open-weights MoE model, with 8 experts per feed-forward layer and 2 active per token. Because the experts share the attention layers, the total is roughly 47B parameters rather than 8 x 7B, with about 13B active per token. It outperformed Llama 2 70B on most benchmarks at a fraction of the inference cost. Mixtral 8x22B (April 2024) scaled the recipe.
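The active-versus-total parameter figures and the 2-of-8 routing can be illustrated with a short sketch. The layer dimensions below are assumptions taken from the publicly released Mixtral 8x7B configuration, not from this entry, and `top_k_route` and `param_counts` are hypothetical helper names. The router shown (pick the top 2 logits, then softmax over just those) is a simplified stand-in for a real MoE layer, not a reference implementation.

```python
import math

# Assumed Mixtral 8x7B dimensions (not stated in this glossary entry).
HIDDEN = 4096    # model width
FFN = 14336      # width of each expert's feed-forward block
LAYERS = 32
EXPERTS = 8      # experts per layer
ACTIVE = 2       # experts used per token
KV_DIM = 1024    # key/value projection width (grouped-query attention)
VOCAB = 32000


def param_counts():
    """Return (total, active-per-token) parameter counts, ignoring norm weights."""
    expert = 3 * HIDDEN * FFN                               # gate, up, down projections
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o plus k, v
    router = HIDDEN * EXPERTS                               # one logit per expert
    embeddings = 2 * VOCAB * HIDDEN                         # input table and output head
    shared = LAYERS * (attention + router) + embeddings
    total = shared + LAYERS * EXPERTS * expert
    active = shared + LAYERS * ACTIVE * expert
    return total, active


def top_k_route(router_logits, k=ACTIVE):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits))


if __name__ == "__main__":
    total, active = param_counts()
    print(f"total:  {total / 1e9:.1f}B parameters")
    print(f"active: {active / 1e9:.1f}B parameters per token")

    # Toy scalar "experts" standing in for feed-forward blocks.
    toy_experts = [lambda x, s=s: s * x for s in range(EXPERTS)]
    logits = [0.1, 2.0, -1.0, 1.0, 0.0, 0.5, -0.3, 0.2]
    print(top_k_route(logits))
    print(moe_forward(1.0, logits, toy_experts))
```

Under these assumed dimensions the count comes to about 46.7B total and 12.9B active per token, which is where the "13B out of 47B" figure comes from: the attention, router, and embedding weights are shared, and only the expert feed-forward blocks are multiplied by 8 in storage and by 2 in per-token compute.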

The release proved that open mixture-of-experts models could match dense baselines at lower serving cost, and it changed runtime engine roadmaps: vLLM, SGLang, and TensorRT-LLM all rushed to add expert-parallel inference support. Both releases are Apache 2.0 licensed.
