Llama 4 Scout · Models · The Open-Source AI Stack

Cost

$0.17 / Mtok input

$0.66 / Mtok output

Together AI · as of 2026-05-21

via Artificial Analysis ↗
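As a rough illustration of what these prices mean per request, the sketch below multiplies token counts by the listed rates. The workload (a 100k-token prompt with a 2k-token answer) and the function name are hypothetical, and the estimate ignores any caching or batch discounts a provider may apply.

```python
# Prices quoted above (USD per million tokens, Together AI, as of 2026-05-21).
INPUT_USD_PER_MTOK = 0.17
OUTPUT_USD_PER_MTOK = 0.66


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical workload: a 100k-token prompt with a 2k-token answer.
cost = request_cost_usd(100_000, 2_000)
print(f"${cost:.4f}")
```

At these rates the long prompt dominates the bill, which is typical of the long-context workloads the model is positioned for.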

Speed

108.1 tok/sec output

561 ms TTFT

Together AI · as of 2026-05-21

via Artificial Analysis ↗
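A back-of-the-envelope way to read these two numbers together: total response time is roughly the time to first token plus the output length divided by the output speed. The sketch below assumes decoding speed stays constant for the whole response, which real deployments only approximate; the 500-token response length is a hypothetical example.

```python
# Figures quoted above (Together AI, as of 2026-05-21).
TTFT_SECONDS = 0.561        # 561 ms time to first token
OUTPUT_TOK_PER_SEC = 108.1  # output tokens per second


def estimated_latency_seconds(output_tokens: int) -> float:
    """Approximate end-to-end latency, assuming constant decode speed."""
    return TTFT_SECONDS + output_tokens / OUTPUT_TOK_PER_SEC


# Hypothetical response length of 500 tokens.
latency = estimated_latency_seconds(500)
print(f"{latency:.2f} s")
```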

Why people cared

Llama 4 Scout is Meta's first MoE release, and its mixed public reception became a story in itself. The headline features were a 10M-token context window (the longest of any frontier model at release) and native multimodal input, delivered alongside an architectural pivot that followed DeepSeek V3 in making MoE mainstream for open-weights frontier work. Independent evaluators reported three problems: the announced benchmark scores were not consistently reproducible in deployment; the 10M-token context was achievable in theory, but quality degraded at lengths well below the stated maximum; and the LMArena ranking featured in the launch material belonged to a chat-tuned variant that was not released as weights. The release remains historically significant as the moment Llama abandoned dense scaling, but the immediate developer narrative was that DeepSeek V3 had executed the open-weights MoE story more cleanly a few months earlier, a view reinforced when Qwen 3 arrived weeks later. Llama 4 Scout's lasting value depends on whether the architecture it shares with its Maverick and Behemoth siblings delivers on the long-context and multimodal promises in production.

Architecture

Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture: moe
Total params: 109B
Active params: 17B
Experts: 16 total · 1 active
Context window: 10.5M tokens
Attention: gqa
Position encoding: rope-llama3
Pretraining tokens: 40.0T
Training hardware: H100
Post-training: sft, dpo, rejection-sampling
OSI-approved: no
Data released: no
Training code: not released
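To make the "16 total · 1 active" and "17B active of 109B total" figures concrete, here is a minimal sketch of generic top-k expert routing with k = 1, plus the active-parameter fraction. It is not Meta's implementation: the router logits are random placeholders, the `route` helper is hypothetical, and details such as shared experts or the gating function are not specified in the specs above.

```python
import random

NUM_EXPERTS = 16         # "Experts: 16 total"
TOP_K = 1                # "1 active"
TOTAL_PARAMS_B = 109     # "Total params: 109B"
ACTIVE_PARAMS_B = 17     # "Active params: 17B"


def route(router_logits: list[float], top_k: int = TOP_K) -> list[int]:
    """Return the indices of the top_k experts with the highest router logits."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    return ranked[:top_k]


# Placeholder router scores for a single token (assumption: random values).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(logits)

# Share of the model's parameters that participate in one token's forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"token routed to expert {chosen[0]}; active fraction = {active_fraction:.1%}")
```

Only the selected expert's feed-forward weights run for a given token, which is why per-token compute tracks the 17B active figure while memory requirements track the 109B total.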

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

Code

LiveCodeBench	29.9	as of 2026-05-21	source ↗

Math

MATH	84.4	as of 2026-05-21	source ↗
AIME 2024	28.3	as of 2026-05-21	source ↗
AIME 2025	14.0	as of 2026-05-21	source ↗

Recommended use cases

long-context tasks
multimodal input
frontier-class inference at MoE economics

Available quantizations

GGUF: llama.cpp's container format and the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.

AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.

MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.

FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.

bitsandbytes: On-the-fly NF4 / INT8 weight quantization inside Transformers. Runs on Transformers.

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.
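For a sense of scale, the sketch below estimates weight-only memory for the 109B total parameters at nominal bit widths. The widths are simplifying assumptions (16 bits for an unquantized BF16 baseline, 8 for FP8 or INT8, 4 for AWQ or NF4); real files also carry quantization scales and metadata, and serving needs additional memory for the KV cache and activations, so actual requirements will be higher.

```python
TOTAL_PARAMS = 109_000_000_000  # "Total params: 109B" from the specs


def weight_gb(bits_per_weight: float) -> float:
    """Weight-only footprint in decimal gigabytes, ignoring quantization overhead."""
    return TOTAL_PARAMS * bits_per_weight / 8 / 1e9


# Nominal bit widths (assumptions, not measured file sizes).
nominal_bits = {
    "BF16 (unquantized baseline)": 16,
    "FP8 / INT8": 8,
    "4-bit (AWQ, NF4)": 4,
}

footprints = {name: weight_gb(bits) for name, bits in nominal_bits.items()}
for name, gb in footprints.items():
    print(f"{name}: ~{gb:.1f} GB")
```

Because every expert must be resident in memory even though only one is active per token, these totals are driven by the 109B figure rather than the 17B active count.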

Notable innovations

· 10M-token context window
· Native multimodal input
· First Llama MoE

Known limitations

· 10M-token context window degrades at lengths well below the stated maximum in independent evaluations. source ↗

Lineage

First Llama MoE; same Community License lineage.

Derived from

Llama 3.3 70B Instruct (2024-12-06)

Sources

Llama 4 announcement (Meta AI, Apr 5 2025) ↗