The Open-Source AI Stack

Llama 4 Scout

Source-available Meta · 2025-04-05 · Llama 4 Community License

Llama's first MoE family. Scout's headline was a 10M-token context window and natively multimodal vision-language input. The MoE pivot followed DeepSeek V3 in establishing MoE as mainstream for open-weights frontier work.

Cost

$0.17 / Mtok input
$0.66 / Mtok output

Together AI · as of 2026-05-21

via Artificial Analysis ↗
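As a rough illustration of what the per-Mtok prices mean for one request, the sketch below multiplies token counts by the listed rates. The function name and the example workload are hypothetical; only the two prices come from the figures above, and they are tied to the stated provider and date.

```python
# Prices copied from the Cost figures above (USD per million tokens).
INPUT_USD_PER_MTOK = 0.17
OUTPUT_USD_PER_MTOK = 0.66


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-Mtok prices."""
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Hypothetical workload: a 100,000-token prompt with a 2,000-token reply.
example_cost = request_cost_usd(100_000, 2_000)
print(f"${example_cost:.4f}")
```

At these rates, long prompts dominate the bill only when the input is much larger than the output, since output tokens cost roughly four times as much per token.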

Speed

108.1 tok/sec output
561 ms TTFT

Together AI · as of 2026-05-21

via Artificial Analysis ↗
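The two speed figures can be combined into a back-of-the-envelope response time. This is a simplified model, not a measurement: it assumes the decode rate stays constant for the whole reply and that TTFT does not grow with prompt length, neither of which is guaranteed. The function name and the 500-token reply are hypothetical.

```python
# Figures copied from the Speed section above.
TTFT_SECONDS = 0.561          # time to first token
OUTPUT_TOK_PER_SEC = 108.1    # steady-state output throughput


def estimated_latency_seconds(output_tokens: int) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest.

    Assumes a constant decode rate and a TTFT independent of prompt length.
    """
    return TTFT_SECONDS + output_tokens / OUTPUT_TOK_PER_SEC


# Hypothetical 500-token reply.
example_latency = estimated_latency_seconds(500)
print(f"{example_latency:.2f} s")
```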

Why people cared

Llama 4 Scout is Meta's first MoE release, and the public reception was mixed enough to be a story in itself. The headline features were a 10M-token context window (the longest in any frontier model at release) and native multimodal input, both delivered alongside an architecture pivot that followed DeepSeek V3 in establishing MoE as mainstream for open-weights frontier work. Independent evaluators raised three objections: the announced benchmark scores were not consistently reproducible at deployment; the 10M-token context was achievable in theory but degraded at lengths well below the stated maximum; and the LMArena ranking featured in the launch material was for a chat-tuned variant not available as weights. The release remains historically significant as the moment Llama abandoned dense scaling, but the immediate developer narrative was that DeepSeek V3 (released about three months earlier, in late December 2024) and then Qwen 3 (released a few weeks after Scout) executed the open-weights MoE story more cleanly. Llama 4 Scout's lasting value depends on whether its siblings on the same architecture, Maverick and Behemoth, deliver on the long-context and multimodal promises in production deployment.

Architecture

tokens in
  → Embedding (vocab not disclosed · llama4 tokenizer)
  → × N layers:
      Grouped-Query Attention · RoPE (Llama 3 scaling) · context 10,485,760 tokens
      MoE Router · 16 experts total · 1 active per token
  → Output projection
  → tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.
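The "16 experts total · 1 active per token" label describes top-1 routing: a router scores every expert for each token and only the best-scoring expert runs. The sketch below is a generic, pure-Python illustration of that idea using softmax gating; it is not Meta's implementation, and the real model may differ in how gate weights are computed or in which components always run. All names and the toy experts are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits):
    """Return (expert_index, gate_weight) for one token under top-1 routing."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_forward(token, router_logits, experts):
    """Run only the selected expert and scale its output by the gate weight."""
    expert, gate = route_top1(router_logits)
    return [gate * y for y in experts[expert](token)]


NUM_EXPERTS = 16  # matches the "16 experts total" label above

# Toy experts (assumption for illustration): expert i multiplies by i + 1.
experts = [lambda x, k=i + 1: [k * v for v in x] for i in range(NUM_EXPERTS)]

# Hypothetical router scores that clearly favour expert 3.
logits = [0.0] * NUM_EXPERTS
logits[3] = 5.0

chosen, gate = route_top1(logits)
output = moe_forward([1.0, 2.0], logits, experts)
print(chosen, round(gate, 3), output)
```

The point of the design is that the other 15 experts do no work for this token, which is why active parameters (17B) are far smaller than total parameters (109B).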

Specs

Architecture: moe
Total params: 109B
Active params: 17B
Experts: 16 total · 1 active
Context window: 10.5M tokens
Attention: gqa
Position encoding: rope-llama3
Pretraining tokens: 40.0T
Training hardware: H100
Post-training: sft, dpo, rejection-sampling
OSI-approved: no
Data released: no
Training code: not released

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

Code

LiveCodeBench · 29.9 · as of 2026-05-21 · source ↗

Math

MATH · 84.4 · as of 2026-05-21 · source ↗
AIME 2024 · 28.3 · as of 2026-05-21 · source ↗
AIME 2025 · 14.0 · as of 2026-05-21 · source ↗

Recommended use cases

  • long-context tasks
  • multimodal input
  • frontier-class inference at MoE economics

Available quantizations

GGUF: llama.cpp's container; the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.
bitsandbytes: On-the-fly NF4 / INT8 weight quantization inside Transformers. Runs on Transformers.

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.
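To compare the quantization families above, it helps to estimate how much memory the weights alone would occupy. The sketch below uses the 109B total-parameter figure from Specs and nominal bit widths. Treat the results as lower bounds under stated assumptions: real files carry scales, zero-points, and some higher-precision tensors, and the estimate ignores KV cache and activations. The BF16 row is an assumed unquantized baseline, not a format listed above, and the helper name is hypothetical.

```python
TOTAL_PARAMS = 109e9   # from Specs: total params
ACTIVE_PARAMS = 17e9   # from Specs: active params per token


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a nominal bit width."""
    return params * bits_per_weight / 8 / 2**30


# Nominal widths only; actual on-disk sizes vary by format and recipe.
NOMINAL_BITS = {
    "BF16 (assumed baseline)": 16,
    "8-bit (FP8, INT8)": 8,
    "4-bit (AWQ, NF4, MLX 4-bit)": 4,
}

estimates = {
    name: weight_gib(TOTAL_PARAMS, bits) for name, bits in NOMINAL_BITS.items()
}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for name, gib in estimates.items():
    print(f"{name}: ~{gib:.0f} GiB")
print(f"active fraction per token: {active_fraction:.1%}")
```

Memory is sized by total parameters because every expert must be resident even though only one is used per token; the active fraction affects compute per token, not the weight footprint.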

Notable innovations

  • 10M-token context window
  • Native multimodal input
  • First Llama MoE

Known limitations

  • The 10M-token context window degrades at lengths well below the stated maximum in independent evaluations. source ↗

Lineage

First Llama MoE; same Community License lineage.

Sources