Llama 3.1 405B Instruct

Cost

— / Mtok input

— / Mtok output

Together AI · as of 2026-05-19

via Artificial Analysis ↗

Speed

— tok/sec output

Together AI · as of 2026-05-19

via Artificial Analysis ↗

Why people cared

Llama 3.1 405B is the first openly released model trained at GPT-4-class scale. The release was paired with a detailed 92-page technical report covering the pretraining recipe, post-training, and infrastructure, including the failure-rate analysis on Meta's 16,384-H100 cluster, which became a reference point for anyone planning frontier training. On benchmarks it lands within reach of contemporaneous closed frontier models on MMLU and reasoning suites, and it gave researchers their first access to a 400B-class checkpoint's weights for ablation studies and distillation experiments.

Its practical deployment story is more constrained. The weights alone take 810 GB in fp16 (405B parameters at 2 bytes each), more than the 640 GB available on an 8xH100 node with 80 GB GPUs, so fp16 serving has to be sharded across more than one node and single-node serving requires fp8 or lower precision. Even at fp8 (about 405 GB before KV cache and activations) it pushes against the limits of a single 8xH100 node. That cost-and-complexity ceiling is why the smaller 70B-class checkpoints (3.1 70B and the post-training-refreshed 3.3 70B) capture more production usage. The 405B's lasting impact is the published recipe and the synthetic data the larger checkpoint generated for post-training the 70B and 8B siblings, a pattern subsequent open-weights releases have copied.
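The memory arithmetic behind the deployment constraint can be checked with a short sketch. It counts weights only and assumes 80 GB per H100 and 8 GPUs per node; those two figures, the 4-bit row, and the function names are illustrative assumptions, not part of the spec sheet. Real quantized checkpoints also carry scale and zero-point overhead, and serving needs extra room for KV cache and activations, so treat the headroom as an upper bound.

```python
# Rough weight-only memory estimate for a 405B-parameter dense model.
# Assumptions (not from the spec sheet): 80 GB per H100, 8 GPUs per node.
PARAMS = 405e9
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4-bit": 0.5}
GPU_MEMORY_GB = 80
GPUS_PER_NODE = 8


def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Size of the weights in decimal gigabytes, ignoring quantization metadata."""
    return params * bytes_per_param / 1e9


def node_report(params: float = PARAMS) -> dict:
    """For each precision, report weight size and headroom on one node."""
    node_gb = GPU_MEMORY_GB * GPUS_PER_NODE
    report = {}
    for name, width in BYTES_PER_PARAM.items():
        weights_gb = weight_footprint_gb(params, width)
        report[name] = {
            "weights_gb": weights_gb,
            "fits_one_node": weights_gb < node_gb,
            # Headroom is what remains for KV cache and activations.
            "headroom_gb": node_gb - weights_gb,
        }
    return report


if __name__ == "__main__":
    for precision, row in node_report().items():
        print(precision, row)
```

Under these assumptions fp16 does not fit on one node at all, while fp8 fits but leaves a limited share of the node for KV cache and activations, which is the single-node pressure described above.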

Architecture

Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture: dense
Total params: 405B
Active params: 405B
Context window: 131K tokens
Attention: gqa
Position encoding: rope-llama3
Pretraining tokens: 15.6T
Training hardware: H100
Post-training: sft, dpo, rejection-sampling
OSI-approved: no
Data released: no
Training code: not released
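To illustrate what the gqa entry means for serving at the full context window, here is a minimal sketch of a KV-cache size estimate. The layer count (126), query heads (128), KV heads (8), and head dimension (128) are assumptions taken from Meta's published report rather than from the spec list above, and the 131K window is assumed to be exactly 131,072 tokens. The function names are hypothetical.

```python
# KV-cache size estimate for grouped-query attention (GQA).
# Assumed architecture figures (from the published report, not this spec list):
N_LAYERS = 126
N_QUERY_HEADS = 128
N_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_TOKENS = 131_072  # assumed exact size of the "131K" window


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes cached per token: one key and one value vector per KV head per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens: int, n_kv_heads: int) -> float:
    """Cache size in decimal gigabytes for one sequence, 16-bit values."""
    per_token = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads, HEAD_DIM)
    return tokens * per_token / 1e9


if __name__ == "__main__":
    gqa = kv_cache_gb(CONTEXT_TOKENS, N_KV_HEADS)
    mha = kv_cache_gb(CONTEXT_TOKENS, N_QUERY_HEADS)  # hypothetical no-GQA baseline
    print(f"GQA cache at full context: {gqa:.1f} GB")
    print(f"Full multi-head baseline:  {mha:.1f} GB")
```

With these assumed figures, sharing 8 KV heads across 128 query heads cuts the cache by a factor of 16 relative to a hypothetical full multi-head layout. The per-sequence cache at full context is still tens of gigabytes, which adds to the weight footprint when sizing a node.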

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

Recommended use cases

frontier-quality on-prem deployment
synthetic-data generation
teacher for distillation

Available quantizations

GGUF: llama.cpp's container format; the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.

AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.

GPTQ: Post-training 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang, Transformers.

MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.

FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.

bitsandbytes: On-the-fly NF4 / INT8 weight quantization inside Transformers. Runs on Transformers.

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.

Notable innovations

· First open-weights model at 400B+ scale
· Detailed published tech report

Known limitations

· 405B parameters at fp16 require 810 GB of VRAM for the weights alone; serving at fp8 (about 405 GB) still pushes single-node limits. source ↗
· Llama Community License's 700M-MAU clause makes the largest deployers ineligible without a separate Meta agreement. source ↗

Lineage

Largest dense Llama; trained with 16K H100s on the same data mix as the 8B and 70B.