The Open-Source AI Stack

Llama 3.1 405B Instruct

Source-available · Meta · 2024-07-23 · Llama 3.1 Community License

The first open-weights model trained at GPT-4-comparable scale. Pretraining used 16K H100 GPUs, and the release came with a published tech report; together they reset expectations for how open the frontier could be.

Cost

Price per Mtok input and per Mtok output
Together AI · as of 2026-05-19 · via Artificial Analysis ↗

Speed

Output speed in tok/sec
Together AI · as of 2026-05-19 · via Artificial Analysis ↗

Why people cared

Llama 3.1 405B was the first openly released model trained at GPT-4-class scale. The release was paired with a detailed 92-page technical report covering the pretraining recipe, post-training, and infrastructure, including the failure-rate analysis on Meta's 16,384-H100 cluster, which became a reference point for anyone planning frontier training. On MMLU and reasoning suites it lands within reach of contemporaneous closed frontier models, and for the first time researchers had the weights of a 400B-class checkpoint available for ablation studies and distillation experiments.

Its practical deployment story is more constrained. The weights alone occupy 810 GB in fp16 (405B parameters at 2 bytes each), so inference requires sharding the weights across many GPUs, quantizing to fp8, or both, and even at fp8 the model pushes against the limits of a single 8xH100 node. That cost-and-complexity ceiling is why the smaller 70B-class checkpoints (3.1 70B and the post-training-refreshed 3.3 70B) capture more production usage. The 405B's lasting impact is the published recipe and the synthetic data it generated for post-training its 70B and 8B siblings, a pattern that subsequent open-weights releases have copied.
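The memory figures above follow from simple arithmetic, sketched here in Python. The parameter count and the 810 GB fp16 figure come from this page; the node size (8 GPUs with 80 GB each) and the 4-bit entry are assumptions added for illustration, and the function names are hypothetical. The estimate covers weights only and ignores KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate; weights only, no KV cache or activations.
PARAMS = 405e9  # parameter count from the model card

# Bytes per parameter. The int4 entry ignores quantization metadata such as scales.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

# Assumption: a single node with 8 GPUs of 80 GB each.
NODE_GPUS = 8
GPU_MEMORY_GB = 80.0


def weight_footprint_gb(params: float, precision: str) -> float:
    """Return weight storage in decimal gigabytes for the given precision."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def node_headroom_gb(params: float, precision: str) -> float:
    """Memory left on the assumed node after loading weights (negative: does not fit)."""
    return NODE_GPUS * GPU_MEMORY_GB - weight_footprint_gb(params, precision)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_footprint_gb(PARAMS, precision)
        headroom = node_headroom_gb(PARAMS, precision)
        print(f"{precision}: {size:.1f} GB weights, {headroom:+.1f} GB headroom")
```

Under these assumptions fp16 overshoots the node, while fp8 fits but leaves limited room for the KV cache, which is consistent with the "pushes against the limits" description.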

Architecture

tokens in
Embedding: vocab 128,256 · llama3 tokenizer
× N layers:
  • Grouped-Query Attention: RoPE (Llama 3 scaling) · context 131,072 tokens
  • Dense MLP: SwiGLU activation (standard) · 405B active params
Output projection
tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.
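To illustrate why grouped-query attention matters at a 131,072-token context, the sketch below estimates KV-cache size for one sequence. Only the context length comes from this page. The layer count, head counts, and head dimension are hypothetical toy values, not the model's real configuration, and `kv_cache_bytes` is an illustrative helper rather than any library's API.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


CONTEXT_TOKENS = 131_072  # context window from the model card

# Hypothetical toy configuration, NOT the real 405B layer or head counts.
TOY_LAYERS = 4
TOY_QUERY_HEADS = 32
TOY_KV_HEADS = 8  # grouped-query attention: several query heads share each KV head
TOY_HEAD_DIM = 128

# Multi-head attention would cache one KV head per query head.
mha_bytes = kv_cache_bytes(TOY_LAYERS, TOY_QUERY_HEADS, TOY_HEAD_DIM, CONTEXT_TOKENS)
# Grouped-query attention caches only the shared KV heads.
gqa_bytes = kv_cache_bytes(TOY_LAYERS, TOY_KV_HEADS, TOY_HEAD_DIM, CONTEXT_TOKENS)

if __name__ == "__main__":
    print(f"MHA cache: {mha_bytes / 2**30:.1f} GiB")
    print(f"GQA cache: {gqa_bytes / 2**30:.1f} GiB")
    print(f"reduction: {mha_bytes // gqa_bytes}x")
```

The saving scales with the ratio of query heads to KV heads, so the same function can be re-run with a model's published configuration when that is available.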

Specs

Architecture: dense
Total params: 405B
Active params: 405B
Context window: 131K tokens
Attention: gqa
Position encoding: rope-llama3
Pretraining tokens: 15.6T
Training hardware: H100
Post-training: sft, dpo, rejection-sampling
OSI-approved: no
Data released: no
Training code: not released
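Two derived quantities can be read off the specs with a little arithmetic, sketched below. The parameter and token counts come from this page. The compute estimate uses the common rule of thumb of roughly 6 FLOPs per parameter per token for dense transformers, which is an assumption brought in here and not a figure from this page; treat the result as an order-of-magnitude estimate.

```python
PARAMS = 405e9            # total (and active) parameters from the specs
PRETRAIN_TOKENS = 15.6e12  # pretraining tokens from the specs


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token (assumption)."""
    return 6.0 * params * tokens


if __name__ == "__main__":
    ratio = tokens_per_param(PRETRAIN_TOKENS, PARAMS)
    flops = approx_training_flops(PARAMS, PRETRAIN_TOKENS)
    print(f"tokens per parameter: {ratio:.1f}")
    print(f"approximate pretraining compute: {flops:.2e} FLOPs")
```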

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

Recommended use cases

  • frontier-quality on-prem deployment
  • synthetic-data generation
  • teacher for distillation

Available quantizations

GGUF: llama.cpp's container and the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
GPTQ: Post-training 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang, Transformers.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.
bitsandbytes: On-the-fly NF4 / INT8 weight quantization inside Transformers. Runs on Transformers.

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.
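When choosing a format, it is often easier to start from the runtime already in use. The sketch below inverts the list above into a lookup. The format-to-runtime pairs are copied from this page as of its audit; the dictionary and function names are hypothetical, and runtime support changes over time.

```python
# Format -> runtimes, transcribed from the quantization list on this page.
QUANT_RUNTIMES = {
    "GGUF": ["llama.cpp", "Ollama"],
    "AWQ": ["vLLM", "SGLang"],
    "GPTQ": ["vLLM", "SGLang", "Transformers"],
    "MLX": ["Apple MLX"],
    "FP8": ["vLLM", "SGLang", "TensorRT-LLM"],
    "bitsandbytes": ["Transformers"],
}


def formats_for_runtime(runtime: str) -> list[str]:
    """Return the quantization formats listed as runnable on the given runtime."""
    return [fmt for fmt, runtimes in QUANT_RUNTIMES.items() if runtime in runtimes]


if __name__ == "__main__":
    for runtime in ("vLLM", "Transformers", "Ollama"):
        print(runtime, "->", ", ".join(formats_for_runtime(runtime)))
```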

Notable innovations

  • First open-weights model at 400B+ scale
  • Detailed published tech report

Known limitations

  • At fp16, the 405B parameters need 810 GB of VRAM for weights alone; serving at fp8 still pushes single-node limits. source ↗
  • The Llama 3.1 Community License's 700M-MAU clause means the largest deployers cannot use the model without a separate agreement with Meta. source ↗

Lineage

Largest dense Llama; trained with 16K H100s on the same data mix as the 8B and 70B.

Sources