The Open-Source AI Stack

Llama 3.3 70B Instruct

Source-available · Meta · 2024-12-06 · Llama 3.3 Community License

An incremental post-training refresh of the 70B class that approached 405B-class quality on several benchmarks at a fraction of the deployment cost. It is the last dense Llama before the Llama 4 family moved to MoE.

Why people cared

Llama 3.3 70B is what happens when a lab spends six months post-training the same base checkpoint. Meta did not release a new pretrain in December 2024; it released a 70B with refreshed post-training whose MMLU and HumanEval scores came within range of its 405B sibling, at a fraction of the deployment cost. The recipe (published only in the model card, not a paper) emphasized synthetic data from the 405B teacher, additional preference data, and refined rejection sampling.

The release landed at an interesting moment for the open-weights story: 70B-class checkpoints from Meta, Qwen, DeepSeek, and Mistral were close enough on most benchmarks that license, vendor support, and inference economics mattered more than capability deltas. Llama 3.3 70B became the default 70B-class checkpoint through 2025 for most production workloads where an Apache-2.0 license was not a hard requirement.

It is also the final dense Llama: the Llama 4 family that followed in April 2025 went MoE, the direction DeepSeek V3 had already taken and Qwen 3 took later the same month.
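Rejection sampling, one of the post-training ingredients named above, can be illustrated generically as best-of-n selection: draw several candidate responses, score them, and keep the best one only if it clears a quality bar. The sketch below is a toy under stated assumptions, not Meta's pipeline; `toy_generate`, `toy_score`, `n`, and `threshold` are hypothetical stand-ins for a real sampler and reward model.

```python
from typing import Callable, Optional


def rejection_sample(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 4,
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the best of n candidates, or None if none clears the threshold."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


# Hypothetical stand-ins so the sketch runs without a model.
def toy_generate(prompt: str, seed: int) -> str:
    return "answer" + "!" * seed


def toy_score(prompt: str, response: str) -> float:
    # Pretend reward: more "!" means a better response, capped at 1.0.
    return min(1.0, response.count("!") / 4)


kept = rejection_sample("What is GQA?", toy_generate, toy_score)
print(kept)
```

In a real pipeline the kept responses would become supervised fine-tuning examples, and rejected prompts would simply be dropped or resampled.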

Architecture

tokens in
  • Embedding: vocab 128,256 · llama3 tokenizer
  • × N layers:
      Grouped-Query Attention: RoPE (Llama 3 scaling) · context 131,072 tokens
      Dense MLP: SwiGLU activation (standard) · 70.6B active params
  • Output projection
tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.
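The 70.6B figure can be sanity-checked from the layer shapes. The sketch below assumes the dimensions in the publicly posted Llama 3 70B configuration (hidden size 8,192, 80 layers, 64 query heads, 8 KV heads, MLP width 28,672, untied input and output embeddings); only the 128,256 vocabulary appears on this page, so treat the rest as assumptions to verify against the model's own config file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaConfig:
    # Assumed values; only vocab_size is stated on this page.
    vocab_size: int = 128_256
    hidden_size: int = 8_192
    num_layers: int = 80
    num_heads: int = 64
    num_kv_heads: int = 8
    intermediate_size: int = 28_672


def count_params(cfg: LlamaConfig) -> int:
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = cfg.num_kv_heads * head_dim
    # GQA: full-width Q and O projections, narrower shared K and V projections.
    attn = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * cfg.hidden_size
    per_layer = attn + mlp + norms
    # Untied input embedding and output projection, plus the final norm.
    embeddings = 2 * cfg.vocab_size * cfg.hidden_size
    return cfg.num_layers * per_layer + embeddings + cfg.hidden_size


total = count_params(LlamaConfig())
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
# prints "70,553,706,496 parameters (~70.6B)"
```

Because the model is dense, every one of these weights participates in each forward pass, which is why total and active parameter counts coincide.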

Specs

Architecture: dense
Total params: 70B
Active params: 70.6B
Context window: 131K tokens
Attention: gqa
Position encoding: rope-llama3
Pretraining tokens: 15.0T
Training hardware: H100
Post-training: sft, dpo, rejection-sampling
OSI-approved: no
Data released: no
Training code: not released
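Grouped-query attention matters most at the full 131K context, where the KV cache dominates memory. The estimate below is a rough sketch assuming 80 layers, 8 KV heads, a head dimension of 128, and 2-byte (FP16/BF16) cache entries; those dimensions come from the public Llama 3 70B configuration rather than from this page, and the 64-head comparison is a hypothetical full multi-head baseline.

```python
def kv_cache_bytes(
    tokens: int,
    num_layers: int = 80,      # assumed
    num_kv_heads: int = 8,     # assumed (GQA)
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # FP16/BF16 cache entries
) -> int:
    # Each token stores one K and one V vector per KV head in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return tokens * per_token


full_context = 131_072
gqa = kv_cache_bytes(full_context)
mha = kv_cache_bytes(full_context, num_kv_heads=64)  # hypothetical MHA baseline

print(f"GQA cache at full context: {gqa / 2**30:.0f} GiB")  # 40 GiB
print(f"MHA cache at full context: {mha / 2**30:.0f} GiB")  # 320 GiB
```

Under these assumptions a single full-length sequence needs about 40 GiB of cache on top of the weights, an eighth of what 64 independent KV heads would require.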

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

General reasoning

MMLU 86.0 as of 2024-12-06 source ↗
GPQA-Diamond 50.5 as of 2024-12-06 source ↗

Code

HumanEval 88.4 as of 2024-12-06 source ↗

Recommended use cases

  • mid-tier production deployment
  • code assistance
  • general chat at 70B cost

Available quantizations

GGUF: llama.cpp's container; the common local format, k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
GPTQ: Post-training 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang, Transformers.
EXL2: ExLlamaV2's variable-bitrate format for consumer GPUs. Runs on ExLlamaV2.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.
bitsandbytes: On-the-fly NF4 / INT8 weight quantization inside Transformers. Runs on Transformers.

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.
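A quick way to compare these formats is the weight footprint implied by bits per weight. The sketch below uses the 70.6B parameter count from Specs and nominal bit widths; it deliberately ignores quantization scales, zero points, mixed-precision layers, KV cache, and activations, so real files and runtime memory will be somewhat larger. The BF16 row is an assumed unquantized baseline, not one of the families listed above.

```python
PARAMS = 70.6e9  # parameter count from Specs

# Nominal bits per weight; real formats add per-group metadata on top.
BITS_PER_WEIGHT = {
    "BF16 (unquantized baseline)": 16,
    "FP8 / INT8": 8,
    "4-bit (AWQ, GPTQ, NF4)": 4,
}


def weight_gb(params: float, bits: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:<30} ~{weight_gb(PARAMS, bits):.1f} GB")
```

Under these simplifications the weights need roughly 141 GB at 16 bits, 71 GB at 8 bits, and 35 GB at 4 bits, which is the gap between a multi-GPU node and a single large accelerator.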

Notable innovations

  • Approached 405B-class quality at 70B compute cost

Known limitations

  • Still released under the Llama 3.3 Community License, which is not OSI-approved. source ↗

Lineage

Same base; post-training refresh closed much of the 70B-vs-405B gap.

Sources