The Open-Source AI Stack

Qwen 2 72B Instruct

Source-available · Alibaba · 2024-06-06 · Tongyi Qianwen License

Predecessor to the Qwen 2.5 family. The 72B model extended context to 128K tokens via YaRN and added 27 languages beyond English and Chinese to the pretraining mix. Alibaba kept the 72B variant under the Tongyi Qianwen License while the smaller siblings moved to Apache 2.0.

Cost

$0.00 / Mtok input
$0.00 / Mtok output

Together AI · as of 2026-05-21

via Artificial Analysis ↗

Speed

0 tok/sec output
0 ms TTFT

as of 2026-05-21

source ↗

Architecture

Architecture diagram: tokens in → Embedding (vocab 152,064 · qwen tokenizer) → × N layers [Grouped-Query Attention, RoPE, context 131,072 tokens; Dense MLP, SwiGLU activation (standard)] → Output projection → tokens out. 72.7B active params.
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture: dense
Total params: 72.7B
Active params: 72.7B
Context window: 131K tokens
Attention: GQA
Position encoding: RoPE
Post-training: SFT, RLHF, DPO
OSI-approved: no
Data released: no
Training code: not released
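
The specs above pair grouped-query attention with a 131K-token context, and the reason that pairing matters is KV-cache size. The sketch below is a rough estimate, not a measured figure. The layer count, head counts, and head dimension are assumptions taken from commonly reported Qwen 2 72B hyperparameters; they are not listed on this page and should be checked against the model's config before relying on the numbers.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# ASSUMED hyperparameters (not stated on this page).
LAYERS = 80
QUERY_HEADS = 64
KV_HEADS = 8       # GQA: several query heads share one KV head
HEAD_DIM = 128
CONTEXT = 131_072  # from the context window listed above

gqa = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)
# Hypothetical baseline: one KV head per query head (plain multi-head attention).
mha = kv_cache_bytes(CONTEXT, LAYERS, QUERY_HEADS, HEAD_DIM)

print(f"GQA cache at full context: {gqa / 2**30:.0f} GiB")
print(f"MHA cache at full context: {mha / 2**30:.0f} GiB")
```

Under these assumptions, and with 16-bit cache values, a single full-length sequence needs about one eighth of the cache that the multi-head baseline would. The estimate ignores paging, cache quantization, and batching.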

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

General reasoning

MMLU 82.3 as of 2024-06-07 source ↗
MMLU-Pro 64.4 as of 2024-06-07 source ↗
GPQA-Diamond 37.1 as of 2026-05-21 source ↗

Code

HumanEval 86.0 as of 2024-06-07 source ↗
LiveCodeBench 15.9 as of 2026-05-21 source ↗

Math

MATH 59.7 as of 2024-06-07 source ↗
AIME 2024 14.7 as of 2026-05-21 source ↗

Held-out / arena

IFEval 77.6 as of 2024-06-07 source ↗

Available quantizations

GGUF: llama.cpp's container; the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
GPTQ: Post-training 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang, Transformers.
EXL2: ExLlamaV2's variable-bitrate format for consumer GPUs. Runs on ExLlamaV2.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.
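
To get a feel for what these formats mean at this model size, the sketch below estimates weight storage from the 72.7B parameter count listed in the specs. It is back-of-the-envelope arithmetic under a simplifying assumption: every weight is stored at the nominal bit-width. Real quantized files also carry scales and zero points, often keep some tensors at higher precision, and serving needs extra memory for the KV cache and activations.

```python
TOTAL_PARAMS = 72.7e9  # total parameter count from the specs


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


# Nominal bit-widths only; per-format overhead is deliberately ignored.
for name, bits in [("FP16/BF16", 16), ("FP8 or 8-bit", 8), ("4-bit (AWQ, GPTQ)", 4)]:
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB")
```

The function also works for fractional bit-widths, which gives a rough way to compare variable-bitrate formats such as EXL2 or the GGUF k-quants.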

Notable innovations

  • 128K context via YaRN
  • Online DPO post-training
  • 27 additional languages in the pretraining mix
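
As a rough illustration of the YaRN item, the sketch below derives the scaling factor needed to stretch a shorter native window to the 131,072-token context listed on this page. The 32,768-token native window is an assumption (it does not appear on this page). The helper name is hypothetical, and the dictionary keys follow the `rope_scaling` convention seen in Hugging Face style config files; check the model's own documentation for the exact keys your serving stack expects.

```python
def yarn_rope_scaling(target_context, native_context):
    """Build a rope_scaling-style dict for stretching a context window with YaRN."""
    if target_context <= native_context:
        raise ValueError("target context must exceed the native context")
    return {
        "type": "yarn",
        # Ratio between the extended window and the window used in training.
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


# ASSUMED native window of 32,768 tokens; 131,072 is the context listed on this page.
cfg = yarn_rope_scaling(target_context=131_072, native_context=32_768)
print(cfg)
```

This only computes the configuration values. The frequency interpolation itself happens inside the inference framework.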

Sources