The Open-Source AI Stack
RSS
All models

Models · qwen

Qwen 3 235B A22B Instruct

Open Alibaba · 2025-04-28 · Apache-2.0

Qwen's first major MoE release, with a hybrid thinking-vs- non-thinking inference mode controllable per request. Apache 2.0 across the size ladder reset the openness baseline among Chinese labs.

Cost

$0.45 / Mtok input
$1.80 / Mtok output

Together AI · as of 2026-05-21

via Artificial Analysis ↗

Speed

66.6 tok/sec output
1113 ms TTFT

Together AI · as of 2026-05-21

via Artificial Analysis ↗

Why people cared

Qwen 3 235B A22B was Alibaba's MoE pivot, and the headline feature was a hybrid thinking-vs-non-thinking inference toggle controllable per request. The schema convention `235B A22B` decodes as 235 billion total parameters and 22 billion active, which puts it in the same operational class as DeepSeek V3 (671B/37B) but at a different cost tier. Apache-2.0 across the entire Qwen 3 size ladder (from 0.6B to 235B) reset the openness baseline among Chinese labs, since DeepSeek's V3 and R1 used a custom DeepSeek License with field-of-use restrictions and Llama remained on community-license terms. The 36T-token pretrain extended Qwen 2.5's 18T, and the post-training stack included GRPO reasoning alongside conventional SFT and DPO. The lasting significance of the release was less about benchmark deltas, which are within noise of DeepSeek V3, and more about establishing that a frontier-grade open MoE could ship under a permissive license from a non-US lab. That made Qwen 3 the default starting point for open-weights agentic work through 2025, especially for organizations whose deployment counsel was uncomfortable with the DeepSeek license's field-of-use clauses.

Architecture

tokens in Embedding vocab not disclosed · qwen tokenizer × N layers Grouped-Query Attention RoPE context 131,072 tokens MoE Router 128 experts total · 8 active per token shown: 32 of 128 Output projection tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture
moe
Total params
235B
Active params
22B
Experts
128 total · 8 active
Context window
131K tokens
Attention
gqa
Position encoding
rope
Pretraining tokens
36.0T
Post-training
sft, dpo, grpo
OSI-approved
yes
Data released
no
Training code
not released

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

General reasoning

MMLU-Pro 83.0 as of 2025-04-28 source ↗

Code

LiveCodeBench 34.3 as of 2026-05-21 source ↗

Math

MATH 90.2 as of 2026-05-21 source ↗
AIME 2024 85.7 as of 2025-04-28 source ↗
AIME 2025 23.7 as of 2026-05-21 source ↗

Recommended use cases

  • hybrid reasoning + chat
  • agentic workflows
  • long-context retrieval

Available quantizations

GGUF llama.cpp's container; the common local format, k-quants from Q2 to Q8. runs on llama.cpp, Ollama
AWQ Activation-aware 4-bit weight quantization for GPU serving. runs on vLLM, SGLang
GPTQ Post-training 4-bit weight quantization for GPU serving. runs on vLLM, SGLang, Transformers
EXL2 ExLlamaV2's variable-bitrate format for consumer GPUs. runs on ExLlamaV2
MLX Apple MLX 4/8-bit layout for Apple silicon. runs on Apple MLX
FP8 8-bit float, frequently a native release on Hopper / Blackwell GPUs. runs on vLLM, SGLang, TensorRT-LLM

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.

Notable innovations

  • · Hybrid thinking mode toggle
  • · 36T-token pretrain
  • · Apache 2.0 across all sizes

Known limitations

  • · Hybrid thinking mode is controllable per request but increases inference cost by 3-10x depending on prompt; cost numbers above reflect non-thinking mode. source ↗

Lineage

First Qwen MoE with hybrid thinking-vs-fast inference modes.

Sources