
Mistral Medium 3 vs Phi-4 Reasoning

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field	A: Mistral Medium 3	B: Phi-4 Reasoning
Released	2025-05-07	2025-04-30
Developer	Mistral AI	Microsoft
Openness	Proprietary	Open
License	Proprietary	MIT
OSI-approved	no	yes
Data released	no	no
Training code	no	no
Architecture	unknown	dense
Total params	—	14B
Active params	—	—
Experts	—	—
Context window	128K	32K
Attention	unknown	unknown
Position enc.	unknown	unknown
Pretraining tokens	—	16B
Post-training	sft, rlhf	sft
Training hardware	—	H100
$/M input	$0.40	—
$/M output	$2.00	—
Output tok/sec	29	—
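
The pricing and throughput rows above can be turned into a rough per-request estimate for Mistral Medium 3. The sketch below is only illustrative: the function names and the example token counts are assumptions, and it treats the listed 29 tok/sec as a steady decoding rate, ignoring time to first token, batching, and any provider-side discounts.

```python
# Figures copied from the Specs table (Mistral Medium 3 column).
PRICE_PER_M_INPUT = 0.40    # USD per million input tokens
PRICE_PER_M_OUTPUT = 2.00   # USD per million output tokens
OUTPUT_TOK_PER_SEC = 29     # listed output speed


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-million prices."""
    return (
        input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    ) / 1_000_000


def generation_seconds(output_tokens: int) -> float:
    """Estimated decode time, assuming a constant output rate (an assumption)."""
    return output_tokens / OUTPUT_TOK_PER_SEC


if __name__ == "__main__":
    # Hypothetical request: 10,000 prompt tokens, 1,000 generated tokens.
    cost = request_cost(10_000, 1_000)
    seconds = generation_seconds(1_000)
    print(f"cost: ${cost:.4f}, decode time: {seconds:.1f} s")
```

No equivalent estimate is possible for Phi-4 Reasoning from this table, since its price and speed fields are not reported.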

Benchmarks

Missing scores are shown as not reported and are never inferred. Each score is followed by its as-of date. Bold highlights the leader per benchmark.

General reasoning

MMLU-Pro	76.0 2026-05-21	74.3 2025-04-30
GPQA-Diamond	57.8 2026-05-21	65.8 2025-04-30

Code

LiveCodeBench	40.0 2026-05-21	53.8 2025-04-30

Math

MATH	90.7 2026-05-21	—
AIME 2024	44.0 2026-05-21	75.3 2025-04-30
AIME 2025	30.3 2026-05-21	62.9 2025-04-30
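
The "leader per benchmark" rule can be sketched as a small comparison that refuses to pick a winner when either score is missing. This is a minimal illustration, not the page's actual logic: the `leader` function and the choice to return no leader for unreported scores are assumptions. The as-of dates also differ between the two columns, so the margins are indicative only.

```python
from typing import Optional

# Scores copied from the Benchmarks section as (A, B);
# None marks a score that is not reported.
SCORES: dict[str, tuple[Optional[float], Optional[float]]] = {
    "MMLU-Pro": (76.0, 74.3),
    "GPQA-Diamond": (57.8, 65.8),
    "LiveCodeBench": (40.0, 53.8),
    "MATH": (90.7, None),
    "AIME 2024": (44.0, 75.3),
    "AIME 2025": (30.3, 62.9),
}


def leader(a: Optional[float], b: Optional[float]) -> Optional[str]:
    """Return "A", "B", or "tie"; return None if either score is missing."""
    if a is None or b is None:
        return None  # never infer a missing score
    if a == b:
        return "tie"
    return "A" if a > b else "B"


if __name__ == "__main__":
    for name, (a, b) in SCORES.items():
        print(f"{name}: {leader(a, b) or 'not comparable'}")
```

On the scores listed here, Mistral Medium 3 leads only on MMLU-Pro, Phi-4 Reasoning leads on GPQA-Diamond, LiveCodeBench, and both AIME sets, and MATH is not comparable.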

Context · A

Mid-tier flagship released May 7, 2025, priced at $0.40 input / $2.00 output per million tokens, with a 128K context window. Mistral positioned it as delivering roughly 90 percent of Claude Sonnet 3.7's performance at a fraction of the cost, with self-hosted deployment supported on setups starting at four GPUs.

Context · B

14B reasoning-tuned derivative of Phi-4, post-trained with SFT only on curated reasoning traces and synthetic prompts. Training took 2.5 days on 32 H100-80G GPUs over 16B tokens; the separate Plus variant adds an RL stage. Microsoft positioned it as reaching DeepSeek R1 territory at a much smaller scale.
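
The training figures quoted above imply a compute budget that is easy to work out. The sketch below is back-of-the-envelope arithmetic only: it assumes all 32 GPUs were busy for the full 2.5 days and that the 16B figure counts every token processed, neither of which the source states.

```python
# Figures quoted in the Context B blurb.
TRAINING_DAYS = 2.5
NUM_GPUS = 32
TOKENS_PROCESSED = 16e9

# Total GPU-hours, assuming full utilization for the whole run (an assumption).
gpu_hours = TRAINING_DAYS * 24 * NUM_GPUS

# Average throughput per GPU implied by those totals.
tokens_per_gpu_second = TOKENS_PROCESSED / (gpu_hours * 3600)

if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours:.0f}")
    print(f"tokens per GPU-second: {tokens_per_gpu_second:.0f}")
```

Under these assumptions the run comes to 1,920 H100 GPU-hours, or roughly 2,300 tokens per GPU per second.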
