The Open-Source AI Stack

Mistral Medium 3 vs Phi-4 Reasoning

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field | A: Mistral Medium 3 | B: Phi-4 Reasoning
Released | 2025-05-07 | 2025-04-30
Developer | Mistral AI | Microsoft
Openness | Proprietary | Open
License | Proprietary | MIT
OSI-approved | no | yes
Data released | no | no
Training code | no | no
Architecture | unknown | dense
Total params | | 14B
Active params | |
Experts | |
Context window | 128K | 32K
Attention | unknown | unknown
Position enc. | unknown | unknown
Pretraining tokens | | 16B
Post-training | sft, rlhf | sft, rl
Training hardware | | H100
$/M input | $0.40 |
$/M output | $2.00 |
Output tok/sec 29

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

Benchmark | A: Mistral Medium 3 (as of) | B: Phi-4 Reasoning (as of)
MMLU-Pro | 76.0 (2026-05-21) | 74.3 (2025-04-30)
GPQA-Diamond | 57.8 (2026-05-21) | 65.8 (2025-04-30)

Code

Benchmark | A: Mistral Medium 3 (as of) | B: Phi-4 Reasoning (as of)
LiveCodeBench | 40.0 (2026-05-21) | 53.8 (2025-04-30)

Math

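Benchmark | A: Mistral Medium 3 (as of) | B: Phi-4 Reasoning (as of)
MATH | 90.7 (2026-05-21) | not reported
AIME 2024 | 44.0 (2026-05-21) | 75.3 (2025-04-30)
AIME 2025 | 30.3 (2026-05-21) | 62.9 (2025-04-30)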

Context · A

Mid-tier flagship released May 7, 2025, priced at $0.40 input / $2.00 output per million tokens, with a 128K context window. Mistral positioned it as roughly 90 percent of Claude Sonnet 3.7 performance at a fraction of the cost, with self-hosted deployment supported on setups starting at four GPUs.
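To make the pricing concrete, here is a minimal sketch of the per-request cost arithmetic at the listed rates. The function name and the example workload (10,000 input tokens, 1,000 output tokens) are hypothetical and chosen only for illustration; only the two prices come from this page.

```python
# Listed prices for Mistral Medium 3, in USD per million tokens (from this page).
INPUT_PRICE_PER_MTOK = 0.40
OUTPUT_PRICE_PER_MTOK = 2.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed prices."""
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


# Hypothetical workload, not a figure from the page.
cost = request_cost(10_000, 1_000)
print(f"${cost:.4f}")  # prints $0.0060
```

Because output tokens cost five times as much as input tokens here, long generations dominate the bill even when prompts are large.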

Context · B

14B reasoning-tuned Phi-4 derivative, SFT-only on curated reasoning traces and synthetic prompts. Trained in 2.5 days on 32 H100-80G GPUs over 16B tokens, with the Plus variant adding an RL stage. Microsoft positioned it as reaching DeepSeek R1 territory at a much smaller scale.
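As a rough sense of scale, the training figures above translate into GPU-hours as sketched below. This assumes all 32 GPUs ran continuously for the full 2.5 days, which the page does not state; the variable names are illustrative.

```python
# Figures from the paragraph above.
TRAINING_DAYS = 2.5
NUM_GPUS = 32
TRAINING_TOKENS = 16e9

# Assumption: every GPU was busy for the whole wall-clock duration.
gpu_hours = TRAINING_DAYS * 24 * NUM_GPUS
tokens_per_gpu_hour = TRAINING_TOKENS / gpu_hours

print(f"{gpu_hours:.0f} GPU-hours")  # prints 1920 GPU-hours
print(f"{tokens_per_gpu_hour:,.0f} tokens per GPU-hour")
```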
