Mistral Medium 3 vs Phi-4 Reasoning
Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.
Specs
| Field | A: Mistral Medium 3 | B: Phi-4 Reasoning |
|---|---|---|
| Released | 2025-05-07 | 2025-04-30 |
| Developer | Mistral AI | Microsoft |
| Openness | Proprietary | Open |
| License | Proprietary | MIT |
| OSI-approved | no | yes |
| Data released | no | no |
| Training code | no | no |
| Architecture | unknown | dense |
| Total params | — | 14B |
| Active params | — | — |
| Experts | — | — |
| Context window | 128K | 32K |
| Attention | unknown | unknown |
| Position enc. | unknown | unknown |
| Pretraining tokens | — | 16B (SFT tokens) |
| Post-training | sft, rlhf | sft |
| Training hardware | — | H100 |
| $/M input | $0.40 | — |
| $/M output | $2.00 | — |
| Output tok/sec | 29 | — |
Benchmarks
Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.
General reasoning
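| Benchmark | A: Mistral Medium 3 (as of) | B: Phi-4 Reasoning (as of) |
|---|---|---|
| MMLU-Pro | **76.0** (2026-05-21) | 74.3 (2025-04-30) |
| GPQA-Diamond | 57.8 (2026-05-21) | **65.8** (2025-04-30) |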
Code
| Benchmark | A: Mistral Medium 3 (as of) | B: Phi-4 Reasoning (as of) |
|---|---|---|
| LiveCodeBench | 40.0 (2026-05-21) | **53.8** (2025-04-30) |
Math
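| Benchmark | A: Mistral Medium 3 (as of) | B: Phi-4 Reasoning (as of) |
|---|---|---|
| MATH | 90.7 (2026-05-21) | not reported |
| AIME 2024 | 44.0 (2026-05-21) | **75.3** (2025-04-30) |
| AIME 2025 | 30.3 (2026-05-21) | **62.9** (2025-04-30) |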
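The leader rule stated under Benchmarks (highlight the higher score, never infer a missing one) can be sketched as follows. The scores are copied from the tables above. The `leader` helper is hypothetical, and returning no leader when one side is not reported is an assumption about how the page treats that case, not a documented behavior.

```python
from typing import Optional

# Scores copied from the benchmark tables: (A: Mistral Medium 3, B: Phi-4 Reasoning).
# None stands for "not reported".
SCORES: dict[str, tuple[Optional[float], Optional[float]]] = {
    "MMLU-Pro": (76.0, 74.3),
    "GPQA-Diamond": (57.8, 65.8),
    "LiveCodeBench": (40.0, 53.8),
    "MATH": (90.7, None),
    "AIME 2024": (44.0, 75.3),
    "AIME 2025": (30.3, 62.9),
}


def leader(a: Optional[float], b: Optional[float]) -> Optional[str]:
    """Return "A" or "B" for the higher score, or None if no leader can be named."""
    if a is None or b is None:
        # Assumption: with one score missing, no leader is declared
        # rather than inferring the missing value.
        return None
    if a == b:
        return None
    return "A" if a > b else "B"


leaders = {name: leader(a, b) for name, (a, b) in SCORES.items()}

for name, who in leaders.items():
    print(f"{name}: {who if who is not None else 'no leader'}")
```

Note that the two columns carry different as-of dates, so this comparison ranks the reported numbers only and says nothing about same-date performance.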
Context · A
Mid-tier flagship released May 7, 2025, priced at $0.40 input / $2.00 output per million tokens, with a 128K context window. Mistral positioned it as reaching roughly 90 percent of Claude Sonnet 3.7's performance at a fraction of the cost, with self-hosted deployment supported on setups starting at four GPUs.
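To make the per-million-token pricing concrete, here is a minimal sketch that turns the listed prices and output speed into a per-request cost and a rough generation time. The prices and the 29 tok/sec figure come from the Specs table. The workload (10,000 input tokens, 1,000 output tokens) and the function names are hypothetical, and the time estimate assumes the listed throughput holds steady and ignores time to first token.

```python
# Listed figures for Mistral Medium 3 (from the Specs table).
INPUT_USD_PER_MTOK = 0.40   # USD per million input tokens
OUTPUT_USD_PER_MTOK = 2.00  # USD per million output tokens
OUTPUT_TOK_PER_SEC = 29     # listed output speed


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-million-token prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


def generation_seconds(output_tokens: int) -> float:
    """Rough decode time, assuming constant throughput (an assumption)."""
    return output_tokens / OUTPUT_TOK_PER_SEC


# Hypothetical workload, chosen only for illustration.
cost = request_cost_usd(input_tokens=10_000, output_tokens=1_000)
seconds = generation_seconds(1_000)

print(f"cost per request: ${cost:.4f}")
print(f"approx. generation time: {seconds:.1f} s")
```

Because output tokens cost five times as much as input tokens at these prices, the 1,000 output tokens in this example account for a third of the total despite being a tenth of the input size.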
Context · B
14B reasoning-tuned Phi-4 derivative, trained with SFT only on curated reasoning traces and synthetic prompts. Training took 2.5 days on 32 H100-80G GPUs over 16B tokens; the separate Plus variant adds an RL stage. Microsoft positioned it as approaching DeepSeek R1 performance at a much smaller scale.
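The training figures above imply a compute budget that is easy to work out. The sketch below derives GPU-hours and an average token throughput from the three stated numbers (2.5 days, 32 GPUs, 16B tokens). It assumes all 32 GPUs were busy for the full 2.5 days and that the 16B tokens were all processed in that window; neither assumption is confirmed by the source, so treat the throughput as a rough average.

```python
# Figures stated in the Context paragraph for Phi-4 Reasoning.
TRAINING_DAYS = 2.5
NUM_GPUS = 32
TRAINING_TOKENS = 16e9  # 16B tokens

# Total GPU-hours, assuming every GPU ran for the whole period (an assumption).
gpu_hours = TRAINING_DAYS * 24 * NUM_GPUS

# Average tokens processed per GPU per second under the same assumption.
tokens_per_gpu_second = TRAINING_TOKENS / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours:.0f}")
print(f"avg tokens per GPU-second: {tokens_per_gpu_second:.0f}")
```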