Grok 3 vs Phi-4-mini Instruct
Numbers carry their as-of date and primary source.
Specs
| Field | A: Grok 3 | B: Phi-4-mini Instruct |
|---|---|---|
| Released | 2025-02-17 | — |
| Developer | xAI | Microsoft |
| Openness | Proprietary | Open |
| License | Proprietary | MIT |
| OSI-approved | no | yes |
| Data released | no | no |
| Training code | no | no |
| Architecture | unknown | dense |
| Total params | — | 3.8B |
| Active params | — | — |
| Experts | — | — |
| Context window | 131K | 128K |
| Attention | unknown | gqa |
| Position enc. | unknown | rope |
| Pretraining tokens | — | 5T |
| Post-training | rlhf | sft, dpo |
| Training hardware | H100 | A100 |
| $/M input | $4.00 | — |
| $/M output | $20.00 | — |
| Output tok/sec | — | — |
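As a rough illustration of how the per-million-token prices in the table translate into a per-request cost, the sketch below applies the listed Grok 3 rates ($4.00 input, $20.00 output). The function name and the example token counts are hypothetical, and actual billing may include terms not shown in the table.

```python
from decimal import Decimal

# Prices taken from the Specs table (USD per million tokens, Grok 3).
INPUT_USD_PER_M = Decimal("4.00")
OUTPUT_USD_PER_M = Decimal("20.00")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Hypothetical helper: estimated USD cost of one request at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (Decimal(input_tokens) * INPUT_USD_PER_M
             + Decimal(output_tokens) * OUTPUT_USD_PER_M)
    return total / TOKENS_PER_M


# Example with made-up token counts: 10,000 input and 2,000 output tokens.
print(estimate_request_cost(10_000, 2_000))
```

With these example counts the input and output portions each contribute $0.04, for $0.08 in total. No equivalent estimate is possible for Phi-4-mini Instruct, since the table reports no hosted price for it.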
Benchmarks
Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.
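General reasoning
| Benchmark | A: Grok 3 (as of) | B: Phi-4-mini Instruct (as of) |
|---|---|---|
| MMLU | — | 67.3 (2025-02-26) |
| MMLU-Pro | 79.9 (2026-05-21) | — |
| GPQA-Diamond | 69.3 (2026-05-21) | — |
Code
| Benchmark | A: Grok 3 (as of) | B: Phi-4-mini Instruct (as of) |
|---|---|---|
| LiveCodeBench | 42.5 (2026-05-21) | — |
Math
| Benchmark | A: Grok 3 (as of) | B: Phi-4-mini Instruct (as of) |
|---|---|---|
| MATH | **87.0** (2026-05-21) | 64.0 (2025-02-26) |
| AIME 2024 | 33.0 (2026-05-21) | — |
| AIME 2025 | 58.0 (2026-05-21) | — |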
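The "never inferred" rule for missing scores can be sketched as a small comparison helper. This is only an illustration: the function name is hypothetical, and it assumes a leader is named only when both models report a score for the same benchmark, which the page does not state explicitly.

```python
from typing import Optional


def leader(score_a: Optional[float], score_b: Optional[float]) -> Optional[str]:
    """Hypothetical helper: return "A", "B", or None when no leader can be named."""
    # A missing score is never inferred, so no comparison is made.
    if score_a is None or score_b is None:
        return None
    # Assumption: a tie produces no leader.
    if score_a == score_b:
        return None
    return "A" if score_a > score_b else "B"


# Scores copied from the benchmark rows above; None means "not reported".
rows = {
    "MMLU": (None, 67.3),
    "MATH": (87.0, 64.0),
    "AIME 2025": (58.0, None),
}
for name, (a, b) in rows.items():
    print(name, leader(a, b) or "no leader")
```

Under this assumption MATH is the only benchmark listed here where a leader can be named, because it is the only row with a score for both models.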
Context · A
xAI's third-generation flagship, trained on the Colossus supercomputer (approximately 200,000 GPUs) with roughly 10x the compute of Grok 2. Released alongside a separate Grok 3 Reasoning variant and a DeepSearch product, with xAI claiming wins over GPT-4o on AIME math and GPQA science benchmarks. API access launched in April 2025.
Context · B
Small-tier Phi-4 model, released February 2025: a 3.8B-parameter dense decoder-only transformer with a 128K context window, a 200K-token vocabulary, and grouped-query attention. Trained on 5T tokens for 21 days on 512 A100-80G GPUs, with a data cutoff of June 2024. Supports 22 languages.
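The training figures above imply a compute budget that can be worked out with simple arithmetic. The sketch below derives total GPU-hours and average token throughput from the stated 5T tokens, 21 days, and 512 GPUs. It also adds a rough FLOP estimate using the common 6 x parameters x tokens rule of thumb, which is an assumption introduced here rather than a figure from the source. It further assumes all 512 GPUs ran for the full 21 days; the helper name is hypothetical.

```python
# Figures quoted in the Context B paragraph.
PARAMS = 3.8e9   # 3.8B parameters
TOKENS = 5e12    # 5T training tokens
DAYS = 21
GPUS = 512


def training_summary(params: float, tokens: float, days: int, gpus: int) -> dict:
    """Hypothetical helper: back-of-the-envelope training statistics."""
    gpu_hours = gpus * days * 24
    wall_seconds = days * 24 * 3600
    return {
        "gpu_hours": gpu_hours,
        "tokens_per_second": tokens / wall_seconds,
        "tokens_per_gpu_second": tokens / (wall_seconds * gpus),
        # Assumption: ~6 FLOPs per parameter per token for a dense transformer.
        "approx_train_flops": 6 * params * tokens,
    }


summary = training_summary(PARAMS, TOKENS, DAYS, GPUS)
for name, value in summary.items():
    print(f"{name}: {value:.3g}")
```

Under these assumptions the run comes to 258,048 GPU-hours, about 5,400 tokens per GPU per second, and roughly 1.1e23 training FLOPs. These are order-of-magnitude estimates, not reported measurements.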