
Grok 3 vs Phi-4-mini Instruct

Numbers carry their as-of date and primary source.

Specs

Field	A: Grok 3	B: Phi-4-mini Instruct
Released	2025-02-17	—
Developer	xAI	Microsoft
Openness	Proprietary	Open
License	Proprietary	MIT
OSI-approved	no	yes
Data released	no	no
Training code	no	no
Architecture	unknown	dense
Total params	—	3.8B
Active params	—	3.8B
Experts	—	—
Context window	131K	128K
Attention	unknown	GQA
Position enc.	unknown	RoPE
Pretraining tokens	—	5T
Post-training	RLHF	SFT, DPO
Training hardware	H100	A100
$/M input	$4.00	—
$/M output	$20.00	—
Output tok/sec	—	—
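At the listed rates, Grok 3 API cost scales linearly with token counts. This is a minimal sketch, assuming flat per-token pricing with no cached-input, batch, or tiered discounts (the table lists none); `request_cost_usd` is a hypothetical helper, not part of any xAI SDK. Phi-4-mini Instruct has no listed price, so it is left out.

```python
# Per-million-token list prices for Grok 3, from the Specs table.
# Assumption: cost is linear in tokens, with no caching or batch discounts.
GROK3_INPUT_USD_PER_M = 4.00
GROK3_OUTPUT_USD_PER_M = 20.00


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float = GROK3_INPUT_USD_PER_M,
                     output_price_per_m: float = GROK3_OUTPUT_USD_PER_M) -> float:
    """Estimate the cost of one request from token counts and $/M prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a 10,000-token prompt with a 1,000-token reply.
    print(f"${request_cost_usd(10_000, 1_000):.4f}")
```

For that hypothetical request the cost is $0.04 + $0.02 = $0.06. Because output is priced at five times input, a reply only a fifth as long as the prompt costs as much as the prompt itself.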

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU	—	67.3 2025-02-26
MMLU-Pro	79.9 2026-05-21	—
GPQA-Diamond	69.3 2026-05-21	—

Code

LiveCodeBench	42.5 2026-05-21	—

Math

MATH	87.0 2026-05-21	64.0 2025-02-26
AIME 2024	33.0 2026-05-21	—
AIME 2025	58.0 2026-05-21	—
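The rule that missing scores are never inferred can be sketched as a per-benchmark leader check over the scores in these tables. This is a minimal sketch, assuming a leader is declared only when both models report a benchmark and that ties have no leader; `SCORES`, `MODELS`, and `leader` are hypothetical names, not part of any published tool. The two MATH scores carry different as-of dates, so even that single head-to-head comparison is indicative only.

```python
from typing import Dict, Optional, Tuple

MODELS = ("Grok 3", "Phi-4-mini Instruct")

# (Grok 3, Phi-4-mini Instruct) scores from the Benchmarks tables.
# None stands for a score shown as "—" (not reported); it is never filled in.
SCORES: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "MMLU": (None, 67.3),
    "MMLU-Pro": (79.9, None),
    "GPQA-Diamond": (69.3, None),
    "LiveCodeBench": (42.5, None),
    "MATH": (87.0, 64.0),
    "AIME 2024": (33.0, None),
    "AIME 2025": (58.0, None),
}


def leader(pair: Tuple[Optional[float], Optional[float]]) -> Optional[str]:
    """Return the leading model, or None when no fair comparison exists.

    Assumption: a leader needs both scores reported; a tie has no leader.
    """
    a, b = pair
    if a is None or b is None or a == b:
        return None
    return MODELS[0] if a > b else MODELS[1]


if __name__ == "__main__":
    for name, pair in SCORES.items():
        print(f"{name}: {leader(pair) or 'no leader (score missing or tied)'}")
```

Under this reading only MATH has a leader (Grok 3); every other benchmark is reported by just one model.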

Context · A

xAI's third-generation flagship, trained on the Colossus supercomputer (approximately 200,000 GPUs) with roughly 10x the compute of Grok 2. Released alongside a separate Grok 3 Reasoning variant and a DeepSearch product, with xAI claiming wins over GPT-4o on AIME math and GPQA science benchmarks. API access launched in April 2025.

Context · B

The small tier of the Phi-4 family, released in February 2025: a 3.8B-parameter dense, decoder-only model with a 128K context window, a 200K-token vocabulary, and grouped-query attention. Trained on 5T tokens over 21 days on 512 A100-80G GPUs, with a data cutoff of June 2024. Supports 22 languages.
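A back-of-the-envelope check on Phi-4-mini's stated training figures (3.8B parameters, 5T tokens, 21 days, 512 A100-80G GPUs). This sketch assumes the common 6 × N × D approximation for dense-transformer training FLOPs and NVIDIA's 312 TFLOP/s dense BF16 peak per A100; neither comes from this page. The 21 days may also cover more than the 5T-token run, so the implied utilization is a rough indicator, not a reported figure. `training_estimates` is a hypothetical helper.

```python
# Figures stated on this page for Phi-4-mini Instruct.
PARAMS = 3.8e9   # dense parameter count
TOKENS = 5e12    # pretraining tokens
DAYS = 21        # wall-clock training time
GPUS = 512       # A100-80G GPUs

# Assumptions, not from this page.
FLOPS_PER_PARAM_TOKEN = 6   # 6*N*D rule of thumb for dense training
A100_BF16_PEAK = 312e12     # dense BF16 Tensor Core peak, FLOP/s per GPU


def training_estimates(params: float, tokens: float, days: int, gpus: int,
                       peak_flops: float) -> dict:
    """Derive throughput, compute, and implied utilization from run totals."""
    seconds = days * 24 * 3600
    train_flops = FLOPS_PER_PARAM_TOKEN * params * tokens
    available_flops = gpus * peak_flops * seconds
    return {
        "gpu_hours": gpus * days * 24,
        "tokens_per_sec": tokens / seconds,
        "tokens_per_sec_per_gpu": tokens / seconds / gpus,
        "train_flops": train_flops,
        "implied_utilization": train_flops / available_flops,
    }


if __name__ == "__main__":
    est = training_estimates(PARAMS, TOKENS, DAYS, GPUS, A100_BF16_PEAK)
    print(f"GPU-hours:           {est['gpu_hours']:,}")
    print(f"Tokens/s (cluster):  {est['tokens_per_sec']:.3g}")
    print(f"Tokens/s (per GPU):  {est['tokens_per_sec_per_gpu']:.3g}")
    print(f"Training FLOPs:      {est['train_flops']:.3g}")
    print(f"Implied utilization: {est['implied_utilization']:.0%}")
```

With these inputs the run comes to 258,048 GPU-hours, roughly 2.76 million tokens per second across the cluster, about 1.14 × 10^23 training FLOPs, and an implied utilization near 39% of BF16 peak.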
