
Phi-3 Medium 4K Instruct vs GPT-4o

Numbers carry their as-of date and primary source.

Specs

Field	A: Phi-3 Medium 4K Instruct	B: GPT-4o
Released	2024-05-21	2024-05-13
Developer	Microsoft	OpenAI
Openness	Open	Proprietary
License	MIT	Proprietary
OSI-approved	yes	no
Data released	no	no
Training code	no	no
Architecture	dense	unknown
Total params	14B	—
Active params	—	—
Experts	—	—
Context window	4K	—
Attention	MHA	unknown
Position enc.	RoPE	unknown
Pretraining tokens	4.8T	—
Post-training	SFT, DPO	RLHF
Training hardware	H100	—
$/M input	—	$2.50
$/M output	—	$10.00
Output tok/sec	—	131.6
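
As a rough illustration of how the pricing and throughput rows combine, the sketch below estimates the cost and generation time of a single GPT-4o request. The prices and tokens-per-second figure are copied from the table above; the function name and the example request size are hypothetical, and the time estimate assumes output streams at the listed rate with no network or queueing overhead.

```python
# Prices and throughput copied from the Specs table (column B: GPT-4o).
USD_PER_M_INPUT = 2.50      # $ per million input tokens
USD_PER_M_OUTPUT = 10.00    # $ per million output tokens
OUTPUT_TOK_PER_SEC = 131.6  # listed output speed


def estimate_request(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (cost in USD, generation time in seconds) for one request.

    Hypothetical helper: assumes flat per-token pricing and a constant
    output rate, ignoring latency to first token.
    """
    cost = (input_tokens * USD_PER_M_INPUT
            + output_tokens * USD_PER_M_OUTPUT) / 1_000_000
    seconds = output_tokens / OUTPUT_TOK_PER_SEC
    return cost, seconds


if __name__ == "__main__":
    # Illustrative request: 3,000 prompt tokens, 500 completion tokens.
    cost, seconds = estimate_request(3_000, 500)
    print(f"cost: ${cost:.4f}, generation time: {seconds:.1f}s")
```

No equivalent estimate is possible for column A, since the table reports no hosted price or throughput for Phi-3 Medium 4K Instruct.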

Benchmarks

Missing scores are marked as not reported and are never inferred.

General reasoning

Benchmark	A: Phi-3 Medium 4K Instruct	B: GPT-4o
MMLU	78.0 (as of 2024-05-21)	—
MMLU-Pro	—	74.8 (as of 2026-05-21)
GPQA-Diamond	—	54.3 (as of 2026-05-21)

Code

Benchmark	A: Phi-3 Medium 4K Instruct	B: GPT-4o
HumanEval	62.2 (as of 2024-05-21)	—
LiveCodeBench	—	30.9 (as of 2026-05-21)

Math

Benchmark	A: Phi-3 Medium 4K Instruct	B: GPT-4o
MATH	—	75.9 (as of 2026-05-21)
AIME 2024	—	15.0 (as of 2026-05-21)
AIME 2025	—	6.0 (as of 2026-05-21)

Context · A

Microsoft's 14B follow-up to Phi-3 Mini, trained on 4.8T tokens over 42 days on 512 H100s. It scored 78 on MMLU at release, well ahead of Llama 3 8B Instruct.
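
To put the training figures in perspective, the following back-of-the-envelope sketch converts them into GPU-hours and an average per-GPU token rate. It uses only the numbers quoted above; the function name is hypothetical, and the calculation assumes all 512 GPUs ran for the full 42 days, which the source does not state.

```python
def training_budget(gpus: int, days: float, tokens: float) -> tuple[float, float]:
    """Return (total GPU-hours, average tokens per GPU per second).

    Hypothetical helper: assumes every GPU was busy for the whole run.
    """
    gpu_hours = gpus * days * 24
    tokens_per_gpu_second = tokens / (gpu_hours * 3600)
    return gpu_hours, tokens_per_gpu_second


if __name__ == "__main__":
    # Figures quoted for Phi-3 Medium: 512 H100s, 42 days, 4.8T tokens.
    gpu_hours, rate = training_budget(gpus=512, days=42, tokens=4.8e12)
    print(f"GPU-hours: {gpu_hours:,.0f}")           # 512 * 42 * 24 = 516,096
    print(f"tokens per GPU-second: {rate:,.0f}")
```

Under that assumption the run comes to about 516,000 H100-hours, or roughly 2,600 tokens per GPU per second on average.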

Context · B

Natively multimodal model that handles audio, vision, and text in a single pretrained backbone. It pushed real-time voice latency below 400 ms and served as the reference point for multimodal benchmarks through 2024.
