Models · Compare

Phi-4 vs Gemini 2.0 Flash

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field	A: Phi-4	B: Gemini 2.0 Flash
Released	2024-12-12	2024-12-11
Developer	Microsoft	Google DeepMind
Openness	Open	Proprietary
License	MIT	Proprietary
OSI-approved	yes	no
Data released	no	no
Training code	no	no
Architecture	dense	unknown
Total params	14B	—
Active params	—	—
Experts	—	—
Context window	16K	1.0M
Attention	gqa	unknown
Position enc.	rope	unknown
Pretraining tokens	9.8T	—
Post-training	sft, dpo	rlhf
Training hardware	—	—
$/M input	$0.13	$0.15
$/M output	$0.50	$0.60
Output tok/sec	30.9	0

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU	84.8 2024-12-12	—
MMLU-Pro	71.4 2026-05-21	77.9 2026-05-21
GPQA-Diamond	56.1 2024-12-12	62.3 2026-05-21

Code

HumanEval	82.6 2024-12-12	—
LiveCodeBench	23.1 2026-05-21	33.4 2026-05-21

Math

MATH	80.4 2024-12-12	93.0 2026-05-21
AIME 2024	14.3 2026-05-21	33.0 2026-05-21
AIME 2025	18.0 2026-05-21	21.7 2026-05-21

Context · A

14B parameters reaching benchmark scores within range of much larger models, via aggressive synthetic-data pretraining and data quality curation. The Phi line continues the "data quality over scale" thesis.

Context · B

Google's workhorse Gemini 2.0 checkpoint, released initially as an experimental developer preview. Google positioned it as outperforming Gemini 1.5 Pro on key benchmarks at roughly twice the speed, with native multimodal output (image generation, text-to-speech) and native tool use including Google Search and code execution. The Multimodal Live API added real-time audio and video streaming.

Phi-4 detail → · Gemini 2.0 Flash detail →