Models · Compare

Gemini 2.0 Flash vs Phi-4

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field	A: Gemini 2.0 Flash	B: Phi-4
Released	2024-12-11	2024-12-12
Developer	Google DeepMind	Microsoft
Openness	Proprietary	Open
License	Proprietary	MIT
OSI-approved	no	yes
Data released	no	no
Training code	no	no
Architecture	unknown	dense
Total params	—	14B
Active params	—	—
Experts	—	—
Context window	1.0M	16K
Attention	unknown	gqa
Position enc.	unknown	rope
Pretraining tokens	—	9.8T
Post-training	rlhf	sft, dpo
Training hardware	—	—
$/M input	$0.15	$0.13
$/M output	$0.60	$0.50
Output tok/sec	0	30.9

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU	—	84.8 2024-12-12
MMLU-Pro	77.9 2026-05-21	71.4 2026-05-21
GPQA-Diamond	62.3 2026-05-21	56.1 2024-12-12

Code

HumanEval	—	82.6 2024-12-12
LiveCodeBench	33.4 2026-05-21	23.1 2026-05-21

Math

MATH	93.0 2026-05-21	80.4 2024-12-12
AIME 2024	33.0 2026-05-21	14.3 2026-05-21
AIME 2025	21.7 2026-05-21	18.0 2026-05-21

Context · A

Google's workhorse Gemini 2.0 checkpoint, released initially as an experimental developer preview. Google positioned it as outperforming Gemini 1.5 Pro on key benchmarks at roughly twice the speed, with native multimodal output (image generation, text-to-speech) and native tool use including Google Search and code execution. The Multimodal Live API added real-time audio and video streaming.

Context · B

14B parameters reaching benchmark scores within range of much larger models, via aggressive synthetic-data pretraining and data quality curation. The Phi line continues the "data quality over scale" thesis.

Gemini 2.0 Flash detail → · Phi-4 detail →