The Open-Source AI Stack
RSS
All models

Models · Compare

Phi-4 vs Gemini 2.0 Flash

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field A: Phi-4 B: Gemini 2.0 Flash
Released 2024-12-122024-12-11
Developer MicrosoftGoogle DeepMind
Openness OpenProprietary
License MITProprietary
OSI-approved yesno
Data released nono
Training code nono
Architecture denseunknown
Total params 14B
Active params
Experts
Context window 16K1.0M
Attention gqaunknown
Position enc. ropeunknown
Pretraining tokens 9.8T
Post-training sft, dporlhf
Training hardware
$/M input $0.13$0.15
$/M output $0.50$0.60
Output tok/sec 30.90

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU 84.8 2024-12-12
MMLU-Pro 71.4 2026-05-21 77.9 2026-05-21
GPQA-Diamond 56.1 2024-12-12 62.3 2026-05-21

Code

HumanEval 82.6 2024-12-12
LiveCodeBench 23.1 2026-05-21 33.4 2026-05-21

Math

MATH 80.4 2024-12-12 93.0 2026-05-21
AIME 2024 14.3 2026-05-21 33.0 2026-05-21
AIME 2025 18.0 2026-05-21 21.7 2026-05-21

Context · A

14B parameters reaching benchmark scores within range of much larger models, via aggressive synthetic-data pretraining and data quality curation. The Phi line continues the "data quality over scale" thesis.

Context · B

Google's workhorse Gemini 2.0 checkpoint, released initially as an experimental developer preview. Google positioned it as outperforming Gemini 1.5 Pro on key benchmarks at roughly twice the speed, with native multimodal output (image generation, text-to-speech) and native tool use including Google Search and code execution. The Multimodal Live API added real-time audio and video streaming.

Phi-4 detail → · Gemini 2.0 Flash detail →