The Open-Source AI Stack
RSS
All models

Models · Compare

Gemini 2.0 Flash vs Phi-4

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field A: Gemini 2.0 Flash B: Phi-4
Released 2024-12-112024-12-12
Developer Google DeepMindMicrosoft
Openness ProprietaryOpen
License ProprietaryMIT
OSI-approved noyes
Data released nono
Training code nono
Architecture unknowndense
Total params 14B
Active params
Experts
Context window 1.0M16K
Attention unknowngqa
Position enc. unknownrope
Pretraining tokens 9.8T
Post-training rlhfsft, dpo
Training hardware
$/M input $0.15$0.13
$/M output $0.60$0.50
Output tok/sec 030.9

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU 84.8 2024-12-12
MMLU-Pro 77.9 2026-05-21 71.4 2026-05-21
GPQA-Diamond 62.3 2026-05-21 56.1 2024-12-12

Code

HumanEval 82.6 2024-12-12
LiveCodeBench 33.4 2026-05-21 23.1 2026-05-21

Math

MATH 93.0 2026-05-21 80.4 2024-12-12
AIME 2024 33.0 2026-05-21 14.3 2026-05-21
AIME 2025 21.7 2026-05-21 18.0 2026-05-21

Context · A

Google's workhorse Gemini 2.0 checkpoint, released initially as an experimental developer preview. Google positioned it as outperforming Gemini 1.5 Pro on key benchmarks at roughly twice the speed, with native multimodal output (image generation, text-to-speech) and native tool use including Google Search and code execution. The Multimodal Live API added real-time audio and video streaming.

Context · B

14B parameters reaching benchmark scores within range of much larger models, via aggressive synthetic-data pretraining and data quality curation. The Phi line continues the "data quality over scale" thesis.

Gemini 2.0 Flash detail → · Phi-4 detail →