Models · Compare

Claude 3.7 Sonnet vs Phi-4-mini Instruct

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field	A: Claude 3.7 Sonnet	B: Phi-4-mini Instruct
Released	2025-02-24	—
Developer	Anthropic	Microsoft
Openness	Proprietary	Open
License	Proprietary	MIT
OSI-approved	no	yes
Data released	no	no
Training code	no	no
Architecture	unknown	dense
Total params	—	—
Active params	—	—
Experts	—	—
Context window	—	128K
Attention	unknown	gqa
Position enc.	unknown	rope
Pretraining tokens	—	—
Post-training	rlhf, constitutional	sft, dpo
Training hardware	—	A100
$/M input	$3.00	—
$/M output	$15.00	—
Output tok/sec	0	—

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU	—	67.3 2025-02-26
MMLU-Pro	80.3 2026-05-21	—

Code

SWE-Bench Verified	70.3 2025-02-24	—
LiveCodeBench	39.4 2026-05-21	—

Math

MATH	85.0 2026-05-21	64.0 2025-02-26
AIME 2024	22.3 2026-05-21	—
AIME 2025	21.0 2026-05-21	—

Context · A

The first "hybrid reasoning" model from Anthropic: standard and extended-thinking modes selectable per request. Strong SWE-Bench score made it the default coding-agent backend through 2025.

Context · B

Small-tier Phi 4 released February 2025: 3.8B dense decoder-only with 128K context, 200K vocab, and grouped-query attention. Trained on 5T tokens for 21 days on 512 A100-80G GPUs, with a data cutoff of June 2024. Supports 22 languages.

Claude 3.7 Sonnet detail → · Phi-4-mini Instruct detail →