The Open-Source AI Stack

Claude 3.7 Sonnet vs Phi-4-mini Instruct

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

| Field | A: Claude 3.7 Sonnet | B: Phi-4-mini Instruct |
| --- | --- | --- |
| Released | 2025-02-24 | |
| Developer | Anthropic | Microsoft |
| Openness | Proprietary | Open |
| License | Proprietary | MIT |
| OSI-approved | no | yes |
| Data released | no | no |
| Training code | no | no |
| Architecture | unknown | dense |
| Total params | | |
| Active params | | |
| Experts | | |
| Context window | | 128K |
| Attention | unknown | gqa |
| Position enc. | unknown | rope |
| Pretraining tokens | | |
| Post-training | rlhf, constitutional | sft, dpo |
| Training hardware | | A100 |
| $/M input | $3.00 | |
| $/M output | $15.00 | |
Output tok/sec 0
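
As a rough illustration of how the $/M input and $/M output rows translate into a per-request cost, the sketch below applies them to a token count. It assumes both prices belong to Claude 3.7 Sonnet (the only model here with listed pricing) and ignores any caching or batch discounts. The function name and the example workload are hypothetical.

```python
# List prices taken from the Specs rows (USD per million tokens).
# Assumption: both figures refer to Claude 3.7 Sonnet.
INPUT_USD_PER_M = 3.00
OUTPUT_USD_PER_M = 15.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost of a single request in USD."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical workload: 20,000 prompt tokens and 1,000 completion tokens.
cost = request_cost_usd(20_000, 1_000)
print(f"${cost:.4f}")  # prints $0.0750
```

At these rates an output token costs five times as much as an input token, so long completions dominate the bill faster than long prompts do.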

Benchmarks

Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.

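General reasoning

| Benchmark | A: Claude 3.7 Sonnet | B: Phi-4-mini Instruct |
| --- | --- | --- |
| MMLU | not reported | 67.3 (2025-02-26) |
| MMLU-Pro | 80.3 (2026-05-21) | not reported |

Code

| Benchmark | A: Claude 3.7 Sonnet | B: Phi-4-mini Instruct |
| --- | --- | --- |
| SWE-Bench Verified | 70.3 (2025-02-24) | not reported |
| LiveCodeBench | 39.4 (2026-05-21) | not reported |

Math

| Benchmark | A: Claude 3.7 Sonnet | B: Phi-4-mini Instruct |
| --- | --- | --- |
| MATH | 85.0 (2026-05-21) | 64.0 (2025-02-26) |
| AIME 2024 | 22.3 (2026-05-21) | not reported |
| AIME 2025 | 21.0 (2026-05-21) | not reported |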
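
The "not reported; never inferred" rule and the per-benchmark leader highlight can be sketched as below. This is only an illustration, not the site's actual rendering code: the function names are hypothetical, and it assumes a leader is declared only when both models have a score for the same benchmark.

```python
from typing import Optional


def render(score: Optional[float]) -> str:
    """Render a score, or 'not reported' when it is missing (never a guess)."""
    return "not reported" if score is None else f"{score:.1f}"


def leader(score_a: Optional[float], score_b: Optional[float]) -> Optional[str]:
    """Return 'A' or 'B' for the higher score, or None if no comparison is possible.

    Assumption: a missing score on either side, or a tie, yields no leader.
    """
    if score_a is None or score_b is None or score_a == score_b:
        return None
    return "A" if score_a > score_b else "B"


# MATH is the only benchmark on this page with a score for both models.
print(render(85.0), render(64.0), leader(85.0, 64.0))
# MMLU has a score for B only, so nothing is compared or filled in.
print(render(None), render(67.3), leader(None, 67.3))
```

Under this assumption MATH is the only row where a leader can be named, since every other benchmark has a score for just one of the two models.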

Context · A

The first "hybrid reasoning" model from Anthropic: standard and extended-thinking modes are selectable per request. Its strong SWE-Bench Verified score (70.3 as of 2025-02-24) made it a default coding-agent backend through 2025.

Context · B

The small-tier Phi-4 model, released in February 2025: a 3.8B-parameter dense decoder-only transformer with a 128K context window, a 200K-token vocabulary, and grouped-query attention. It was trained on 5T tokens for 21 days on 512 A100-80G GPUs, with a data cutoff of June 2024, and supports 22 languages.
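
The training figures above support some back-of-the-envelope arithmetic. The sketch below derives GPU-hours and tokens per parameter directly from the stated numbers. The FLOPs estimate uses the common 6 × parameters × tokens rule of thumb, which is an assumption added here and not something the page reports. It also assumes all 512 GPUs ran for the full 21 days.

```python
# Figures stated in the paragraph above.
GPUS = 512
DAYS = 21
TOKENS = 5e12   # 5T training tokens
PARAMS = 3.8e9  # 3.8B parameters

# Total accelerator time, assuming every GPU ran for the whole period.
gpu_hours = GPUS * DAYS * 24

# Ratio of training tokens to model parameters.
tokens_per_param = TOKENS / PARAMS

# Rule-of-thumb training compute for a dense transformer (assumption):
# roughly 6 FLOPs per parameter per training token.
train_flops = 6 * PARAMS * TOKENS

print(f"GPU-hours:        {gpu_hours:,}")            # 258,048
print(f"Tokens per param: {tokens_per_param:,.0f}")  # 1,316
print(f"Approx. FLOPs:    {train_flops:.2e}")        # 1.14e+23
```

At roughly 1,300 tokens per parameter, the model is trained far past the roughly 20 tokens per parameter often cited as compute-optimal, which is typical of small models intended for cheap inference.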
