Cost
DeepSeek API · as of 2026-05-21
Why people cared
DeepSeek V3 reset the cost-quality frontier on December 26, 2024, and is the model that the January 2025 "DeepSeek moment" was actually about. The technical report put the pretraining run at roughly $5.6M on H800 GPUs (the variant built to comply with US export controls and available to Chinese labs, not the H100s used by US frontier labs), and the resulting checkpoint matched closed-frontier scores on MMLU, GPQA-Diamond, and HumanEval. Three architectural innovations carried the story. Multi-head Latent Attention compressed KV-cache memory by ~93%. An auxiliary-loss-free load balancing mechanism kept MoE expert utilization smooth without the convergence problems that earlier MoE work hit. Multi-token prediction during pretraining served as both a training-signal amplifier and a deployment-time speculative decoding accelerator. The economic argument that landed on Wall Street was that frontier capability had been reproduced for less than 1% of what US labs were widely reported to be spending on equivalent training runs. The model itself shipped under a custom DeepSeek License (not OSI-approved), but the technical report's level of detail set a new bar for what an open-weights frontier release should look like.
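To make the headline numbers concrete, the short sketch below works out what two of the claims above imply arithmetically. It uses only the figures quoted in the paragraph ($5.6M, "less than 1%", "~93%"); the variable names are illustrative, and the results are implications of those rounded figures rather than independently sourced data.

```python
# Figures quoted in the paragraph above (rounded as stated there).
REPORTED_PRETRAIN_COST_USD = 5.6e6  # reported pretraining cost
CLAIMED_COST_RATIO = 0.01           # "less than 1%" of US-lab spend
KV_CACHE_REDUCTION = 0.93           # "~93%" KV-cache memory reduction

# If $5.6M is under 1% of the comparison budget, that budget must
# exceed this amount. This is an implied floor, not a reported figure.
implied_min_comparison_cost = REPORTED_PRETRAIN_COST_USD / CLAIMED_COST_RATIO

# A 93% reduction leaves 7% of the original cache footprint, so the same
# memory budget holds roughly 1 / 0.07 times as many cached tokens.
kv_cache_remaining = 1.0 - KV_CACHE_REDUCTION
kv_capacity_multiplier = 1.0 / kv_cache_remaining

print(f"Implied comparison budget: more than ${implied_min_comparison_cost / 1e6:.0f}M")
print(f"KV cache remaining: {kv_cache_remaining:.0%}")
print(f"Approximate cached-token capacity multiplier: {kv_capacity_multiplier:.1f}x")
```

The capacity multiplier is an idealized upper bound: it ignores every other consumer of accelerator memory, such as weights and activations.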
Architecture
data/models.yaml. Every label is auditable against the model's sources.
Specs
- Architecture: moe
- Total params: 671B
- Active params: 37B
- Experts: 256 total · 8 active
- Context window: 128K tokens
- Attention: mla
- Position encoding: rope-yarn
- Pretraining tokens: 14.8T
- Training hardware: H800
- Post-training: sft, grpo
- OSI-approved: no
- Data released: no
- Training code: not released
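The sparsity implied by the spec list can be checked with a few lines of arithmetic. The sketch below uses only the parameter and expert counts listed above; the constant and function names are illustrative and do not come from data/models.yaml.

```python
# Figures copied from the spec list above; names are illustrative.
TOTAL_PARAMS_B = 671    # total parameters, in billions
ACTIVE_PARAMS_B = 37    # parameters active per token, in billions
EXPERTS_TOTAL = 256     # experts listed in the spec
EXPERTS_ACTIVE = 8      # experts active per token


def fraction(part: float, whole: float) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


active_param_share = fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
active_expert_share = fraction(EXPERTS_ACTIVE, EXPERTS_TOTAL)

print(f"Active parameters per token: {active_param_share:.1%}")
print(f"Experts used per token: {active_expert_share:.1%}")
```

The two shares are not expected to match, because not every parameter sits inside a routed expert: attention and embedding weights, for example, are used for every token regardless of routing.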
Benchmarks
Each score carries the date it was published; we never infer or interpolate missing scores.
Recommended use cases
- frontier-quality chat at sub-$1/M output
- long-context tasks
- code assistance
Available quantizations
Verified via the Hugging Face model tree. Community quantizations change over time; the families shown are those with published weights at audit time.
Notable innovations
- FP8 mixed-precision pretraining
- Auxiliary-loss-free load balancing
- Multi-token prediction
Known limitations
Lineage
New pretrain using MLA and MoE architectures validated in DeepSeek-V2; base for the R1 reasoning model.
Derivatives