OpenAI o1 vs Llama 3.3 70B Instruct
Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.
Specs
| Field | A: OpenAI o1 | B: Llama 3.3 70B Instruct |
|---|---|---|
| Released | — | 2024-12-06 |
| Developer | OpenAI | Meta |
| Openness | Proprietary | Source-available |
| License | Proprietary | Llama 3.3 Community License |
| OSI-approved | no | no |
| Data released | no | no |
| Training code | no | no |
| Architecture | unknown | dense |
| Total params | — | 70B |
| Active params | — | — |
| Experts | — | — |
| Context window | — | 131K |
| Attention | unknown | GQA |
| Position enc. | unknown | RoPE (Llama 3 scaling) |
| Pretraining tokens | — | 15.0T |
| Post-training | RLHF | SFT, DPO, rejection sampling |
| Training hardware | — | H100 |
| $/M input | $15.00 | — |
| $/M output | $60.00 | — |
| Output tok/sec | 75.8 | — |
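The pricing and throughput rows can be turned into a per-request estimate with simple arithmetic. The sketch below is illustrative only: the function names and the example token counts are hypothetical, the prices and speed are copied from the Specs table for model A, and it assumes billing is linear in tokens and that generation time is just output tokens divided by throughput (time to first token is ignored). For a reasoning model, hidden reasoning tokens are typically billed as output tokens, so real output counts can be much larger than the visible answer.

```python
# Values copied from the Specs table (model A: OpenAI o1).
INPUT_USD_PER_M = 15.00     # $ per million input tokens
OUTPUT_USD_PER_M = 60.00    # $ per million output tokens
OUTPUT_TOK_PER_SEC = 75.8   # output tokens per second


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, assuming linear per-token pricing."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def generation_seconds(output_tokens: int) -> float:
    """Estimated decode time, ignoring time to first token (an assumption)."""
    return output_tokens / OUTPUT_TOK_PER_SEC


# Hypothetical request: 2,000 input tokens, 1,000 output tokens.
cost = request_cost_usd(2_000, 1_000)
secs = generation_seconds(1_000)
print(f"cost: ${cost:.4f}, generation time: {secs:.1f}s")
# prints: cost: $0.0900, generation time: 13.2s
```

No equivalent estimate is possible for model B from this table, since its price and throughput depend on the hosting provider and are not reported here.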
Benchmarks
Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.
General reasoning
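| Benchmark | A: OpenAI o1 | B: Llama 3.3 70B Instruct |
|---|---|---|
| MMLU | not reported | 86.0 (as of 2024-12-06) |
| MMLU-Pro | 84.1 (as of 2026-05-21) | not reported |
| GPQA-Diamond | 77.3 (as of 2024-12-05) | 50.5 (as of 2024-12-06) |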
Code
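| Benchmark | A: OpenAI o1 | B: Llama 3.3 70B Instruct |
|---|---|---|
| HumanEval | not reported | 88.4 (as of 2024-12-06) |
| LiveCodeBench | 67.9 (as of 2026-05-21) | not reported |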
Math
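| Benchmark | A: OpenAI o1 | B: Llama 3.3 70B Instruct |
|---|---|---|
| MATH | 97.0 (as of 2026-05-21) | not reported |
| AIME 2024 | 83.3 (as of 2024-12-05) | not reported |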
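The display rule stated under Benchmarks (missing scores are shown as not reported and never inferred, and the leader is highlighted) can be sketched as follows. This is a minimal illustration, not the page's actual rendering code: the `leader` and `render` helpers are hypothetical, and the sketch assumes a leader is named only when both models report a score, which the page does not state explicitly.

```python
from typing import Optional


def leader(score_a: Optional[float], score_b: Optional[float]) -> Optional[str]:
    """Return "A", "B", or None when no leader can be named."""
    # Assumption: a row with a missing score has no leader, because
    # comparing against a missing value would amount to inferring it.
    if score_a is None or score_b is None:
        return None
    if score_a == score_b:
        return None  # tie: nobody leads
    return "A" if score_a > score_b else "B"


def render(score: Optional[float]) -> str:
    """Missing scores are displayed as text, never filled in."""
    return "not reported" if score is None else f"{score:.1f}"


# A few rows copied from the tables above, as (A, B) pairs.
rows = {
    "MMLU": (None, 86.0),
    "GPQA-Diamond": (77.3, 50.5),
    "AIME 2024": (83.3, None),
}

for name, (a, b) in rows.items():
    print(f"{name}: A={render(a)}, B={render(b)}, leader={leader(a, b)}")
```

Under that assumption, GPQA-Diamond is the only benchmark on this page where a head-to-head leader can be named, since it is the only one both models report.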
Context · A
The first publicly available frontier reasoning model. It was trained to spend extra inference compute on a "private chain of thought" before answering, which set the template the open community would chase with DeepSeek-R1.
Context · B
An incremental post-training refresh of the 70B class that approached 405B-class quality on several benchmarks without the deployment cost of a 405B model. The last dense Llama before Llama 4 went MoE.