Llama 3.1 Tülu 3 70B vs OpenAI o1
Numbers carry their as-of date and primary source.
Specs
| Field | A: Llama 3.1 Tülu 3 70B | B: OpenAI o1 |
|---|---|---|
| Released | 2024-11-21 | — |
| Developer | AI2 | OpenAI |
| Openness | Open weights | Proprietary |
| License | Llama 3.1 Community License | Proprietary |
| OSI-approved | no | no |
| Data released | yes | no |
| Training code | yes | no |
| Architecture | dense | unknown |
| Total params | 70B | — |
| Active params | — | — |
| Experts | — | — |
| Context window | — | — |
| Attention | gqa | unknown |
| Position enc. | rope-llama3 | unknown |
| Pretraining tokens | — | — |
| Post-training | sft, dpo, rlvr | rlhf |
| Training hardware | — | — |
| $/M input | — | $15.00 |
| $/M output | — | $60.00 |
| Output tok/sec | — | 75.8 |
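The pricing and throughput rows for model B can be combined into a rough per-request estimate. The sketch below is illustrative only: the function names are hypothetical, the defaults are copied from the table above, and it assumes the caller already knows how many input and output tokens are billed.

```python
# Rough per-request estimate from the table's o1 figures.
# Function names are hypothetical; defaults come from the Specs table.

def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 15.00,   # $/M input tokens (table value)
    output_price_per_m: float = 60.00,  # $/M output tokens (table value)
) -> float:
    """Return the estimated cost in USD for one request."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float = 75.8) -> float:
    """Return the estimated time to stream the output at the listed throughput."""
    return output_tokens / tokens_per_second
```

With these defaults, a request with 2,000 input tokens and 1,000 output tokens comes to about $0.09 and roughly 13 seconds of generation. Model A has no listed price or throughput, so the same estimate cannot be made for it from this table.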
Benchmarks
Missing scores render as not reported; never inferred. Bold highlights the leader per benchmark.
General reasoning
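| Benchmark | A: Llama 3.1 Tülu 3 70B | B: OpenAI o1 |
|---|---|---|
| MMLU | 83.1 (2024-11-21) | not reported |
| MMLU-Pro | not reported | 84.1 (2026-05-21) |
| GPQA-Diamond | not reported | 77.3 (2024-12-05) |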
Code
| Benchmark | A: Llama 3.1 Tülu 3 70B | B: OpenAI o1 |
|---|---|---|
| HumanEval | 92.4 (2024-11-21) | not reported |
| LiveCodeBench | not reported | 67.9 (2026-05-21) |
Math
| Benchmark | A: Llama 3.1 Tülu 3 70B | B: OpenAI o1 |
|---|---|---|
| MATH | 63.0 (2024-11-21) | 97.0 (2026-05-21) |
| AIME 2024 | not reported | 83.3 (2024-12-05) |
Held-out / arena
| Benchmark | A: Llama 3.1 Tülu 3 70B | B: OpenAI o1 |
|---|---|---|
| IFEval | 83.2 (2024-11-21) | not reported |
Context · A
AI2's flagship demonstration that the open community could match closed instruct recipes. Post-trained on top of Llama 3.1 70B with SFT, DPO, and a new RLVR (Reinforcement Learning with Verifiable Rewards) stage. Recipes, data, code, and infrastructure are all open, even though the weights carry the Llama 3.1 Community License inherited from the base model.
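The idea behind a verifiable reward can be sketched in a few lines: the reward comes from a programmatic check against a known answer, not from a learned reward model. This is a minimal illustration, not AI2's implementation. The function names, the "last number in the completion" extraction rule, and the reward value of 1.0 are all assumptions made for the example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion, or None if there is none.

    Hypothetical convention: the final number is treated as the answer.
    Thousands separators are dropped before matching.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def verifiable_reward(
    completion: str, ground_truth: str, reward_value: float = 1.0
) -> float:
    """Binary reward: reward_value if the extracted answer matches, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        matched = float(answer) == float(ground_truth)
    except ValueError:
        # Non-numeric ground truth cannot be checked by this simple verifier.
        return 0.0
    return reward_value if matched else 0.0
```

In an RL loop, a score like this would be computed for each sampled completion and passed to the policy update. A real verifier would need task-specific checks, for example constraint checking for instruction-following prompts.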
Context · B
The first publicly available frontier reasoning model. Trained to spend extra inference compute on a "private chain of thought" before answering, setting the template the open community would chase with DeepSeek-R1.