Llama 3.1 Tülu 3 70B vs OpenAI o1

Rows highlighted in warm gray are where the models differ. Numbers carry their as-of date and primary source.

Specs

Field	A: Llama 3.1 Tülu 3 70B	B: OpenAI o1
Released	2024-11-21	—
Developer	AI2	OpenAI
Openness	Open weights	Proprietary
License	Llama 3.1 Community License	Proprietary
OSI-approved	no	no
Data released	yes	no
Training code	yes	no
Architecture	dense	unknown
Total params	70B	—
Active params	—	—
Experts	—	—
Context window	—	—
Attention	gqa	unknown
Position enc.	rope-llama3	unknown
Pretraining tokens	—	—
Post-training	sft, dpo, rlvr	rlhf
Training hardware	—	—
$/M input	—	$15.00
$/M output	—	$60.00
Output tok/sec	—	75.8
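
As a rough illustration of how the pricing and throughput rows for model B combine, the sketch below estimates the cost and generation time of a single request. The three constants are copied from the table above; the function names and the example token counts are hypothetical, and the calculation assumes that every billed output token (including any hidden reasoning tokens, if those are billed as output) is counted in `output_tokens`.

```python
# Figures copied from the Specs table for OpenAI o1 (model B).
INPUT_USD_PER_M = 15.00    # $/M input tokens
OUTPUT_USD_PER_M = 60.00   # $/M output tokens
OUTPUT_TOK_PER_SEC = 75.8  # reported output throughput


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of one request at the listed per-million prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def generation_seconds(output_tokens: int) -> float:
    """Estimated time to stream the output at the listed throughput."""
    return output_tokens / OUTPUT_TOK_PER_SEC


if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens, 1,000 output tokens.
    print(f"cost: ${request_cost_usd(2_000, 1_000):.4f}")
    print(f"time: {generation_seconds(1_000):.1f} s")
```

No equivalent estimate is possible for model A from this table, because its price and throughput fields are not reported.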

Benchmarks

Missing scores are shown as — (not reported) and are never inferred. Bold highlights the leader per benchmark.

General reasoning

MMLU	83.1 2024-11-21	—
MMLU-Pro	—	84.1 2026-05-21
GPQA-Diamond	—	77.3 2024-12-05

Code

HumanEval	92.4 2024-11-21	—
LiveCodeBench	—	67.9 2026-05-21

Math

MATH	63.0 2024-11-21	97.0 2026-05-21
AIME 2024	—	83.3 2024-12-05

Held-out / arena

IFEval	83.2 2024-11-21	—

Context · A

AI2's flagship demonstration that the open community could match closed instruct recipes. Post-trained on top of Llama 3.1 70B with SFT, DPO, and the new RLVR (Reinforcement Learning with Verifiable Rewards) stage. Recipes, data, code, and infrastructure are all open, even though the weights carry the Llama 3.1 Community License inherited from the base model.

Context · B

The first publicly available frontier reasoning model. Trained to spend extra inference compute on a "private chain of thought" before answering, setting the template the open community would chase with DeepSeek-R1.
