The Open-Source AI Stack
RSS
All models

Models · nemotron

Llama-3.3-Nemotron Super 49B v1

Open weights NVIDIA · 2025-03-18 · NVIDIA Open Model License

NVIDIA's NAS-distilled Llama 3.3 70B aimed at single-data-center-GPU throughput. Uses skip-attention and variable-FFN blocks selected per layer for the quality-versus-FLOPs tradeoff. Released March 18 2025 with the Llama-Nemotron-Post-Training-Dataset-v1 (30M samples) public.

Cost

$0.00 / Mtok input
$0.00 / Mtok output

· as of 2026-05-21

source ↗

Speed

0 tok/sec output
0 ms TTFT

· as of 2026-05-21

source ↗

Architecture

tokens in Embedding vocab not disclosed · llama3 tokenizer × N layers Attention (not disclosed) RoPE context 131,072 tokens Dense MLP SwiGLU activation (standard) 49B active params Output projection tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture
dense
Total params
49B
Active params
49B
Context window
131K tokens
Attention
skip-attention
Position encoding
rope
Pretraining tokens
40B
Training hardware
H100
Post-training
sft, grpo
OSI-approved
no
Data released
yes
Training code
not released

Benchmarks

Each score carries the date it was published; we never infer or interpolate missing scores.

General reasoning

MMLU-Pro 69.8 as of 2026-05-21 source ↗
GPQA-Diamond 66.7 as of 2025-03-18 source ↗

Code

LiveCodeBench 28.0 as of 2026-05-21 source ↗

Math

MATH 96.6 as of 2025-03-18 source ↗
AIME 2024 19.3 as of 2026-05-21 source ↗
AIME 2025 58.4 as of 2025-03-18 source ↗

Held-out / arena

IFEval 89.2 as of 2025-03-18 source ↗

Available quantizations

GGUF llama.cpp's container; the common local format, k-quants from Q2 to Q8. runs on llama.cpp, Ollama
MLX Apple MLX 4/8-bit layout for Apple silicon. runs on Apple MLX
FP8 8-bit float, frequently a native release on Hopper / Blackwell GPUs. runs on vLLM, SGLang, TensorRT-LLM

Verified via the Hugging Face model tree ↗. Community quantizations change over time; the families shown are those with published weights at audit time.

Notable innovations

  • · Neural Architecture Search distillation from Llama 3.3 70B
  • · Skip-attention and variable-FFN blocks
  • · Toggleable reasoning mode
  • · 30M-sample post-training dataset released

Lineage

NAS-distilled Llama 3.3 70B for single-node deployment.

Sources