The Open-Source AI Stack

Nemotron 3 Nano

Open weights · NVIDIA · 2025-12-15 · NVIDIA Open Model License

The first member of the Nemotron 3 family: a hybrid Mamba-Transformer MoE with 31.6B total and 3.2B active parameters. It has 52 layers (23 MoE, 23 Mamba-2, 6 GQA) and 128 routed experts plus 1 shared expert, with 6 activated per token. NVIDIA reports 3.3x the throughput of Qwen3-30B-A3B on a single H200 at 8K input / 16K output, with 1M-context support. Super and Ultra variants are planned for H1 2026.
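The figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted above; the variable names are illustrative and do not come from any NVIDIA tooling.

```python
# Figures quoted in the summary above (parameter counts in billions).
total_params_b = 31.6
active_params_b = 3.2
layers = {"moe": 23, "mamba2": 23, "gqa": 6}
routed_experts = 128
active_experts = 6

# The three layer types should add up to the stated depth of 52.
total_layers = sum(layers.values())

# Share of parameters used per token, and activated experts as a
# share of the routed pool (the shared expert is not counted here).
active_param_fraction = active_params_b / total_params_b
active_expert_fraction = active_experts / routed_experts

print(f"layers: {total_layers}")
print(f"active parameters per token: {active_param_fraction:.1%}")
print(f"experts activated per token: {active_expert_fraction:.1%}")
```

On these numbers, roughly a tenth of the weights participate in any single token, which is the sparsity that the throughput claim relies on.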

Architecture

Diagram: tokens in → Embedding (vocab not disclosed) → × 52 layers [Attention (not disclosed) · Position encoding (not disclosed) · context 1,000,000 tokens · Dense MLP, SwiGLU activation (standard) · 3.5B active params] → Output projection → tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture: hybrid-mamba-transformer-moe
Total params: 30B
Active params: 3.5B
Experts: 128 total · 6 active
Context window: 1.0M tokens
Attention: hybrid-mamba2-gqa
Position encoding: unknown
Post-training: sft, rlhf
OSI-approved: no
Data released: no
Training code: not released
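To make the "128 total · 6 active" entry concrete, here is a minimal sketch of generic top-k expert routing in pure Python. It is an illustration under assumptions: the source does not describe Nemotron 3 Nano's router, so the softmax gating, the renormalization over the selected experts, and the names `route_token` and `logits` are hypothetical. The shared expert mentioned in the summary is not modeled, since a shared expert is applied to every token without selection.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # from the spec list
TOP_K = 6                 # experts activated per token


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and return {expert_id: weight}.

    Generic top-k softmax gating; not confirmed to match NVIDIA's router.
    """
    # Indices of the k largest logits.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax restricted to the selected experts (subtract max for stability).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


# Random scores stand in for a learned router's output (assumption).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
weights = route_token(logits)
print(sorted(weights))
```

Only the selected experts run their feed-forward pass for this token, and their outputs would be combined using the returned weights.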

Available quantizations

GGUF: llama.cpp's container; the common local format, k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
GPTQ: Post-training 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang, Transformers.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.

Verified via the Hugging Face model tree. Community quantizations change over time; the families shown are those with published weights at audit time.
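As a rough guide to what these formats mean for memory, the sketch below estimates weight storage from the parameter count and the bits per weight. It assumes the 31.6B total from the summary, treats the bit widths as nominal, and ignores quantization metadata (scales, zero points), KV cache, and activations, so real file sizes will differ. The 16-bit row is an assumed unquantized baseline, not a format listed above.

```python
TOTAL_PARAMS = 31.6e9  # total parameters quoted in the summary


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Nominal bit widths only; actual formats add per-group metadata.
nominal_bits = {
    "16-bit baseline (assumed)": 16,
    "FP8": 8,
    "4-bit (AWQ / GPTQ)": 4,
}

for name, bits in nominal_bits.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Halving the bit width halves the estimate, which is why the 4-bit formats are the usual choice when the full-precision weights do not fit on the target device.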

Notable innovations

  • Hybrid Mamba-2 + Transformer + MoE in one stack
  • 1M context with single-H200 throughput claim
  • 23:23:6 layer split (MoE / Mamba-2 / GQA)

Lineage

First Nemotron 3; hybrid Mamba-Transformer-MoE stack for agentic workloads.

Sources