The Open-Source AI Stack

Granite 4.0 Small

Open IBM · 2025-10-02 · Apache-2.0

IBM's first hybrid Mamba/Transformer release, published October 2, 2025. Granite-4.0-H-Small is a 32B MoE that activates 9B parameters per token, with Mamba-2 and conventional transformer blocks layered at a 9:1 ratio. It shipped alongside H-Tiny (7B total / 1B active) and H-Micro (3B dense), each in base and instruct variants, all under Apache 2.0.
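To make the 9:1 ratio concrete, the sketch below lays out a hypothetical stack in which every tenth block is a transformer block and the rest are Mamba-2. The depth of 40 and the exact position of the transformer block within each group are assumptions for illustration only; this page does not state the layer count or the placement. `layer_pattern` is a made-up helper name.

```python
from collections import Counter


def layer_pattern(n_layers: int, mamba_per_transformer: int = 9) -> list[str]:
    """Return block types with `mamba_per_transformer` Mamba-2 blocks per transformer block.

    Assumption: blocks repeat in groups of (9 Mamba-2, then 1 transformer).
    The real ordering inside each group is not given in the source.
    """
    group = mamba_per_transformer + 1
    return [
        "transformer" if (i + 1) % group == 0 else "mamba2"
        for i in range(n_layers)
    ]


# Hypothetical depth, chosen only because it divides evenly into groups of 10.
pattern = layer_pattern(40)
counts = Counter(pattern)
print(counts["mamba2"], counts["transformer"])
```

Under these assumptions, any depth that is a multiple of 10 yields exactly nine Mamba-2 blocks for each transformer block.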

Architecture

Diagram: tokens in → Embedding (vocab not disclosed) → × N layers [Attention (not disclosed) · Position encoding: not disclosed · context: 131,072 tokens · Dense MLP, SwiGLU activation (standard), 9B active params] → Output projection → tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture: hybrid-mamba-transformer-moe
Total params: 32B
Active params: 9B
Context window: 131K tokens
Attention: hybrid-mamba2-transformer
Position encoding: unknown
Post-training: sft, rlhf
OSI-approved: yes
Data released: no
Training code: not released
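As a quick check on the headline figures, the snippet below works out what share of the weights is active per token and confirms that the 131,072-token context shown in the architecture diagram is a power of two. It uses the rounded 32B and 9B values from this page, so the result is approximate; the constant names are illustrative.

```python
TOTAL_PARAMS = 32e9       # "Total params" as listed (rounded)
ACTIVE_PARAMS = 9e9       # "Active params" as listed (rounded)
CONTEXT_TOKENS = 131_072  # context length from the architecture diagram

# Share of the model's weights used for any single token in the MoE.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# For an exact power of two, bit_length() - 1 is the exponent.
context_exponent = CONTEXT_TOKENS.bit_length() - 1

print(f"active fraction: {active_fraction:.1%}")
print(f"context window: 2**{context_exponent} tokens")
```

Roughly 28% of the parameters participate in each forward step, which is why per-token compute is closer to that of a 9B model even though all 32B weights must be stored.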

Available quantizations

GGUF: llama.cpp's container format; the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.
bitsandbytes: On-the-fly NF4 / INT8 weight quantization inside Transformers. Runs on Transformers.

Verified via the Hugging Face model tree. Community quantizations change over time; the families shown are those with published weights at audit time.
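For a rough sense of what these formats mean in practice, the sketch below estimates weight storage for a 32B-parameter model at a few nominal bit-widths. It assumes all experts stay resident in memory and ignores quantization scales, layers kept in higher precision, activations, and KV-cache or Mamba state, so real files and memory use will differ. The bit-widths are nominal values chosen for illustration (GGUF k-quants, for instance, have fractional effective bits per weight and are not modeled), and `weight_gb` is a hypothetical helper.

```python
TOTAL_PARAMS = 32e9  # total parameter count listed on this page (rounded)

# Nominal bits per weight; illustrative assumptions, not measured file sizes.
NOMINAL_BITS = {
    "BF16 (unquantized baseline)": 16,
    "FP8": 8,
    "INT8 (bitsandbytes)": 8,
    "4-bit (AWQ, NF4, MLX 4-bit)": 4,
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


estimates = {name: weight_gb(TOTAL_PARAMS, bits) for name, bits in NOMINAL_BITS.items()}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```

Halving the bit-width halves the weight footprint in this model; anything beyond that (runtime overhead, cache size, throughput) depends on the serving stack and is outside what this estimate covers.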

Notable innovations

  • First IBM hybrid Mamba-2 / Transformer stack
  • 9:1 Mamba-to-Transformer layer ratio
  • Single-H100 throughput for the 32B MoE, per IBM

Lineage

First IBM hybrid Mamba/Transformer model; flagship of the Granite 4.0 family. Architectural break from the conventional-transformer Granite 3 series.

Sources