The Open-Source AI Stack

Gemma 4 26B-A4B

Open weights · Google DeepMind · 2026-04-02 · Apache-2.0

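The sparse mixture-of-experts sibling in the Gemma 4 family: 26B total parameters, but only about 3.8B activated per token. Single-stream decode is memory-bandwidth-bound, so the cost of each token scales with the weights actually read, and the small active count lets it decode far faster than the dense 31B at the same quantization. The trade-off is that the memory footprint stays large, because all experts must be resident even though only a few run per token. For anyone with enough memory to hold the full 26B, that combination makes it a strong general-purpose local model.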
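The bandwidth argument can be made concrete with a rough ceiling calculation. This is a minimal sketch, not a benchmark: it assumes decode speed is limited only by reading the active weights once per token, and it ignores KV-cache reads, router overhead, and compute. The bandwidth figure and the bits-per-weight value are hypothetical placeholders; only the 3.8B and 31B parameter counts come from this page.

```python
def decode_tokens_per_second(active_params: float,
                             bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate.

    Each generated token requires reading every active weight once, so the
    ceiling is memory bandwidth divided by the bytes of active weights.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical hardware and quantization, chosen only for illustration.
BANDWIDTH_GB_S = 400.0   # assumed memory bandwidth in GB/s
BITS_PER_WEIGHT = 4.5    # assumed average bits per weight for a 4-bit-class quant

# Parameter counts taken from this page.
moe = decode_tokens_per_second(3.8e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
dense = decode_tokens_per_second(31e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"MoE ceiling:   {moe:.0f} tok/s")
print(f"Dense ceiling: {dense:.0f} tok/s")
print(f"Ratio:         {moe / dense:.1f}x")
```

Under this simplified model the ratio depends only on the active parameter counts (31B / 3.8B), not on the assumed bandwidth or quantization. Real-world speedups will be smaller once attention, KV-cache traffic, and any always-active shared weights are counted.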

Architecture

tokens in → Embedding (vocab not disclosed) → × 48 layers [ Grouped-Query Attention (RoPE, context 262,144 tokens) → MoE Router (? experts total · ? active per token) ] → Output projection → tokens out
Schema-generated from data/models.yaml. Every label is auditable against the model's sources.

Specs

Architecture: moe
Total params: 26B
Active params: 3.8B
Context window: 262K tokens
Attention: gqa
Position encoding: rope
Post-training: sft, rlhf
OSI-approved: yes
Data released: no
Training code: not released

Available quantizations

GGUF: llama.cpp's container; the common local format, with k-quants from Q2 to Q8. Runs on llama.cpp, Ollama.
AWQ: Activation-aware 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang.
GPTQ: Post-training 4-bit weight quantization for GPU serving. Runs on vLLM, SGLang, Transformers.
EXL2: ExLlamaV2's variable-bitrate format for consumer GPUs. Runs on ExLlamaV2.
MLX: Apple MLX 4/8-bit layout for Apple silicon. Runs on Apple MLX.
FP8: 8-bit float, frequently a native release on Hopper / Blackwell GPUs. Runs on vLLM, SGLang, TensorRT-LLM.

Verified via the Hugging Face model tree. Community quantizations change over time; the families shown are those with published weights at audit time.
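Because all experts stay resident, the quantization choice is mostly a question of whether 26B weights fit in memory. The sketch below estimates weight storage only, assuming a uniform nominal bit-width per weight. Real formats differ: GGUF k-quants and EXL2 mix bit-widths, and every format adds scales and metadata, so treat the labels and bit-widths here as illustrative assumptions rather than measured file sizes. KV cache and activations are not included.

```python
def weight_footprint_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a uniform bit-width."""
    return total_params * bits_per_weight / 8 / 2**30


TOTAL_PARAMS = 26e9  # total parameter count from this page

# Nominal bit-widths, assumed for illustration; not tied to a specific file.
NOMINAL_BITS = {"4-bit": 4, "8-bit": 8, "16-bit": 16}

for label, bits in NOMINAL_BITS.items():
    print(f"{label:>6}: {weight_footprint_gib(TOTAL_PARAMS, bits):.1f} GiB")
```

Note that the estimate uses the total 26B, not the 3.8B active count: sparsity reduces the weights read per token, not the weights that must be loaded.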

Notable innovations

  • Sparse MoE with about 3.8B active parameters per token
  • Sliding-window attention keeps the KV cache small
  • Apache 2.0 license
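To see why a sliding window matters at a 262,144-token context, the KV cache size can be sketched as follows. The layer count and context length come from this page; the number of KV heads, the head dimension, and the window size are not disclosed here, so the values below are hypothetical placeholders. The sketch also assumes, as a simplification, that every layer uses the window; models that mix windowed and full-attention layers would land somewhere between the two results.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   cached_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence.

    The leading factor of 2 covers one key vector and one value vector
    per token, per KV head, per layer.
    """
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_value


LAYERS = 48          # from this page
CONTEXT = 262_144    # from this page

# Hypothetical values: not disclosed on this page.
KV_HEADS = 8
HEAD_DIM = 128
WINDOW = 4_096

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
windowed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, min(CONTEXT, WINDOW))

print(f"Full-context cache: {full / 2**30:.2f} GiB")
print(f"Windowed cache:     {windowed / 2**30:.2f} GiB")
```

With a window, the cached token count stops growing once the sequence exceeds the window length, so the saving relative to full attention is the ratio of context length to window size.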

Lineage

Sparse-MoE sibling of the dense Gemma 4 31B; 3.8B active per token.

Sources