The 10:1 active parameter ratio is the real story here. Sparse MoE at this scale changes the economics of local inference.
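To make "changes the economics" concrete, here's a back-of-envelope sketch; the quantization width and FLOP-per-parameter estimate are assumptions, not measured figures for this model:

    # Rough economics: storage scales with total params, per-token compute with active params.
    # Assumptions: ~4-bit weights, ~2 FLOPs per active parameter per token.
    TOTAL_PARAMS    = 120e9
    ACTIVE_PARAMS   = 12e9
    BYTES_PER_PARAM = 0.5

    weight_storage_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9   # must live *somewhere*: ~60 GB
    per_token_gflops  = 2 * ACTIVE_PARAMS / 1e9                # ~24 GFLOPs/token, same as a dense 12B
    print(f"~{weight_storage_gb:.0f} GB of weights, ~{per_token_gflops:.0f} GFLOPs per token")
    # A dense 120B needs the same storage but ~10x the compute; a dense 12B needs ~6 GB and the same compute.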
Three implications that get missed:
Cold starts across experts. Routing overhead amortizes over a long generation, but the first few tokens of a prompt hit experts that haven't been loaded yet. The 12B active parameters make each token cheap to compute, but expert cold starts create latency variance that steady-state benchmarks hide.
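A toy way to see that variance, with everything illustrative (expert count, cache size, and latencies are assumptions, not this model's actual configuration): skewed routing over an LRU cache of resident experts, with a fixed cost per expert that has to be paged in.

    import numpy as np
    from collections import OrderedDict

    N_EXPERTS, TOP_K, CACHE_SLOTS = 128, 4, 32   # illustrative, not the model's real config
    LOAD_MS, COMPUTE_MS = 8.0, 3.0               # assumed per-expert transfer cost / per-token compute

    rng = np.random.default_rng(0)
    p = 1.0 / np.arange(1, N_EXPERTS + 1)        # skewed routing: some experts are hit far more often
    p /= p.sum()

    cache = OrderedDict()                        # LRU set of experts resident in VRAM
    latencies = []
    for token in range(256):
        experts = rng.choice(N_EXPERTS, size=TOP_K, replace=False, p=p)
        misses = 0
        for e in experts:
            if e in cache:
                cache.move_to_end(e)             # refresh LRU position
            else:
                misses += 1                      # pay a load for an expert not yet in VRAM
                cache[e] = True
                if len(cache) > CACHE_SLOTS:
                    cache.popitem(last=False)    # evict least-recently-used expert
        latencies.append(COMPUTE_MS + misses * LOAD_MS)

    print("first 8 tokens :", [f"{t:.0f}ms" for t in latencies[:8]])
    print("steady state   :", f"{np.mean(latencies[64:]):.1f}ms avg")

With these numbers the first tokens each pay several expert loads; once the hot experts are resident, most tokens pay only compute.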
Memory fragmentation. 120B parameters don't fit in 12GB of VRAM even heavily quantized, so most experts have to live in system RAM and get paged in on demand. Expert swapping patterns matter more than peak memory: you can run this locally, but the working set determines whether you're compute-bound or transfer-bound.
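A quick way to sanity-check which side of that line you're on; the bandwidth and throughput numbers are assumed placeholders, substitute your own hardware:

    # Back-of-envelope: compute-bound or transfer-bound?
    # All numbers are assumptions, not measured figures for this model or any specific GPU.
    ACTIVE_PARAMS   = 12e9      # active parameters per token
    BYTES_PER_PARAM = 0.5       # ~4-bit quantization
    PCIE_GBPS       = 16        # effective host-to-GPU bandwidth, assumed
    GPU_TFLOPS      = 40        # sustained throughput, assumed

    compute_s = 2 * ACTIVE_PARAMS / (GPU_TFLOPS * 1e12)     # ~2 FLOPs per active param per token
    active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9       # ~6 GB of weights touched per token

    for hit_rate in (0.95, 0.80, 0.50):                     # fraction of touched weights already in VRAM
        transfer_s = active_gb * (1 - hit_rate) / PCIE_GBPS
        bound = "transfer" if transfer_s > compute_s else "compute"
        print(f"hit rate {hit_rate:.0%}: compute {compute_s*1e3:.1f} ms, "
              f"transfer {transfer_s*1e3:.1f} ms -> {bound}-bound")

With these assumptions, even a few percent of the active weights missing from VRAM makes transfer, not compute, the bottleneck; that's the sense in which the working set, not peak memory, decides the experience.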
Deployment surface area. A dense 12B model has essentially one optimization path. A 120B MoE with 12B active adds routing decisions, expert selection thresholds, and load balancing as tuning knobs. More levers, more ways to shoot yourself in the foot.
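As a sketch of what that surface looks like, here's a hypothetical serving config; the field names are made up for illustration, not any particular framework's API:

    from dataclasses import dataclass

    @dataclass
    class MoEServingConfig:                   # hypothetical knobs, not a real framework's API
        # Routing
        top_k: int = 4                        # experts consulted per token
        router_score_threshold: float = 0.0   # skip experts gated below this score; saves transfers, may cost quality
        # Expert placement / memory
        vram_expert_slots: int = 32           # experts kept resident in VRAM
        offload_target: str = "cpu"           # where evicted experts live: "cpu" or "disk"
        eviction_policy: str = "lru"          # or frequency-based, or pin known-hot experts
        # Load balancing (matters once you batch or shard across GPUs)
        capacity_factor: float = 1.25         # max tokens per expert relative to an even split
        drop_overflow_tokens: bool = False

    # A dense 12B, by comparison, mostly gives you quantization level and batch size.

The knobs also interact: a tighter router threshold cuts transfers but shifts the load balance, which is exactly the "more ways to shoot yourself" part.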
The "runs on 12GB VRAM" claim is technically true but operationally incomplete. It runs, but the experience depends on whether your workload keeps a small, stable set of experts hot, which comes down to how well your use pattern matches the expert distribution the model was trained on.