What Happened

A developer running an NVIDIA DGX Spark GB10 posted to r/LocalLLaMA asking for working vLLM configurations for Google's Gemma 4 26B-A4B model. The Intel INT4 quantization of the 31B variant loaded successfully, but inference was unacceptably slow. No confirmed working setup for the 26B variant had been shared at the time of writing.

Why It Matters

Gemma 4 26B-A4B is a Mixture-of-Experts model that activates only 4B parameters per token, making it theoretically efficient for local deployment. However, vLLM's MoE support and quantization compatibility are still maturing, and hardware like the DGX Spark GB10 (Grace Blackwell, 128GB unified memory) does not always map cleanly to community-tested configs.
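
A rough roofline makes the efficiency claim concrete: at batch size 1, decode throughput on a bandwidth-bound machine is capped by how fast the ~4B active parameters can be streamed from memory for each token. The numbers below are illustrative assumptions, not measurements: INT4 is treated as 0.5 bytes per parameter (ignoring scales and KV-cache traffic), and ~273 GB/s is the commonly cited unified-memory bandwidth for the GB10.

    # Back-of-the-envelope decode ceiling for a bandwidth-bound MoE.
    # All figures are assumptions for illustration, not measurements.
    active_params = 4e9      # ~4B parameters activated per token (the "A4B" part)
    bytes_per_param = 0.5    # INT4 weights ~= 0.5 bytes/param, ignoring scales
    bandwidth = 273e9        # assumed GB10 unified-memory bandwidth, bytes/s

    bytes_per_token = active_params * bytes_per_param
    ceiling = bandwidth / bytes_per_token
    print(f"~{ceiling:.0f} tok/s upper bound at batch size 1")
    # ~137 tok/s in the best case

Anything the stack delivers far below that ceiling points at vLLM configuration or kernel support rather than the hardware itself.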

  • INT4 quants reduce memory footprint, but on unified-memory architectures like the GB10, decode can still bottleneck on shared memory bandwidth rather than capacity.
  • vLLM's --quantization flag behavior differs between AWQ, GPTQ, and Intel Neural Compressor formats (see the sketch after this list).
  • The 26B-A4B checkpoint requires correct MoE routing support in vLLM 0.4.x or later.
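
On the flag-mapping point above, here is a minimal sketch using vLLM's Python API instead of the CLI. The repo id is taken from the serve command in the action item below, and the AWQ format is an assumption; whatever value you pass to quantization has to correspond to the format the checkpoint was actually exported in (AWQ, GPTQ, or an Intel Neural Compressor INT4 format).

    # Minimal sketch: load the checkpoint with an explicit quantization format
    # and run one prompt to confirm the MoE weights load and route at all.
    # Repo id and "awq" are assumptions; match them to your actual checkpoint.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="google/gemma-4-26b-a4b",  # assumed repo id, as in the serve command
        quantization="awq",              # must match the checkpoint's export format
        max_model_len=8192,
        tensor_parallel_size=1,          # single GB10, no tensor parallelism
    )

    params = SamplingParams(max_tokens=64, temperature=0.0)
    out = llm.generate(["Explain mixture-of-experts routing in one sentence."], params)
    print(out[0].outputs[0].text)

If this loads but generation crawls, the problem is more likely quantized-MoE kernel support than the serving layer, which matches what the original poster reported for the INT4 quant.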

Asia-Pacific Angle

Chinese and Southeast Asian developers frequently deploy open-weight models on cost-effective local hardware rather than cloud APIs, making quantization performance critical. Gemma 4's permissive license allows commercial use, which matters for indie SaaS products targeting regional markets. Developers using Alibaba Cloud or Tencent Cloud GPU instances (A10, A100) should note that AWQ quantization typically outperforms GPTQ on these SKUs for MoE models. The Qwen team's public benchmarks on MoE quantization are a useful reference point for tuning similar architectures.

Action Item This Week

Try loading Gemma 4 26B-A4B with:

    vllm serve google/gemma-4-26b-a4b \
      --quantization awq \
      --max-model-len 8192 \
      --tensor-parallel-size 1

Compare throughput against the INT4 variant, and report tokens/sec at batch size 1 to isolate the memory bandwidth bottleneck.
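
For the throughput comparison, a single-request benchmark against the OpenAI-compatible endpoint that vllm serve exposes is enough. A minimal sketch follows; the port (8000), endpoint path, and model name are assumptions that should match your actual server flags. It measures end-to-end time including prefill, which is adequate for comparing AWQ against INT4 at batch size 1.

    # Minimal sketch: measure single-request throughput against a local
    # `vllm serve` instance. Port, path, and model name are assumptions.
    import time
    import requests

    URL = "http://localhost:8000/v1/completions"
    payload = {
        "model": "google/gemma-4-26b-a4b",  # must match the served model name
        "prompt": "Summarize the trade-offs of INT4 quantization for MoE models.",
        "max_tokens": 256,
        "temperature": 0.0,
    }

    start = time.time()
    resp = requests.post(URL, json=payload, timeout=600)
    elapsed = time.time() - start
    resp.raise_for_status()

    tokens = resp.json()["usage"]["completion_tokens"]
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s (batch size 1)")

Run it once against the AWQ server and once against the INT4 one, and compare the tok/s figures directly.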