What Happened

A developer spent a week getting Google's Gemma 4 (E2B variant) running on an RTX 3090 with both unquantized BF16 and Q4_K_M GGUF quantized inference via CUDA. Benchmarks show Q4_K_M GGUF outperforming BF16 on throughput: 170 tok/s vs 110 tok/s for short generation (1 prompt token, 32 generated), and 93 tok/s vs 72 tok/s for long generation (512 prompt tokens, 128 generated).

Why It Matters

Gemma 4 uses QK-norm with attention_scale=1.0 in place of the standard 1/sqrt(d_k) scaling (a minimal sketch of the difference follows the list below), which makes it approximately 22x more sensitive to floating-point precision errors than LLaMA or Qwen architectures. This is not documented prominently, and it causes three silent failure modes:

  • F16 KV cache causes output degeneration after roughly 50 tokens due to compounding precision loss
  • Fused attention kernels produce token divergence after approximately 4 decode steps
  • Flash Attention v1 with head_dim=512 returns all-zero logits due to a kernel bug
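
For context, here is a minimal PyTorch sketch of the two scoring schemes. It assumes QK-norm means per-head RMSNorm applied to queries and keys (learned scale factors omitted); it illustrates the missing 1/sqrt(d_k) factor, not the author's CUDA kernel.

    import torch

    def rms_norm(x, eps=1e-6):
        # Per-head RMSNorm over the last (head_dim) axis; learned scale omitted.
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

    def qk_norm_scores(q, k):
        # QK-norm style: normalize q and k, then score with
        # attention_scale = 1.0 (no 1/sqrt(d_k) factor).
        return rms_norm(q) @ rms_norm(k).transpose(-2, -1)

    def standard_scores(q, k):
        # LLaMA/Qwen-style scaled dot-product scores.
        return (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)

    # Illustrative shapes: (batch, heads, seq, head_dim)
    q = torch.randn(1, 8, 4, 256)
    k = torch.randn(1, 8, 4, 256)
    print(qk_norm_scores(q, k).shape, standard_scores(q, k).shape)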

The working rule: match the KV cache dtype to the model weight dtype and do the internal attention math in F32. A BF16 model requires a BF16 KV cache; an F32 GGUF requires an F32 KV cache. Mixing dtypes at the KV cache boundary is the root cause of the failures. Output was verified token-for-token against HuggingFace fixtures for the first 30 tokens.
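
A minimal single-step decode sketch of that rule in PyTorch: the cache is allocated in the model's weight dtype, and the attention math itself is upcast to F32. Shapes and names are illustrative, not the author's CUDA implementation.

    import torch

    def decode_step(q, k_cache, v_cache):
        # q: (heads, head_dim) for the new token; caches: (seq, heads, head_dim).
        # Cache dtype matches the weight dtype (BF16 weights -> BF16 cache,
        # F32 GGUF -> F32 cache); the math itself runs in F32.
        q32, k32, v32 = q.float(), k_cache.float(), v_cache.float()
        scores = torch.einsum("hd,shd->hs", q32, k32)   # attention_scale = 1.0
        probs = torch.softmax(scores, dim=-1)
        out = torch.einsum("hs,shd->hd", probs, v32)
        return out.to(q.dtype)                          # cast back at the boundary

    weight_dtype = torch.bfloat16                            # match your checkpoint
    k_cache = torch.zeros(128, 8, 256, dtype=weight_dtype)   # illustrative shapes
    v_cache = torch.zeros(128, 8, 256, dtype=weight_dtype)
    q = torch.randn(8, 256, dtype=weight_dtype)
    print(decode_step(q, k_cache, v_cache).dtype)            # torch.bfloat16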

Asia-Pacific Angle

Chinese and Southeast Asian developers building local inference pipelines often use llama.cpp or custom CUDA backends optimized for Qwen or LLaMA architectures. Gemma 4's hybrid attention (sliding window local plus full global with head_dim=512) and dual RoPE configurations mean those existing kernels will fail silently rather than throw errors. The KV cache sharing across the last N layers saves approximately 57% KV memory, which is significant for consumer GPU deployments common in cost-sensitive Asia-Pacific markets where A100 access is limited. Developers using Alibaba Cloud or Tencent Cloud GPU instances with RTX-class cards should validate output against HuggingFace fixtures before deploying Gemma 4 in production.
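
A back-of-the-envelope sketch of why sharing one KV cache across the last N layers cuts memory. All numbers below are made up purely to show the arithmetic; they are not Gemma 4's actual layer count, KV head count, or sharing configuration.

    def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes, shared_last_n=0):
        # Layers that share a single KV cache contribute one copy instead of N.
        unique_layers = n_layers - max(shared_last_n - 1, 0)
        return unique_layers * 2 * n_kv_heads * head_dim * dtype_bytes  # 2 = K and V

    # Hypothetical configuration, for illustration only.
    baseline = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=256, dtype_bytes=2)
    shared = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=256, dtype_bytes=2,
                                shared_last_n=16)
    print(f"KV memory saving with these made-up numbers: {1 - shared / baseline:.0%}")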

Action Item This Week

If you are running Gemma 4 locally, add a 30-token output comparison test against HuggingFace Transformers fixtures before enabling any dtype optimization. Disable Flash Attention v1 explicitly and set KV cache dtype to match your model weight dtype to avoid silent degeneration.
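
A sketch of that check using HuggingFace Transformers as the reference, with greedy decoding to keep the comparison deterministic. The model ID and the run_local_backend call are placeholders for your own checkpoint and inference pipeline.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "google/..."  # placeholder: the Gemma 4 checkpoint you deploy

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    ref_model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation="eager",   # plain eager attention as the reference path
    ).to("cuda")

    prompt = "The quick brown fox"
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    ref_ids = ref_model.generate(**inputs, max_new_tokens=30, do_sample=False)
    ref_new = ref_ids[0, inputs["input_ids"].shape[1]:].tolist()

    # run_local_backend is a placeholder for your own CUDA inference pipeline.
    local_new = run_local_backend(prompt, max_new_tokens=30)
    assert local_new == ref_new, f"token divergence: {local_new} vs {ref_new}"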