What Happened
A developer running an RTX 4090 with 64GB DDR5 RAM tested three quantized models for agentic coding workflows using llama.cpp's Q4_K_M k-quant format: GLM-4.7 Flash (30B), Nemotron-3 (30B), and Qwen3-Coder-Next (80B). Despite expectations, the 80B Qwen3-Coder-Next produced frequent low-level errors that required manual intervention, while the two 30B models delivered more reliable behavior and steadier throughput for sustained agentic loops at full context window.
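A setup like this can be reproduced by serving each GGUF with llama-server and pointing an agent harness at its OpenAI-compatible endpoint. Here is a minimal sketch; the model path is a placeholder and the flag values are illustrative defaults, not the developer's exact settings.

```python
# Minimal sketch: launch llama-server for one quantized model and wait
# until it is ready. The model filename below is a hypothetical local path.
import subprocess
import time
import urllib.request

MODEL = "GLM-4.7-Flash-Q4_K_M.gguf"  # placeholder path
PORT = 8080

server = subprocess.Popen([
    "llama-server",
    "-m", MODEL,
    "-ngl", "99",       # offload as many layers as fit on the 4090
    "-c", "32768",      # context length; tune against your VRAM headroom
    "--port", str(PORT),
])

# Poll the health endpoint until the model finishes loading.
for _ in range(120):
    try:
        urllib.request.urlopen(f"http://127.0.0.1:{PORT}/health", timeout=2)
        break
    except OSError:
        time.sleep(1)
```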
Why It Matters
Agentic coding differs from single-turn completion: the model runs in a loop, calls tools, reads file diffs, and self-corrects. This punishes models with inconsistent instruction following, even when their benchmark scores look strong. For indie devs and small teams running local inference, a 30B Q4_K_M model that fits cleanly in 24GB VRAM and holds a steady tokens-per-second rate is often more productive than a larger model that hallucinates tool calls. The 4090's 24GB VRAM is a hard ceiling: 30B Q4_K_M sits around 18-20GB, leaving headroom for the KV cache at long contexts.
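As a back-of-envelope check on those numbers: quantized model size is roughly parameter count times effective bits per weight, and Q4_K_M lands around 4.85 bits per weight in practice (an approximation, not an official figure).

```python
# Rough estimate of GGUF weight size for a Q4_K_M quant.
# 4.85 bits/weight is an approximate effective rate, not an exact spec.
def quant_size_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"30B @ Q4_K_M ~ {quant_size_gb(30e9):.1f} GB")  # ~18.2 GB: fits in 24GB with KV headroom
print(f"80B @ Q4_K_M ~ {quant_size_gb(80e9):.1f} GB")  # ~48.5 GB: must spill into system RAM
```

The second line is why the 80B model on this box cannot run fully on-GPU: roughly half its weights land in system RAM, which also explains the throughput gap the developer observed.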
Asia-Pacific Angle
GLM-4.7 is developed by Zhipu AI (Beijing) and is trained for strong Chinese-English bilingual instruction following, making it a practical pick for developers in China, Taiwan, or Southeast Asia who work in mixed-language codebases or need to generate code comments and documentation in Chinese. Qwen3-Coder is Alibaba's model and also handles Chinese prompts natively, but the community finding here suggests the 80B quant may need further tuning or a better quantization strategy before it's reliable in agentic loops. Developers in the region should test GLM-4.7 Flash against their specific language mix before committing to a workflow.
Action Item This Week
Pull GLM-4.7-Flash-Q4_K_M and Nemotron-3-30B-Q4_K_M via llama.cpp, run both on a 10-step agentic coding task (file read → edit → test loop), and measure tool-call error rate and tokens/sec; a harness sketch follows below. Use that data, not benchmarks, to pick your daily driver.
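Here is a sketch of the measurement half of that loop, assuming llama-server is already running on localhost:8080 with one model loaded. The JSON tool-call convention and tool names below are simplifications standing in for whatever schema your agent framework actually uses.

```python
# Sketch: measure tool-call error rate and tokens/sec against a local
# llama-server (OpenAI-compatible API). The JSON "tool call" format is
# a stand-in, not a specific framework's protocol.
import json
import time
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"
SYSTEM = ('You are a coding agent. Reply ONLY with JSON: '
          '{"tool": "read_file|edit_file|run_tests", "args": {...}}')

errors, total_tokens, start = 0, 0, time.time()
for step in range(10):  # the 10-step loop from the action item
    resp = requests.post(URL, json={
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user",
             "content": f"Step {step}: continue the read->edit->test loop."},
        ],
        "temperature": 0.2,
    }, timeout=300).json()
    reply = resp["choices"][0]["message"]["content"]
    total_tokens += resp.get("usage", {}).get("completion_tokens", 0)
    try:
        call = json.loads(reply)
        assert call.get("tool") in {"read_file", "edit_file", "run_tests"}
    except (ValueError, AssertionError):
        errors += 1  # malformed or hallucinated tool call

elapsed = time.time() - start
print(f"tool-call error rate: {errors}/10, {total_tokens / elapsed:.1f} tok/s")
```

Note the tokens/sec figure here includes request overhead, so treat it as a relative number for comparing the two models, not an absolute decode speed.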