What Happened
A developer running an RTX 4090 with 64GB DDR5 RAM tested three quantized models for agentic coding workflows using llama.cpp's Q4_K_M k-quant format: GLM-4.7 Flash (30B), Nemotron-3 (30B), and Qwen3-Coder-Next (80B). Despite expectations, the 80B Qwen3-Coder-Next produced frequent low-level errors that required manual intervention, while the two 30B models delivered more reliable behavior and steadier throughput for sustained agentic loops at full context window.
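A setup like this can be reproduced by serving each GGUF with llama-server and pointing an agent harness at its OpenAI-compatible endpoint. Here is a minimal sketch; the model path is a placeholder and the flag values are illustrative defaults, not the developer's exact settings.

```python
# Minimal sketch: launch llama-server for one quantized model and wait
# until it is ready. The model filename below is a hypothetical local path.
import subprocess
import time
import urllib.request

MODEL = "GLM-4.7-Flash-Q4_K_M.gguf"  # placeholder path
PORT = 8080

server = subprocess.Popen([
    "llama-server",
    "-m", MODEL,
    "-ngl", "99",       # offload as many layers as fit on the 4090
    "-c", "32768",      # context length; tune against your VRAM headroom
    "--port", str(PORT),
])

# Poll the health endpoint until the model finishes loading.
for _ in range(120):
    try:
        urllib.request.urlopen(f"http://127.0.0.1:{PORT}/health", timeout=2)
        break
    except OSError:
        time.sleep(1)
```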
Why It Matters
Agentic coding differs from single-turn completion: the model runs in a loop, calls tools, reads file diffs, and self-corrects. This punishes models with inconsistent instruction following, even when their benchmark scores look strong. For indie devs and small teams running local inference, a 30B Q4_K_M model that fits cleanly in 24GB VRAM and holds a steady tokens-per-second rate is often more productive than a larger model that hallucinates tool calls. The 4090's 24GB VRAM is a hard ceiling: 30B Q4_K_M sits around 18-20GB, leaving headroom for the KV cache at long contexts.
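As a back-of-envelope check on those numbers: quantized model size is roughly parameter count times effective bits per weight, and Q4_K_M lands around 4.85 bits per weight in practice (an approximation, not an official figure).

```python
# Rough estimate of GGUF weight size for a Q4_K_M quant.
# 4.85 bits/weight is an approximate effective rate, not an exact spec.
def quant_size_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"30B @ Q4_K_M ~ {quant_size_gb(30e9):.1f} GB")  # ~18.2 GB: fits in 24GB with KV headroom
print(f"80B @ Q4_K_M ~ {quant_size_gb(80e9):.1f} GB")  # ~48.5 GB: must spill into system RAM
```

The second line is why the 80B model on this box cannot run fully on-GPU: roughly half its weights land in system RAM, which also explains the throughput gap the developer observed.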
Asia-Pacific Angle
GLM-4.7 is developed by Zhipu AI (Beijing) and is trained for strong Chinese-English bilingual instruction following, making it a practical pick for developers in China, Taiwan, or Southeast Asia who work in mixed-language codebases or need to generate code comments and documentation in Chinese. Qwen3-Coder is Alibaba's model and also handles Chinese prompts natively, but the community finding here suggests the 80B quant may need further tuning or a better quantization strategy before it's reliable in agentic loops. Developers in the region should test GLM-4.7 Flash against their specific language mix before committing to a workflow.
Action Item This Week
Pull GLM-4.7-Flash-Q4_K_M and Nemotron-3-30B-Q4_K_M via llama.cpp, run both on a 10-step agentic coding task (file read → edit → test loop), and measure tool-call error rate and tokens/sec; a harness sketch follows below. Use that data, not benchmarks, to pick your daily driver.
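Here is a sketch of the measurement half of that loop, assuming llama-server is already running on localhost:8080 with one model loaded. The JSON tool-call convention and tool names below are simplifications standing in for whatever schema your agent framework actually uses.

```python
# Sketch: measure tool-call error rate and tokens/sec against a local
# llama-server (OpenAI-compatible API). The JSON "tool call" format is
# a stand-in, not a specific framework's protocol.
import json
import time
import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"
SYSTEM = ('You are a coding agent. Reply ONLY with JSON: '
          '{"tool": "read_file|edit_file|run_tests", "args": {...}}')

errors, total_tokens, start = 0, 0, time.time()
for step in range(10):  # the 10-step loop from the action item
    resp = requests.post(URL, json={
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user",
             "content": f"Step {step}: continue the read->edit->test loop."},
        ],
        "temperature": 0.2,
    }, timeout=300).json()
    reply = resp["choices"][0]["message"]["content"]
    total_tokens += resp.get("usage", {}).get("completion_tokens", 0)
    try:
        call = json.loads(reply)
        assert call.get("tool") in {"read_file", "edit_file", "run_tests"}
    except (ValueError, AssertionError):
        errors += 1  # malformed or hallucinated tool call

elapsed = time.time() - start
print(f"tool-call error rate: {errors}/10, {total_tokens / elapsed:.1f} tok/s")
```

Note the tokens/sec figure here includes request overhead, so treat it as a relative number for comparing the two models, not an absolute decode speed.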