What Happened
A LocalLLaMA community member documented four reproducible bugs in Qwen 3.5 tool calling, drawn from hundreds of log-analysis sessions across llama.cpp, Ollama, and vLLM. The findings were synthesized with Claude Opus 4.6 and validated against live servers. The specific stack that achieved 99% reliability: Pi coding agent + llama.cpp + Bartowski Q5_K_L quants.
- Bug 1 – XML leakage: Qwen 3.5 emits tool calls as raw XML (`<function=bash>`). When text precedes the XML tag or thinking mode is enabled, servers return `finish_reason: stop` instead of parsing the call. The agent never executes the tool.
- Bug 2 – Thinking block contamination: Tool calls emitted inside `<think>` blocks are invisible to the server parser. llama.cpp issue #20837 is still open. (A fallback parser covering Bugs 1 and 2 is sketched after this list.)
- Bug 3 – Ollama partial fix: Ollama issue #14745 patched some cases but still occasionally prints tool calls as plain text in streaming mode.
- Bug 4 – vLLM streaming drops opening brace: vLLM issue #35266 causes malformed JSON tool calls during streaming, breaking downstream parsers. (A repair heuristic follows the parser sketch below.)
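The common thread in Bugs 1 and 2 is that the tool call survives in the raw text even when the server fails to parse it, so a client can recover it. Below is a minimal Python sketch of that fallback. `extract_leaked_tool_calls` is a hypothetical helper; the closing `</function>` tag and the JSON-arguments body are our assumptions (the report only shows the opening `<function=bash>` tag), so adjust the pattern to the output your server actually logs.

```python
import json
import re

# Matches raw tool-call XML leaked into plain text or <think> blocks.
# The opening <function=NAME> shape comes from the bug report; the
# closing </function> tag and JSON-arguments body are assumptions.
FUNCTION_TAG = re.compile(
    r"<function=(?P<name>[\w.\-]+)>(?P<body>.*?)</function>",
    re.DOTALL,
)

def extract_leaked_tool_calls(text: str) -> list[dict]:
    """Scan assistant output for unparsed tool-call XML and return
    {"name": ..., "arguments": ...} dicts."""
    calls = []
    for match in FUNCTION_TAG.finditer(text):
        body = match.group("body").strip()
        try:
            arguments = json.loads(body) if body else {}
        except json.JSONDecodeError:
            # Some templates emit plain text, not JSON; pass it through.
            arguments = {"raw": body}
        calls.append({"name": match.group("name"), "arguments": arguments})
    return calls
```

Run this on message content whenever `finish_reason` comes back as `stop`: a non-empty result means the model did call a tool and the server missed it.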
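For Bug 4, the issue title itself names the corruption: the opening brace of the streamed JSON arguments goes missing. A hedged repair heuristic, assuming that is the only damage (`repair_streamed_arguments` is a hypothetical helper, not a vLLM API):

```python
import json

def repair_streamed_arguments(raw: str) -> dict:
    """Best-effort repair for vLLM streaming issue #35266, where the
    opening brace of a tool call's JSON arguments can be dropped.
    The exact failure shape is our reading of the bug report; tune the
    heuristic against your own captured streams."""
    s = raw.strip()
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        # If the payload looks like an object body missing its leading
        # brace (e.g. '"path": "/tmp"}'), re-add it and retry.
        if not s.startswith("{") and s.endswith("}"):
            return json.loads("{" + s)
        raise
```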
Why It Matters
Qwen 3.5 is one of the most capable open-weight model families for coding agents and function-calling pipelines, but these bugs make it unreliable in production agentic loops without workarounds. Indie developers building coding assistants, browser agents, or API orchestration tools on local inference will hit these failures silently — the model appears to respond but no tool executes. The fix requires both server-side patches (some still pending) and client-side prompt engineering.
Asia-Pacific Angle
Qwen 3.5 is developed by Alibaba Cloud and is the dominant open-weight choice for Chinese and Southeast Asian developers due to its strong multilingual performance and permissive licensing. Teams in China, Vietnam, Indonesia, and Singapore building local-first AI agents — often to avoid OpenAI API costs or data residency issues — are disproportionately affected by these bugs. The recommended stack (llama.cpp + Bartowski Q5_K_L quants) runs on consumer hardware common in the region. Developers using Nano-GPT or similar lightweight inference servers should apply client-side XML parsing patches immediately, as server-side fixes are not yet merged upstream.
Action Item This Week
If you run Qwen 3.5 with llama.cpp, pin to a Bartowski Q5_K_L quant, disable thinking mode during tool-calling loops, and add a client-side parser that detects raw `<function=` XML output and re-routes it as a tool call. Do not rely on `finish_reason: tool_calls` alone until llama.cpp issues #20260 and #20837 are closed.
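Putting those steps together, here is a minimal sketch of the full guard. It assumes a llama.cpp server at a hypothetical `localhost:8080` OpenAI-compatible endpoint and reuses `extract_leaked_tool_calls` from the earlier sketch; the `chat_template_kwargs` / `enable_thinking` field is how newer llama.cpp builds expose Qwen's thinking toggle, so verify your build supports it before relying on it.

```python
import json
import requests

# Hypothetical local endpoint; adjust to your deployment.
LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"

def chat_with_tool_fallback(messages: list[dict], tools: list[dict]) -> dict:
    """Call the server with thinking disabled, then recover any tool
    call the server-side parser missed (Bugs 1-2)."""
    resp = requests.post(
        LLAMA_SERVER,
        json={
            "model": "qwen3.5",  # placeholder; match your loaded model
            "messages": messages,
            "tools": tools,
            # Assumption: your llama.cpp build forwards this to the
            # chat template to disable Qwen's thinking mode.
            "chat_template_kwargs": {"enable_thinking": False},
        },
        timeout=120,
    )
    resp.raise_for_status()
    choice = resp.json()["choices"][0]

    # Happy path: the server parsed the tool call itself.
    if choice.get("finish_reason") == "tool_calls":
        return choice

    # Fallback: scan plain-text content for leaked <function=...> XML
    # and re-route it as a structured tool call.
    content = (choice.get("message") or {}).get("content") or ""
    leaked = extract_leaked_tool_calls(content)  # helper sketched earlier
    if leaked:
        choice["message"]["tool_calls"] = [
            {
                "type": "function",
                "function": {
                    "name": call["name"],
                    "arguments": json.dumps(call["arguments"]),
                },
            }
            for call in leaked
        ]
        choice["finish_reason"] = "tool_calls"
    return choice
```

Treating the server's parsed output and the recovered XML identically keeps the agent loop simple: either way the caller sees `finish_reason: tool_calls` and a populated `tool_calls` list.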