What Happened
A developer running Qwen3.6-35B through agentic coding workflows reported observable regressions in tool-use reliability and reasoning-loop behavior compared to Qwen3.5, according to a post on r/LocalLLaMA published this week. The tester, /u/mr_il, ran the model across multiple quantization formats (8-bit MLX, Q6_K_XL, Q8_XL, and BF16) and observed the same failure pattern across all variants.
Testing was conducted with the OpenCode agent using oMLX and LM Studio as inference backends. Recommended settings for precision tasks were applied throughout: temperature 0.6, top-k 20. The user's core finding: the model enters infinite reasoning loops more frequently than its predecessor and fails tool calls at a higher rate.
Why It Matters
Agentic coding workflows, in which a model must plan, call tools, evaluate output, and iterate, are among the highest-value use cases driving local LLM adoption among engineers. A model that regresses on loop control is functionally unusable for multi-step tasks regardless of benchmark scores on static evals.
The failure mode described here, defensive over-checking rather than forward task progression, is a known failure class in RLHF-heavy models where reward hacking during training produces verbose, self-auditing outputs that stall execution graphs. If confirmed at scale, this would represent a training regression, not a quantization artifact, since the behavior was consistent across BF16 (full precision) and multiple compressed formats.
The report also flags failed tool calls as a secondary issue, though the tester notes this could stem from parser bugs in the toolchain rather than the model itself. That distinction matters: a model-side tool-call formatting failure is a capability regression; a parser-side failure is an integration bug fixable downstream.
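That triage can be done mechanically: if the model's raw output already parses as valid JSON with the expected tool-call fields but the harness still rejects it, the fault is likely downstream. A minimal sketch, assuming a generic JSON tool-call format (the field names are illustrative, not OpenCode's actual schema):

```python
import json

def classify_tool_call_failure(raw_output, required_keys=("name", "arguments")):
    """Rough triage: distinguish a model-side formatting failure from a
    likely parser-side (harness) failure. Field names are illustrative."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "model-side: output is not valid JSON"
    if not isinstance(call, dict) or not all(k in call for k in required_keys):
        return "model-side: JSON is valid but missing tool-call fields"
    # A well-formed call that the harness still rejected points at the parser.
    return "well-formed: investigate parser/integration"
```

Running this over a corpus of rejected tool calls would show quickly whether the failures cluster on the model side or the integration side.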
Scope of the Problem
- Infinite reasoning loops observed across all tested quantization levels (8-bit MLX, Q6_K_XL, Q8_XL, BF16)
- Failures concentrated on complex tasks — the tester cites a simple 3D game as a breaking point
- Basic application generation reportedly unaffected
- Tool call failures flagged but causation unconfirmed (model vs. parser)
The Technical Detail
The reported behavior, continuous self-rechecking without forward progress, is consistent with models that have been over-optimized for caution during post-training alignment. When a model's reward signal during RLHF or DPO training heavily penalizes wrong answers and under-penalizes non-answers or loops, the model learns to hedge by re-verifying prior steps rather than committing to tool calls or code writes.
In agentic loops specifically, this compounds: each re-verification consumes context, and if the model's internal state keeps flagging uncertainty, it never issues the tool call that would resolve the ambiguity. The result is context exhaustion without task completion, which the tester describes as occurring even with "nearly empty context," suggesting the loop is triggered by task complexity signals, not context pressure.
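A harness-side mitigation for this failure class is a progress guard: cap total iterations and abort when consecutive steps are identical re-verifications rather than new tool calls or edits. A minimal sketch of such a guard (hypothetical helper names; this is not how OpenCode actually handles loops):

```python
from collections import deque

def run_agent(model_step, max_repeats=3, max_iterations=25):
    """Hypothetical loop guard for an agentic harness. model_step() returns
    (kind, payload) tuples, e.g. ("tool_call", ...) or ("done", result)."""
    recent = deque(maxlen=max_repeats)
    for i in range(max_iterations):
        action = model_step()
        if action[0] == "done":
            return action[1]
        # A loop is flagged when the last N actions are byte-identical,
        # i.e. the model is re-verifying instead of making progress.
        recent.append(action)
        if len(recent) == max_repeats and len(set(recent)) == 1:
            raise RuntimeError(f"loop detected after {i + 1} steps: {action!r}")
    raise RuntimeError("iteration budget exhausted without completion")
```

A guard like this does not fix the underlying training issue, but it converts an infinite loop into a fast, diagnosable failure.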
The fact that BF16 — unquantized full-precision inference — exhibits the same behavior rules out quantization-induced degradation as the root cause. This points to the base model or its fine-tuning, not the compression pipeline.
Testing environment details per the source: temp=0.6, top_k=20, with the OpenCode agent on oMLX and LM Studio. These are conservative inference parameters consistent with Qwen's own recommendations for deterministic task execution.
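For anyone attempting replication, those parameters map onto a request payload for an OpenAI-compatible local endpoint such as LM Studio's. A sketch, with the caveat that the model identifier is a placeholder and top_k support as a request field varies by backend:

```python
# Reported sampling configuration expressed as an OpenAI-compatible
# chat-completions payload. Model name and top_k support are assumptions.
payload = {
    "model": "qwen3.6-35b",   # placeholder identifier, not a confirmed name
    "temperature": 0.6,        # per the tester's reported settings
    "top_k": 20,               # backend-specific extension field
    "messages": [
        {"role": "user", "content": "Plan and execute a multi-step refactor."}
    ],
}
```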
What To Watch
- Community replication: The r/LocalLLaMA thread is the first signal, not the verdict. Watch for corroborating reports from teams running structured agentic evals (SWE-bench-style or tool-call accuracy benchmarks) over the next two weeks.
- Alibaba/Qwen response: If loop regression is confirmed at scale, expect a fine-tuned patch release or updated system prompt guidance from the Qwen team. Their cadence on model updates has been fast.
- Toolchain investigation: OpenCode and LM Studio maintainers should investigate the parser-side tool call failures independently. A fix there could partially resolve the reported failure rate without waiting on a model update.
- Competing releases: If Qwen3.6-35B has a confirmed agentic regression, it reopens space for competing 30B-class models — including Mistral and Meta's Llama family — for local coding agent deployments.
Note: This report is based on a single community tester's findings and has not been independently verified or confirmed by Alibaba. Treat as an early signal requiring replication before drawing conclusions about model quality at scale.