What Happened
Alibaba's Qwen3.6 model ships with a new preserve_thinking flag that keeps the model's chain-of-thought reasoning in context across multi-turn conversations, according to a community post on r/LocalLLaMA by user onil_gova. The flag directly addresses a KV cache invalidation bug present in Qwen 3.5, where reasoning tokens were stripped and re-serialized differently on each turn — breaking cache continuity and degrading agent performance.
Qwen's official model page instructs developers to set "preserve_thinking": True rather than relying on the previous workaround of "chat_template_kwargs": {"preserve_thinking": False}. The change is opt-in and must be explicitly enabled; it does not activate by default.
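For local Transformers-based setups, the flag can travel with the serialized history as a chat-template variable. A minimal sketch, assuming the Qwen3.6 template reads a preserve_thinking variable; the model id here is a placeholder, not taken from the post:

```python
# Sketch: passing preserve_thinking through Transformers' chat template.
# Assumes the Qwen3.6 template reads a `preserve_thinking` variable; the
# model id is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6")  # placeholder id

messages = [
    {"role": "user", "content": "Summarize the last tool result."},
]

# Extra keyword arguments to apply_chat_template are forwarded to the
# template, so the flag applies to every serialization of the history.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    preserve_thinking=True,
)
print(prompt)
```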
Why It Matters
KV cache invalidation is not a cosmetic bug. When a model's prior reasoning tokens are stripped mid-session, the inference engine must recompute attention over the full context window on every turn. For long agentic chains — tool calls, multi-step planning, iterative code generation — this translates directly into higher latency and increased token spend.
Per Qwen's documentation cited in the post, preserve_thinking is "particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning." The implication: teams running Qwen3.6 in production agentic pipelines without this flag are paying a compounding cost on every inference call.
The behavioral difference is concrete. Without the flag, the model loses access to its own previous chain-of-thought entirely, a gap that surfaces as factual inconsistency or apparent amnesia within a single session. With it enabled, the model can reference prior reasoning steps as first-class context (a minimal client-side sketch follows the list below), which matters significantly for:
- Multi-step tool-calling agents that need to reconcile intermediate outputs
- Code generation loops where earlier reasoning about architecture informs later implementation decisions
- Any workflow using thinking mode that spans more than one turn
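In practice the difference also depends on what the client keeps in the message history. A minimal, framework-agnostic sketch of a loop that retains the assistant's full reply, thinking block included, rather than stripping it before the next turn (the call_model helper and the <think> tag convention are illustrative assumptions):

```python
# Sketch: a multi-turn loop that keeps the model's prior reasoning in
# context. `call_model` stands in for whatever inference call you use;
# the <think>...</think> convention is assumed for illustration.
def run_turns(call_model, user_turns):
    history = []
    for user_text in user_turns:
        history.append({"role": "user", "content": user_text})
        reply = call_model(history)  # full text, including any <think> block
        # Keep the reply verbatim -- do NOT strip the thinking block here,
        # otherwise the serialized history changes and the KV cache prefix
        # no longer matches on the next turn.
        history.append({"role": "assistant", "content": reply})
    return history
```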
The Technical Detail
The root cause, as described by onil_gova, was in the chat template itself. Qwen 3.5's template stripped thinking tokens before serializing the conversation history, which meant the KV cache — built on a specific byte sequence — was invalidated on every new turn because the serialized form of previous turns changed. Qwen3.6 resolves this at the template level by preserving thinking blocks in their original form, keeping the cache key stable across turns.
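A toy illustration of the failure mode, using a made-up serialization format rather than Qwen's actual template output: once thinking blocks are stripped on re-serialization, the prompt prefix seen on the next turn no longer byte-matches what the engine cached.

```python
# Toy illustration (made-up tags, not Qwen's real template): stripping the
# thinking block changes the serialized history, so the cached prefix from
# the previous turn no longer matches.
import os
import re

turn1_output = "<think>pick 73 as the second number</think>The numbers are 41 and 73."

def serialize(history):
    return "".join(f"<|{m['role']}|>{m['content']}" for m in history)

history = [{"role": "user", "content": "Give me two numbers."},
           {"role": "assistant", "content": turn1_output}]

kept     = serialize(history)                       # preserve_thinking-style
stripped = re.sub(r"<think>.*?</think>", "", kept)  # 3.5-style stripping

prefix = os.path.commonprefix([kept, stripped])
print(len(kept), len(stripped), len(prefix))
# The common prefix ends where the thinking block began: everything after
# that point must be recomputed, even though the conversation is the same.
```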
The flag can be passed at inference time. For frameworks using chat_template_kwargs, the correct invocation is:
{"preserve_thinking": True}Validation is straightforward. The community-documented test: prompt the model to generate two 20-digit numbers without tools, then in a follow-up turn ask for the second number. With preserve_thinking off, the model reports no second number exists — it has no access to its prior reasoning. With the flag on, it retrieves the second number immediately from its preserved thinking context .
The flag applies to both thinking and non-thinking inference modes, according to Qwen's model page, meaning the KV cache efficiency gains are not limited to explicit chain-of-thought workflows.
What To Watch
Runtime support is incomplete as of this writing. The r/LocalLLaMA post confirms LM Studio does not yet support the flag. An open pull request exists on the oMLX project to add support, submitted by onil_gova. Developers on other runtimes (vLLM, llama.cpp, Ollama, Transformers) should verify their serving stack handles the flag before assuming it is active; a quick template-level check is sketched after the status list below.
- LM Studio: No support confirmed. Watch for an update in the next release cycle.
- oMLX: PR open, not yet merged.
- vLLM / Transformers: Status unconfirmed per available information; verify template handling before deploying to production agent pipelines.
- Ollama: Status unconfirmed; modelfile-level template customization may be required.
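A low-effort sanity check before trusting any runtime: render the chat template locally with the flag toggled and confirm the serialized history actually changes. A sketch, assuming a Transformers tokenizer carrying the Qwen3.6 template (model id is a placeholder):

```python
# Sketch: check whether the installed chat template reacts to the flag at
# all. Model id is a placeholder; point it at the checkpoint you serve.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6")  # placeholder id

history = [
    {"role": "user", "content": "Give me two numbers."},
    {"role": "assistant", "content": "<think>second is 73</think>41 and 73."},
    {"role": "user", "content": "What was the second number?"},
]

on = tokenizer.apply_chat_template(history, tokenize=False,
                                   add_generation_prompt=True,
                                   preserve_thinking=True)
off = tokenizer.apply_chat_template(history, tokenize=False,
                                    add_generation_prompt=True,
                                    preserve_thinking=False)

# If the two renderings are identical, the template (or your runtime's
# bundled copy of it) is ignoring the flag and the 3.5 behavior persists.
print("template honors preserve_thinking:", on != off)
```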
Any team running Qwen3.6 in an agentic or multi-turn production context should audit their runtime configuration immediately. With the flag off, you are shipping the same cache invalidation behavior that existed in 3.5 and leaving inference efficiency on the table.