What Happened
Alibaba's Qwen3.6 model ships with a new preserve_thinking flag that keeps the model's chain-of-thought reasoning in context across multi-turn conversations, according to a community post on r/LocalLLaMA by user onil_gova. The flag directly addresses a KV cache invalidation bug present in Qwen 3.5, where reasoning tokens were stripped and re-serialized differently on each turn — breaking cache continuity and degrading agent performance.
Qwen's official model page instructs developers to set "preserve_thinking": True rather than relying on the previous workaround of "chat_template_kwargs": {"preserve_thinking": False}. The change is opt-in and must be explicitly enabled; it does not activate by default.
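For local Transformers-based setups, the flag can travel with the serialized history as a chat-template variable. A minimal sketch, assuming the Qwen3.6 template reads a preserve_thinking variable; the model id here is a placeholder, not taken from the post:

```python
# Sketch: passing preserve_thinking through Transformers' chat template.
# Assumes the Qwen3.6 template reads a `preserve_thinking` variable; the
# model id is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6")  # placeholder id

messages = [
    {"role": "user", "content": "Summarize the last tool result."},
]

# Extra keyword arguments to apply_chat_template are forwarded to the
# template, so the flag applies to every serialization of the history.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    preserve_thinking=True,
)
print(prompt)
```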
Why It Matters
KV cache invalidation is not a cosmetic bug. When a model's prior reasoning tokens are stripped mid-session, the inference engine must recompute attention over the full context window on every turn. For long agentic chains — tool calls, multi-step planning, iterative code generation — this translates directly into higher latency and increased token spend.
Per Qwen's documentation cited in the post, preserve_thinking is "particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning." The implication: teams running Qwen3.6 in production agentic pipelines without this flag are paying a compounding cost on every inference call.
The behavioral difference is concrete. Without the flag, the model loses access to its own previous chain-of-thought entirely, a gap that surfaces as factual inconsistency or apparent amnesia within a single session. With it enabled, the model can reference prior reasoning steps as first-class context (a minimal client-side sketch follows the list below), which matters significantly for:
- Multi-step tool-calling agents that need to reconcile intermediate outputs
- Code generation loops where earlier reasoning about architecture informs later implementation decisions
- Any workflow using thinking mode that spans more than one turn
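In practice the difference also depends on what the client keeps in the message history. A minimal, framework-agnostic sketch of a loop that retains the assistant's full reply, thinking block included, rather than stripping it before the next turn (the call_model helper and the <think> tag convention are illustrative assumptions):

```python
# Sketch: a multi-turn loop that keeps the model's prior reasoning in
# context. `call_model` stands in for whatever inference call you use;
# the <think>...</think> convention is assumed for illustration.
def run_turns(call_model, user_turns):
    history = []
    for user_text in user_turns:
        history.append({"role": "user", "content": user_text})
        reply = call_model(history)  # full text, including any <think> block
        # Keep the reply verbatim -- do NOT strip the thinking block here,
        # otherwise the serialized history changes and the KV cache prefix
        # no longer matches on the next turn.
        history.append({"role": "assistant", "content": reply})
    return history
```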
The Technical Detail
The root cause, as described by onil_gova, was in the chat template itself. Qwen 3.5's template stripped thinking tokens before serializing the conversation history, which meant the KV cache — built on a specific byte sequence — was invalidated on every new turn because the serialized form of previous turns changed. Qwen3.6 resolves this at the template level by preserving thinking blocks in their original form, keeping the cache key stable across turns.
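A toy illustration of the failure mode, using a made-up serialization format rather than Qwen's actual template output: once thinking blocks are stripped on re-serialization, the prompt prefix seen on the next turn no longer byte-matches what the engine cached.

```python
# Toy illustration (made-up tags, not Qwen's real template): stripping the
# thinking block changes the serialized history, so the cached prefix from
# the previous turn no longer matches.
import os
import re

turn1_output = "<think>pick 73 as the second number</think>The numbers are 41 and 73."

def serialize(history):
    return "".join(f"<|{m['role']}|>{m['content']}" for m in history)

history = [{"role": "user", "content": "Give me two numbers."},
           {"role": "assistant", "content": turn1_output}]

kept     = serialize(history)                       # preserve_thinking-style
stripped = re.sub(r"<think>.*?</think>", "", kept)  # 3.5-style stripping

prefix = os.path.commonprefix([kept, stripped])
print(len(kept), len(stripped), len(prefix))
# The common prefix ends where the thinking block began: everything after
# that point must be recomputed, even though the conversation is the same.
```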
The flag can be passed at inference time. For frameworks using chat_template_kwargs, the correct invocation is:
{"preserve_thinking": True}Validation is straightforward. The community-documented test: prompt the model to generate two 20-digit numbers without tools, then in a follow-up turn ask for the second number. With preserve_thinking off, the model reports no second number exists — it has no access to its prior reasoning. With the flag on, it retrieves the second number immediately from its preserved thinking context .
The flag applies to both thinking and non-thinking inference modes, according to Qwen's model page, meaning the KV cache efficiency gains are not limited to explicit chain-of-thought workflows.
What To Watch
Runtime support is incomplete as of this writing. The r/LocalLLaMA post confirms LM Studio does not yet support the flag. An open pull request exists on the oMLX project to add support, submitted by onil_gova. Developers on other runtimes (vLLM, llama.cpp, Ollama, Transformers) should verify their serving stack handles the flag before assuming it is active; a quick template-level check is sketched after the status list below.
- LM Studio: No support confirmed. Watch for an update in the next release cycle.
- oMLX: PR open, not yet merged.
- vLLM / Transformers: Status unconfirmed per available information; verify template handling before deploying to production agent pipelines.
- Ollama: Status unconfirmed; modelfile-level template customization may be required.
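A low-effort sanity check before trusting any runtime: render the chat template locally with the flag toggled and confirm the serialized history actually changes. A sketch, assuming a Transformers tokenizer carrying the Qwen3.6 template (model id is a placeholder):

```python
# Sketch: check whether the installed chat template reacts to the flag at
# all. Model id is a placeholder; point it at the checkpoint you serve.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6")  # placeholder id

history = [
    {"role": "user", "content": "Give me two numbers."},
    {"role": "assistant", "content": "<think>second is 73</think>41 and 73."},
    {"role": "user", "content": "What was the second number?"},
]

on = tokenizer.apply_chat_template(history, tokenize=False,
                                   add_generation_prompt=True,
                                   preserve_thinking=True)
off = tokenizer.apply_chat_template(history, tokenize=False,
                                    add_generation_prompt=True,
                                    preserve_thinking=False)

# If the two renderings are identical, the template (or your runtime's
# bundled copy of it) is ignoring the flag and the 3.5 behavior persists.
print("template honors preserve_thinking:", on != off)
```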
Any team running Qwen3.6 in an agentic or multi-turn production context should audit their runtime configuration immediately. With the flag off, you are shipping the same cache invalidation behavior that existed in 3.5 and leaving inference efficiency on the table.