Twenty hours of testing, two RTX PRO 6000 GPUs at full load, and the conclusion boils down to a single phrase: it depends on the scenario. Qwen3.6-27B (Tongyi Qianwen's latest dense model, with chain-of-thought support) and Coder-Next (an MoE architecture with roughly 35B total parameters but only about 3B activated per token) completed 30 and 25 of the 40 tasks respectively: statistically a tie, but they win in completely different ways.

What this is

Suspecting that traditional benchmarks were being "gamed" (via targeted optimization for public benchmark datasets), the tester decided to compare the two models using self-built, high-intensity tasks. There are three core findings:

First, their strengths are drastically different. On open-ended market research tasks, 27B scored 8/10 while Coder-Next scored 0/10—a massive gap. But switch to bounded business memos and document integration, and Coder-Next passed 10/10, with a single-run cost only 1/60th to 1/100th of 27B's.

Second, the most counterintuitive result: 27B with "thinking mode" disabled (i.e., skipping the visible reasoning trace and giving conclusions directly) was actually the most stable configuration across the board, hitting a 95.8% completion rate across 12 test groups. The thinking process didn't change the final decisions; it just made the output more verbose. And on tasks prone to repetitive loops, such as document integration, disabling thinking cut the failure rate in half.
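
For reference, toggling the mode off is a per-request setting rather than a separate model. The sketch below assumes Qwen3.6-27B follows the same enable_thinking chat-template convention as the published Qwen3 releases; the repo id and prompt are placeholders, not the tester's actual setup.

```python
# Minimal sketch: disabling "thinking mode" on a Qwen-family model via Transformers.
# Assumes Qwen3.6-27B exposes the same enable_thinking flag as earlier Qwen3 releases;
# the repo id below is a placeholder, not a confirmed model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.6-27B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Merge these three memos into one decision brief."}]

# enable_thinking=False asks the chat template to suppress the reasoning trace,
# so the model answers directly instead of emitting its thinking first.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Serving stacks generally expose the same switch per request, so thinking can stay on for open-ended work and off for loop-prone pipeline tasks without redeploying anything.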

Third, the same-series 3.6-35B-A3B (also an MoE architecture) performed so poorly it wasn't worth testing further. Both are MoE, yet Coder-Next is competitive and A3B is not; the MoE label by itself is no silver bullet.

Industry view

We note that the value of this test lies in its methodology, not its conclusions. As more and more models push past 90 on public leaderboards like MMLU and HumanEval, differentiation can only come from this kind of grueling, hands-on scenario testing. The tester explicitly stated that his motivation was a distrust of benchmarks.

MoE's cost advantage received hard numerical support in this kind of real-world test—a 60x to 100x cost difference is no small matter. For scenarios requiring high-frequency calls and clear task boundaries (like batch generating standard reports), this gap directly determines whether a solution can go into production.
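
To make that order of magnitude concrete, a back-of-the-envelope calculation is enough; the per-call price and monthly volume below are invented placeholders, and only the 60x to 100x ratio comes from the test.

```python
# Back-of-the-envelope cost comparison. The per-call price and monthly volume are
# invented placeholders; only the 60x-100x ratio is taken from the test write-up.
dense_cost_per_call = 0.40                    # hypothetical: one long-output run on the 27B dense model
moe_cost_per_call = dense_cost_per_call / 80  # midpoint of the reported 60x-100x gap
calls_per_month = 50_000                      # hypothetical batch-reporting workload

dense_monthly = dense_cost_per_call * calls_per_month
moe_monthly = moe_cost_per_call * calls_per_month
print(f"dense: ${dense_monthly:,.0f}/month   MoE: ${moe_monthly:,.0f}/month")
# -> dense: $20,000/month   MoE: $250/month
```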

But we must also acknowledge the limitations: this is a non-standardized test by a single person, and neither the sample nor the methodology has been peer-reviewed. Whether Coder-Next's 0/10 collapse on market research tasks is reproducible, or an artifact of prompt style, currently rests on a single data point. And while the finding that "disabling thinking improves stability" is interesting, the tester himself admitted it may depend on the specific mix of task types and cannot simply be generalized to "thinking mode is useless."

Impact on regular people

For enterprise IT: Model selection is shifting from "buying the strongest model" to "building model combos by scenario"—dense models for open-ended exploration, MoE for pipeline tasks, with a cost difference of one to two orders of magnitude.
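
In practice such a combo can be as thin as a routing table in front of two endpoints. The sketch below is purely illustrative; the task categories and model names are assumptions, not the tester's setup.

```python
# Illustrative scenario router: bounded pipeline tasks go to the cheap MoE model,
# open-ended work goes to the dense model. Names and categories are assumptions.
DENSE_MODEL = "qwen3.6-27b"   # open-ended exploration, ambiguous briefs
MOE_MODEL = "coder-next"      # bounded, templated, high-volume pipeline work

BOUNDED_TASKS = {"business_memo", "doc_integration", "standard_report"}

def pick_model(task_type: str) -> str:
    """Route well-bounded tasks to the MoE model, everything else to the dense model."""
    return MOE_MODEL if task_type in BOUNDED_TASKS else DENSE_MODEL

assert pick_model("standard_report") == MOE_MODEL
assert pick_model("market_research") == DENSE_MODEL
```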

For individual careers: Practical judgment, like knowing when to turn thinking mode off, is becoming more valuable for real decisions than knowing whose benchmark score is higher.

For the consumer market: The disconnect between benchmarks and actual performance will continue to widen. Vendors' leaderboard marketing will become increasingly untrustworthy, and the credibility of third-party scenario testing will rise.