What This Is

A post on Reddit's r/LocalLLaMA this week sparked broad discussion: 133 upvotes and 52 comments, which counts as a genuine hit in that notoriously demanding community of local-model enthusiasts. The author ran Alibaba's open-source Qwen3.6-35B-A3B, a 35-billion-parameter Mixture-of-Experts (MoE) model that activates only 3 billion parameters at inference time, on a personal desktop, driven by the coding assistant tool Open Code. The task was non-trivial: review a personal accounting application he had been building for nearly a year, identify technical debt and security vulnerabilities, then fix them directly.

The entire workflow took roughly 50 minutes: 20 minutes to generate a problem report, another 30 to work through the fixes item by item. The same task had previously been handed to the prior-generation Qwen3.5-27B, then Gemma4-31B, then even the much larger Qwen3.5-122B; all of them stalled out. The new model generated output at approximately 50 tokens per second. That is not fast. But getting a model of this capability to run at all on a 16 GB VRAM consumer GPU (an RTX 5070 Ti) is itself the noteworthy development.
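A quick back-of-envelope calculation shows why the MoE design makes this fit plausible. The sketch below is in Python; the parameter counts and VRAM figure come from the post, but the 4-bit quantization level is an assumption on our part, since the post does not state it.

```python
# Rough sizing sketch for a 35B-total / 3B-active MoE model on a 16 GB GPU.
# The 4-bit weight quantization is assumed; the post does not state the quant.

TOTAL_PARAMS = 35e9      # total parameters (from the post)
ACTIVE_PARAMS = 3e9      # parameters activated per token (MoE)
BYTES_PER_PARAM = 0.5    # 4-bit quantization (assumption)
VRAM_GB = 16             # RTX 5070 Ti (from the post)

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9     # ~17.5 GB of weights
per_token_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9  # ~1.5 GB read per token

print(f"Quantized weights: ~{weights_gb:.1f} GB vs {VRAM_GB} GB of VRAM")
print(f"Weight bytes touched per token: ~{per_token_gb:.1f} GB")
```

Even at 4 bits the full weight set slightly exceeds 16 GB, so some experts presumably spill into system RAM. But because each token touches only the roughly 3 billion active parameters, decode speed stays usable, which is consistent with the modest-but-workable 50 tokens per second the author reports.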

Industry View

The optimistic read is straightforward: MoE architecture (only a subset of parameters is activated per inference pass, which dramatically cuts compute requirements) is turning "running large models locally" from a hobbyist pursuit into a legitimate engineering option. The Qwen open-source line has moved faster than most observers expected. The generational gap between Qwen3.5 and Qwen3.6 on real-world coding tasks is, based on this account, already substantial, and only a few months separate the two releases.
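To make the mechanism concrete, here is a minimal top-k routing sketch in Python/NumPy. It is purely illustrative: the function and variable names are invented for this example, and the real router in a production MoE model is considerably more elaborate.

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Illustrative top-k MoE routing for one token (not Qwen's actual code).

    x        -- (d,) activation vector for one token
    router_w -- (n_experts, d) router weight matrix
    experts  -- list of n_experts callables, each a small feed-forward network
    k        -- number of experts activated per token
    """
    scores = router_w @ x                  # the router scores every expert cheaply
    top = np.argsort(scores)[-k:]          # indices of the k highest-scoring experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                     # softmax over just the selected experts
    # Only the k chosen experts execute; the rest stay idle, so compute per
    # token scales with active parameters rather than total parameters.
    return sum(g * experts[i](x) for g, i in zip(gate, top))
```

With, say, 64 experts and k = 2, per-token compute is a small fraction of what a dense model of the same total size would need, even though every expert's weights must still be stored somewhere.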

There are, however, important caveats worth keeping in mind. The core evidence here is one user, one project, one machine: a case report, not a systematic benchmark. The author himself notes that the new model still misbehaves on certain specifics. Most notably, it ignores the coding tool's "plan mode," jumping straight to writing files rather than proposing a plan for human review, which requires a manual workaround. More fundamentally, when a model operates as a "sub-agent" (breaking tasks into parallel subtasks and executing them autonomously), users have limited visibility into exactly what it did and did not do. Assessing actual code quality still requires an experienced human reviewer. Using an AI-generated report to evaluate AI-generated work is a closed-loop risk that remains unsolved.
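The visibility problem is at least partly addressable with process. Below is one possible guardrail sketch, written in Python; it is not the Reddit author's workaround, just an illustration of forcing a human checkpoint between what an agent writes and what gets kept, assuming the agent works inside a git checkout.

```python
import subprocess

def review_agent_changes(repo_path="."):
    """Hypothetical guardrail: show exactly what an agent changed before keeping it.

    Assumes the agent edited files in a git working tree. Note that brand-new
    untracked files would additionally need a `git status` pass.
    """
    diff_stat = subprocess.run(
        ["git", "diff", "--stat"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    print("Files the agent touched:\n" + diff_stat)
    answer = input("Keep these changes? [y/N] ").strip().lower()
    if answer != "y":
        # Roll back every modification the agent made to tracked files.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo_path, check=True)
    return answer == "y"
```

A gate like this does not solve the closed-loop evaluation problem, but it at least guarantees a human sees the full surface area of the agent's edits.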

Impact on Regular People

For enterprise IT: A locally deployed model, one that never routes data through external cloud servers, means sensitive data stays inside the internal network. For organizations with compliance requirements, that is a genuine advantage. But the distance between "technically capable" and "trustworthy in a production system" is not closed just because the model improved. A verification process still needs to be built.
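Concretely, "never routes data through external cloud servers" can be as simple as pointing existing tooling at a loopback address. The sketch below assumes a local inference server (llama.cpp, Ollama, or similar) exposing an OpenAI-compatible API on port 8080; the endpoint, port, and model id are placeholders, not details from the post.

```python
import requests

# All traffic stays on the local machine: the base URL is a loopback address,
# so neither source code nor financial data ever reaches an external service.
LOCAL_ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"  # placeholder

resp = requests.post(
    LOCAL_ENDPOINT,
    json={
        "model": "qwen3.6-35b-a3b",  # placeholder model id
        "messages": [
            {"role": "user",
             "content": "Review this function for SQL injection risks: ..."},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The same pattern extends naturally to compliance hardening: outbound traffic from the host can be blocked entirely at the firewall, and a fully local stack keeps working.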

For individual professionals: If you can write some code (or are willing to learn), running a genuinely capable coding assistant on consumer hardware is now a realistic option. This does not transform non-programmers into engineers overnight. But it does put people with a working foundation in a position to take on small projects that would previously have required outsourcing.

For the consumer market: Alibaba open-sourcing a model at this scale applies direct downward pressure on pricing for comparable cloud-based services. For end users, AI coding assistant tools will in all likelihood continue to get cheaper. That said, the "run it on your own machine" path still carries a real hardware threshold and a non-trivial setup cost; it is not a zero-friction option yet.