What This Is

A Reddit user gave Qwen 3.6 27B and Gemma 4 31B the same prompt: code a Pac-Man web game. On an M5 Max MacBook, Qwen spent 18 minutes generating 33,946 tokens (a token is the smallest unit of text a model produces), yielding lengthy code with flashy visuals; Gemma took only 3 minutes and 51 seconds and 6,209 tokens, yet delivered clearer game logic, smoother collisions, and more sensible ghost behavior. The judges ruled Gemma the winner: the model that produced more output lost.
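A rough back-of-the-envelope from those reported figures: 33,946 tokens over 18 minutes works out to roughly 31 tokens per second, while 6,209 tokens over 3 minutes 51 seconds is about 27 tokens per second. In other words, Gemma's win came almost entirely from producing fewer tokens, not from generating them faster.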

Industry View

This test hit a blind spot in current evaluation systems: benchmarks mostly measure whether the answer is correct, rarely whether it was produced efficiently. Supporters argue this is the core value of local deployment: getting usable results quickly under limited compute, and a sign that Google's engineering work on model efficiency is paying off.

But the opposing voices are worth hearing too: this is a single test with a tiny sample; Qwen's long output contained more elaborate animation logic, and the result might reverse on a creative design task; and the two models may simply be optimized for different scenarios, making a head-to-head comparison unfair. The more important takeaway is that efficiency should become a selection criterion in its own right. In local deployment, every excess token costs electricity and time.

Impact on Regular People

For enterprise IT: Don't just look at benchmark rankings when selecting models; run real tasks on your own hardware. Generation efficiency directly impacts concurrency and server costs.

For individual professionals: When running local models, concise output is usually more practical than detailed output. Learning to write constraining prompts (e.g., "no more than 200 lines of code") is more cost-effective than upgrading to a larger model; a rough sketch of this appears at the end of this section.

For the consumer market: As local AI hardware like M-series chips proliferates, the commercial value of "lightweight and efficient" models will continue to rise. This isn't a technical preference; it's cost logic.
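To make the constraining-prompt advice concrete, here is a minimal sketch in Python. It assumes a local Ollama server on the default port; the model tag, the 200-line instruction, and the num_predict token cap are illustrative assumptions, not settings from the original test.

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes a local Ollama server
MODEL = "gemma3:27b"  # hypothetical model tag; use whatever you have pulled locally

payload = {
    "model": MODEL,
    # Prompt-level constraint: state the size limit up front instead of hoping for brevity.
    "prompt": (
        "Write a minimal Pac-Man clone as a single HTML file. "
        "Keep it under 200 lines and skip decorative effects."
    ),
    "stream": False,
    # Hard backstop: cap the number of generated tokens regardless of the prompt.
    "options": {"num_predict": 4096},
}

start = time.time()
resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()
elapsed = time.time() - start

tokens = data.get("eval_count", 0)  # tokens generated, as reported by Ollama
print(f"{tokens} tokens in {elapsed:.0f}s (~{tokens / max(elapsed, 1e-9):.1f} tok/s)")
print(data.get("response", "")[:500])  # preview of the generated code
```

The token cap acts as a hard limit in case the model ignores the prompt-level constraint, and the timing printout is the same tokens-per-second check worth running on your own hardware before committing to a model.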