What this is

A developer locally tested Qwen 3.6 against Gemma 4 using vLLM (an LLM inference framework). Despite leading on benchmarks, Qwen lost: it burned 8,000+ tokens (units of text processing) in an infinite loop, evidence that LLM benchmaxing (optimizing specifically for test sets) is distorting our judgment of AI capabilities. Comparing the models on unoptimized data such as memes, GeoGuessr rounds, and fitness videos revealed several patterns. Qwen overthinks niche questions and burns excessive tokens, while Gemma is more restrained. Gemma significantly outperforms Qwen at following format instructions (e.g., outputting coordinates). Both exhibit cultural biases: Qwen understands Asian memes better, while Gemma is more familiar with European landscapes. However, Qwen performs better at video action tracking (such as counting barbell reps).
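The runaway-generation failure described above (a model repeating itself until it exhausts its token budget) can be caught cheaply with a repetition heuristic. This is a minimal sketch, not the developer's actual harness; the function name, n-gram size, and threshold are illustrative assumptions:

```python
from collections import Counter

def looks_like_loop(text: str, ngram: int = 6, threshold: int = 10) -> bool:
    """Heuristic loop detector: flag output where a single n-gram of words
    repeats `threshold`+ times, a typical signature of a decoding loop."""
    words = text.split()
    if len(words) < ngram * threshold:
        return False  # too short to loop this many times
    grams = Counter(tuple(words[i:i + ngram])
                    for i in range(len(words) - ngram + 1))
    return max(grams.values()) >= threshold

# A looping output repeats the same phrase over and over:
looping = "let me reconsider the question again " * 40
assert looks_like_loop(looping)
assert not looks_like_loop("The capital of France is Paris.")
```

In practice you would pair a check like this with a hard `max_tokens` cap on the inference side, so a loop costs a bounded amount rather than 8,000+ tokens.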

Industry view

The core judgment here is that the gap between benchmarks and reality is widening, and benchmaxing is hurting LLM usability. Model vendors game test sets for marketing, only for enterprises to discover during deployment that models fail to follow basic instructions and costs spiral out of control. This matters because, as benchmarks lose their value as a reference, the trial-and-error cost of model selection shifts heavily onto enterprises. Admittedly, some argue that benchmarks still provide a baseline for foundational capabilities, and that Qwen's progress in video understanding proves its technical iteration is working; we cannot dismiss the whole based on isolated scenarios. Yet undeniably, detection of AI-generated video remains at coin-flip accuracy, and models contradict themselves when judging authenticity, a shortcoming common to current vision models.

Impact on regular people

For enterprise IT: Selecting models based on benchmarks is extremely risky. We recommend canary testing with real business data, focusing on instruction-following rates and token consumption to avoid being misled by paper specs.
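A canary test along those lines can be very small. The sketch below assumes a hypothetical task that asks the model for a bare `lat,lon` coordinate; the regex and the whitespace-split token count are illustrative proxies (a real evaluation would use the model's own tokenizer):

```python
import re

# Hypothetical canary task: the prompt asks for coordinates as "lat,lon".
COORD_RE = re.compile(r"^-?\d{1,2}\.\d+,\s*-?\d{1,3}\.\d+$")

def canary_report(outputs: list[str]) -> dict:
    """Score a batch of model outputs on the two axes the article highlights:
    instruction-following rate and token consumption (whitespace proxy)."""
    followed = sum(bool(COORD_RE.match(o.strip())) for o in outputs)
    tokens = sum(len(o.split()) for o in outputs)
    return {
        "follow_rate": followed / len(outputs),
        "avg_tokens": tokens / len(outputs),
    }

report = canary_report([
    "48.8566, 2.3522",                             # follows the format
    "The coordinates are probably near Paris...",  # ignores the format
])
assert report["follow_rate"] == 0.5
```

Run against a few hundred real business prompts per candidate model, a report like this exposes exactly the failure modes benchmarks hide: low follow rates and runaway average token counts.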

For professionals: Do not blindly trust leaderboards when evaluating AI tools. Running a test in your specific workflow is more useful than reading a dozen reviews.

For the consumer market: Due to training data biases, users in China may find that domestic models provide more grounded, relevant feedback than overseas models when processing images or daily queries within an Asian cultural context.