What this is

A developer locally tested Qwen 3.6 against Gemma 4 using vLLM (an LLM inference framework). Despite leading on benchmarks, Qwen lost: it burned 8,000+ tokens (units of text processing) in an infinite loop, evidence that LLM benchmaxing (optimizing specifically for test sets) is distorting our judgment of AI capabilities. Comparing the models on unoptimized data such as memes, GeoGuessr rounds, and fitness videos revealed several patterns. Qwen overthinks niche questions and burns excessive tokens, while Gemma is more restrained. Gemma significantly outperforms Qwen at following format instructions (e.g., outputting coordinates). Both exhibit cultural biases: Qwen understands Asian memes better, while Gemma is more familiar with European landscapes. However, Qwen performs better at video action tracking (such as counting barbell reps).
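The runaway-generation failure described above (a model repeating itself until it exhausts its token budget) can be caught cheaply with a repetition heuristic. This is a minimal sketch, not the developer's actual harness; the function name, n-gram size, and threshold are illustrative assumptions:

```python
from collections import Counter

def looks_like_loop(text: str, ngram: int = 6, threshold: int = 10) -> bool:
    """Heuristic loop detector: flag output where a single n-gram of words
    repeats `threshold`+ times, a typical signature of a decoding loop."""
    words = text.split()
    if len(words) < ngram * threshold:
        return False  # too short to loop this many times
    grams = Counter(tuple(words[i:i + ngram])
                    for i in range(len(words) - ngram + 1))
    return max(grams.values()) >= threshold

# A looping output repeats the same phrase over and over:
looping = "let me reconsider the question again " * 40
assert looks_like_loop(looping)
assert not looks_like_loop("The capital of France is Paris.")
```

In practice you would pair a check like this with a hard `max_tokens` cap on the inference side, so a loop costs a bounded amount rather than 8,000+ tokens.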

Industry view

The core judgment here is that the gap between benchmarks and reality is widening, and benchmaxing is hurting LLM usability. Model vendors game test sets for marketing, only for enterprises to discover during deployment that models fail to follow basic instructions and costs spiral out of control. This matters because, as benchmarks lose their value as a reference, the trial-and-error cost of model selection shifts heavily onto enterprises. Admittedly, some argue that benchmarks still provide a baseline for foundational capabilities, and that Qwen's progress in video understanding proves its technical iteration is working; we cannot dismiss the whole based on isolated scenarios. Yet undeniably, detection of AI-generated video remains at coin-flip accuracy, and models contradict themselves when judging authenticity, a shortcoming common to current vision models.

Impact on regular people

For enterprise IT: Selecting models based on benchmarks is extremely risky. We recommend canary testing with real business data, focusing on instruction-following rates and token consumption to avoid being misled by paper specs.
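A canary test along those lines can be very small. The sketch below assumes a hypothetical task that asks the model for a bare `lat,lon` coordinate; the regex and the whitespace-split token count are illustrative proxies (a real evaluation would use the model's own tokenizer):

```python
import re

# Hypothetical canary task: the prompt asks for coordinates as "lat,lon".
COORD_RE = re.compile(r"^-?\d{1,2}\.\d+,\s*-?\d{1,3}\.\d+$")

def canary_report(outputs: list[str]) -> dict:
    """Score a batch of model outputs on the two axes the article highlights:
    instruction-following rate and token consumption (whitespace proxy)."""
    followed = sum(bool(COORD_RE.match(o.strip())) for o in outputs)
    tokens = sum(len(o.split()) for o in outputs)
    return {
        "follow_rate": followed / len(outputs),
        "avg_tokens": tokens / len(outputs),
    }

report = canary_report([
    "48.8566, 2.3522",                             # follows the format
    "The coordinates are probably near Paris...",  # ignores the format
])
assert report["follow_rate"] == 0.5
```

Run against a few hundred real business prompts per candidate model, a report like this exposes exactly the failure modes benchmarks hide: low follow rates and runaway average token counts.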

For professionals: Do not blindly trust leaderboards when evaluating AI tools. Running a test in your specific workflow is more useful than reading a dozen reviews.

For the consumer market: Due to training data biases, users in China may find that domestic models provide more grounded, relevant feedback than overseas models when processing images or daily queries within an Asian cultural context.