What Happened
A LocalLLaMA community member ran a structured 100-task vision benchmark comparing Gemma-4 E4B against Qwen3.5-4B. The test suite covers screenshot OCR, geographic inference from travel photos, and grocery price extraction from shelf images — all single-turn, no tools, with definitive answers. Results: Qwen3.5-4B scored 0.50 (the calibrated baseline), while Gemma-4 E4B scored 0.27. Testing was conducted first via llama.cpp (build b8680, image-min-tokens set to 1120 per official Gemma-4 docs) using Q8 quants from Unsloth and Bartowski, then repeated with Hugging Face Transformers to rule out quantization artifacts. Both backends produced similarly poor results. On a canonical geoguessing test using an image from Google's own Gemma-4 blog post, E4B via llama.cpp returned no answer; via Transformers it returned "Rome, Italy" instead of the correct "Venice, Italy." Qwen3.5-4B, Qwen3.5-9B, and GLM-4.6V Flash all answered correctly.
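For readers who want to reproduce the Transformers leg of the test, the general shape of the repro looks roughly like the sketch below. The model ID, image filename, and prompt are assumptions (the post does not include repro code); the pattern follows the standard Hugging Face image-text-to-text chat-template flow.

```python
# Hedged sketch of the Transformers-side check described above.
# The model ID and image path are assumptions, not from the post.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-e4b-it"  # hypothetical ID, per the model name in the post

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("venice_canal.jpg")},  # assumed test image
        {"type": "text", "text": "Which city was this photo taken in? Answer with city and country."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```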
Why It Matters
For indie developers and SMEs building multimodal pipelines, model selection at the 4B-parameter tier is a critical cost-efficiency decision. A score of 0.27, barely more than half the 0.50 baseline, means Gemma-4 E4B would introduce unacceptable error rates in production vision tasks such as document parsing, receipt processing, or image-based data extraction. The benchmark methodology (partial-credit scoring, diverse task types) is more representative of real agentic workloads than standard VQA leaderboards, making this finding operationally significant rather than academic.
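To make the scoring approach concrete, here is a minimal sketch of what a partial-credit scorer for the geolocation tasks could look like. The actual rubric is not published in the post, so the 0.5 award for a country-only answer and the substring matching are illustrative assumptions:

```python
# Illustrative partial-credit scorer in the spirit of the benchmark.
# The 0.5 partial award and city/country split are assumptions.
def score_geo_answer(prediction: str, city: str, country: str) -> float:
    """Return 1.0 for correct city, 0.5 for correct country only, 0.0 otherwise."""
    pred = prediction.lower()
    if city.lower() in pred:
        return 1.0
    if country.lower() in pred:
        return 0.5
    return 0.0

def benchmark_score(results: list[tuple[str, str, str]]) -> float:
    """Mean partial-credit score over (prediction, city, country) tasks."""
    return sum(score_geo_answer(*r) for r in results) / len(results)

# Example: the Venice test case from the post would earn partial credit.
print(score_geo_answer("Rome, Italy", "Venice", "Italy"))  # 0.5: country right, city wrong
```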
Asia-Pacific Angle
Chinese and Southeast Asian developers building document-heavy or e-commerce applications — think receipt OCR for expense tools, product image parsing for Shopee or Lazada integrations, or Chinese-language screenshot extraction — should note that Qwen3.5-4B and GLM-4.6V Flash both outperformed Gemma-4 E4B on this benchmark. Qwen3.5 models are developed by Alibaba and have demonstrated strong multilingual and visual understanding, including CJK script recognition. GLM-4.6V Flash from Zhipu AI is another viable regional alternative. Both run efficiently on consumer hardware and are available via Hugging Face and local inference stacks.
Action Item This Week
If you have an existing or planned vision pipeline built on Gemma-4 E4B, run at least 20 representative tasks from your actual use case against Qwen3.5-4B on the same llama.cpp or Transformers backend before committing to deployment. The performance gap appears consistent across inference frameworks, so this is not merely a quantization issue.
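One lightweight way to run that comparison is to point the same tasks at two local llama-server instances (started with their respective vision projectors via --mmproj) through the OpenAI-compatible chat endpoint. Everything below is a sketch under assumptions: the ports, the example task, and exact-substring matching as a stand-in for whatever scoring your pipeline actually needs.

```python
# Hedged A/B harness sketch: send the same image+prompt to two local
# llama-server instances and compare answers. Ports, model names, and
# the task list are assumptions; replace with your own pipeline data.
import base64
import json
import urllib.request

ENDPOINTS = {
    "gemma-4-e4b": "http://localhost:8080/v1/chat/completions",
    "qwen3.5-4b": "http://localhost:8081/v1/chat/completions",
}

def ask(url: str, image_path: str, prompt: str) -> str:
    """POST one image+prompt task to an OpenAI-compatible endpoint."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    body = json.dumps({
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ~20 (image, prompt, expected) triples drawn from your real workload.
tasks = [
    ("receipt_001.jpg", "What is the total amount on this receipt?", "42.50"),
]

for name, url in ENDPOINTS.items():
    hits = sum(expected.lower() in ask(url, img, prompt).lower()
               for img, prompt, expected in tasks)
    print(f"{name}: {hits}/{len(tasks)} exact-substring matches")
```

Substring matching is deliberately crude; swap in a partial-credit scorer like the one sketched earlier if your tasks have graded answers.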