What this is

DeepSeek began a gradual (gray-release) rollout of multimodal capabilities this week, enabling the LLM to "see and describe images." The feature is currently available to only some users and has not yet been fully rolled out.

We compiled the core conclusions from our testing:

Speed is the biggest highlight. In non-thinking mode, general image recognition and text OCR (Optical Character Recognition, i.e., extracting text from images) are nearly instant, far exceeding expectations. It even recognizes artistic fonts and the cursive script of the Lantingji Xu.
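To make the screenshot-OCR use case concrete, here is a minimal sketch of how such a request could look, assuming the gray-test vision feature is eventually exposed through DeepSeek's OpenAI-compatible chat API using the standard image_url content format; the endpoint, model name, and image support shown here are assumptions for illustration, not confirmed details from the gray test.

```python
import base64
from openai import OpenAI

# Assumption: the gray-test vision feature is reachable through DeepSeek's
# OpenAI-compatible endpoint; model name and image support are placeholders.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

# Encode a local screenshot as a base64 data URL.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="deepseek-chat",  # placeholder; the multimodal model's name is not public
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract all text from this screenshot, preserving line breaks."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```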

Chart and table parsing is accurate: the numbers check out, and table structures are faithfully reconstructed.

However, recognition of color-blindness test cards failed in both modes. These cards require actually "seeing" the number hidden in the dot pattern before answering; the model skips perception and jumps straight to reasoning, which guarantees errors. This suggests DeepSeek's underlying visual encoding still has blind spots that "thinking longer" cannot compensate for.

Image-to-HTML works but lacks refinement. The gap between the two modes is small: it can produce a working 0→1 skeleton, but details such as spacing and font fidelity still lag a generation behind UI-proficient models like Gemini.
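The image-to-HTML workflow could be driven the same way. The sketch below reuses the same assumed OpenAI-compatible vision request (endpoint, model name, and image support are illustrative placeholders, not confirmed DeepSeek API features), changing only the prompt and saving the output, with the understanding that spacing and fonts will still need manual cleanup as noted above.

```python
import base64
from openai import OpenAI

# Same assumptions as the OCR sketch: endpoint, model name, and vision support
# are illustrative placeholders, not confirmed DeepSeek API features.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

with open("design_mockup.png", "rb") as f:
    mockup_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="deepseek-chat",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Recreate this UI as a single self-contained HTML file with inline CSS. "
                     "Match layout, spacing, and fonts as closely as you can."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{mockup_b64}"}},
        ],
    }],
)

# Save the generated skeleton; expect to hand-tune spacing and font details.
with open("generated.html", "w", encoding="utf-8") as out:
    out.write(response.choices[0].message.content)
```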

Industry view

Community excitement focuses on two dimensions: speed and unlocking coding workflows. Previously, DeepSeek lacked vision capabilities and could not read screenshots or design files, a glaring shortcoming in coding scenarios. With that gap filled, the end-to-end development assistance pipeline is finally connected, which is why multiple testers consider it "more important than V4."

However, the color-blindness card failure is no edge case. It reveals a core characteristic of DeepSeek's multimodality: strong at "image-based reasoning," weak at "truly understanding images." When a task requires visual perception rather than logical deduction, the model's flaws are exposed. As one tester bluntly summarized: reasoning is the strength; visual encoding itself remains the weakness.

Furthermore, the image-to-HTML results show that DeepSeek still trails the first tier in scenarios requiring pixel-perfect restoration. Expecting it to directly produce deliverable frontend code is unrealistic at this stage.

Impact on regular people

For enterprise IT: With DeepSeek's multimodal gap filled, internal scenarios such as document OCR and report parsing can depend less on external APIs, making self-built solutions more complete.

For individual professionals: For lightweight needs such as screenshot Q&A or academic chart interpretation, the near-instant responses in non-thinking mode make for a great experience, well suited to high-frequency, rapid interactions.

For the consumer market: Most people still cannot access it during the gray-release phase, and the basic visual-perception flaws limit deployment in high-precision scenarios such as medical imaging and quality inspection. In the short term, it remains "good enough but not amazing."