DeepSeek, Peking University, and Tsinghua University released a new multimodal reasoning framework this week, only to swiftly delete the code repository. The episode exposes a significant gap between academic exploration and productization.

What this is

This research, called "Thinking with Visual Primitives," centers on treating coordinate points and bounding boxes (i.e., the boxes marking object locations in images) as the "minimal units of thought" in the model's reasoning process.

Most existing multimodal AI systems handle images by "looking, then answering": they encode the image content once, then reason and produce output in text alone. The new framework enables models to "point while looking": at each step of reasoning (i.e., step-by-step thinking to solve a problem), the model can look back and annotate specific locations on the image, much as a person points at a route on a map while thinking.
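To make the idea concrete, here is a minimal sketch of what such an interleaved chain of thought might look like as data. Everything below is illustrative and hypothetical: the `BoxStep` class, the `<box>` tag format, and the coordinate scheme are assumptions for this example, not the actual format used by the deleted repository.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class BoxStep:
    """One reasoning step; `box` is an optional (x1, y1, x2, y2) pixel region it points at."""
    text: str
    box: Optional[Tuple[int, int, int, int]] = None  # None for purely textual steps

def render_trace(steps):
    """Serialize a mixed text/box chain of thought into one string,
    emitting a hypothetical <box>...</box> marker after each grounded step."""
    parts = []
    for s in steps:
        parts.append(s.text)
        if s.box is not None:
            x1, y1, x2, y2 = s.box
            parts.append(f"<box>{x1},{y1},{x2},{y2}</box>")
    return " ".join(parts)

# A toy geometric-reasoning trace: two steps point at image regions, the last is text-only.
trace = [
    BoxStep("The triangle's apex is here", (120, 40, 160, 80)),
    BoxStep("and its base spans this region", (60, 200, 220, 230)),
    BoxStep("so the height is roughly 160 px."),
]
print(render_trace(trace))
```

The point of the sketch is only that spatial markers become first-class tokens inside the reasoning trace, rather than the image being consulted once up front.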

But the crucial detail is that DeepSeek deleted the code repository shortly after releasing it. Open-sourcing a project and then retracting it is highly unusual in both academic and engineering circles.

Industry view

We see the value of this research in its attempt to break through a current bottleneck in multimodal AI: the limited involvement of visual information during reasoning. Existing models typically process an image early on and rarely revisit its details in subsequent reasoning. Embedding spatial markers into the chain of thought should, in theory, improve the model's ability to handle tasks that demand fine-grained visual reasoning, such as geometric proofs and medical imaging.

However, we also note clear grounds for skepticism. First, the repository deletion is itself a signal: it may point to problems with the method's reproducibility or scalability, or to undisclosed compliance considerations. Second, this "visual thinking" approach significantly increases computational overhead; processing extra spatial markers at every reasoning step could make commercial deployment prohibitively expensive. Some community commentators noted that similar ideas were explored in 2023 research, and that DeepSeek's incremental contribution would need verification through more rigorous controlled experiments.

Impact on regular people

For enterprise IT: We view this as a probing of multimodal capability boundaries. It does not change technology selection in the short term, but it is worth monitoring the application potential of visual reasoning in scenarios like quality inspection and remote sensing analysis.

For individual professionals: AI's visual capabilities are shifting from "recognition" to "reasoning," but this transition will take time, so there is no need for radical adjustments to existing workflows.

For the consumer market: The direct impact is limited. This type of technology is still far from consumer-facing products and is more likely to land first in specialized fields like medical imaging and autonomous driving.