Computer vision is solved if you let the model use tools

qasimWani · 8/5/2025, 11:45:10 AM · spatial-reasoning.com ↗

Comments (5)

qasimWani · 2h ago
i previously co-founded a synthetic data company, focused on fine-tuning diffusion models for robotics and manufacturing. the standard approach: generate better data, train smaller models, deploy. recently, reasoning models like o3, grok, and gemini began showing signs of strong spatial awareness. so i tested them on bounding box detection in complex scenes. they failed. badly.

but the reasoning trace showed impressive semantic understanding. the failure wasn’t conceptual. it came from tokenization and decoding limits. the models knew what they were seeing but couldn’t translate it into precise coordinates. (gemini 2.5 performs better because it uses an MoE with task-specific heads).

so i built a simple system that gives these models tools:

1. overlay a reference grid (inspired by Set of Marks, Microsoft 2023) to ground them visually

2. crop and zoom into regions of interest

3. call external detectors like Grounding DINO when helpful
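here's a minimal sketch of step 1 (the reference grid) using Pillow — the function name, cell labeling scheme ("A0", "B3", ...), and defaults are mine for illustration, not necessarily what the repo does. the point is that the model can name a labeled cell instead of emitting raw pixel coordinates, which is exactly where decoding breaks down:

```python
from PIL import Image, ImageDraw

def overlay_grid(img, step=100, color=(255, 0, 0)):
    """Draw a labeled reference grid on a copy of `img` so a model can
    reference cells (e.g. "B3") instead of raw pixel coordinates."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    # vertical and horizontal grid lines every `step` pixels
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=color, width=1)
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=color, width=1)
    # label each cell: row letter + column index, e.g. "A0", "B3"
    for i, y in enumerate(range(0, h, step)):
        for j, x in enumerate(range(0, w, step)):
            draw.text((x + 2, y + 2), f"{chr(65 + i)}{j}", fill=color)
    return out

demo = Image.new("RGB", (400, 300), "white")
gridded = overlay_grid(demo)
```

a cell the model names can then be mapped back to a pixel box (`row * step, col * step, ...`) for cropping or final output.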

with only prompting, this setup enables zero-shot object detection on tasks where traditional vision models fail. for example, detecting the barely visible YC logo on this person's jacket in a linkedin feed screenshot is only possible once you zoom into the right regions [https://www.spatial-reasoning.com/share/45dfaeaa-e5a1-4a8c-a...]

demo here: [spatial-reasoning.com] open-source code: [https://github.com/QasimWani/spatial-reasoning]

curious to hear thoughts. still exploring edge cases and failure modes. might write a more detailed blog if there’s interest.

qasimWani · 2h ago
another harder example: detecting a street sign on market st in sf that only becomes findable after multiple zoom-ins [https://www.spatial-reasoning.com/share/d7bab348-3389-41c7-9...]

one interesting pattern: forcing the model to keep its reasoning chain internal (i.e., no verbose "think step-by-step") actually improves accuracy. it seems to reduce hallucinations and overcorrections. still working on a clearer theory, but shorter chains seem to preserve spatial focus better.

curious how others think tool use like this could generalize.

also open to any references on visual grounding in LMMs. feels like a strangely underexplored space.

sota_pop · 1h ago
I’ve always felt CNNs are much more natural for visual analysis. It’s funny/unfortunate that transformers work SO well that their performance CAN rival CNNs, but it takes so much more work/processing power/model size. CNNs just feel like a more ergonomic fit to the problem (to me), but my experience is rooted in studying DL from when GANs were all the rage and “Attention Is All You Need” was a brand new paper, and admittedly, I need to brush up on my ViT theory.
qasimWani · 1h ago
yeah, having that convolution prior is definitely useful when you're dealing with a limited amount of data: you're encoding problem structure into the model, which is why CNNs get away with training on fewer samples, but at the cost of generalization.

but i think this moment is quite different because instead of baking everything into the latent space of these models, you're letting them reason the way a human would - if i were asked to detect the street sign, i'd start by zooming into different regions and iteratively figure out what's relevant. YOLO and other models can't do this well because they lack the language component, which is a must-have for complex reasoning like this. for example: https://www.spatial-reasoning.com/share/2d4a8827-b227-4f23-a....
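to make the iterative zoom concrete, here's a rough sketch of a single zoom step with Pillow — in the real loop the model would propose the box each round; the function name, the coordinates, and the upsampling heuristic are all hypothetical:

```python
from PIL import Image

def zoom_step(img, box, min_size=32):
    """Crop to a (model-proposed) region of interest and upsample tiny
    crops, so the next reasoning round sees more pixels per object.
    `box` is (left, top, right, bottom) in the current image's coords."""
    crop = img.crop(box)
    # upsample very small crops so fine detail survives re-encoding
    if crop.width < min_size or crop.height < min_size:
        crop = crop.resize((max(crop.width, min_size),
                            max(crop.height, min_size)))
    return crop

# e.g. two successive zooms the model might propose (coords made up)
img = Image.new("RGB", (1024, 768), "white")
roi = zoom_step(img, (400, 200, 600, 350))   # first zoom: 200x150 region
roi = zoom_step(roi, (50, 30, 120, 90))      # second zoom within the crop
```

note that each crop's coordinates are local to the previous crop, so to report a final bounding box you'd compose the offsets back up the chain to original-image coordinates.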

like, 4o can't do this even though it most likely has the same vision encoder as o4. this is the power of reasoning.

sota_pop · 1h ago
Isn’t this (subdividing into regions and analyzing each region within the context of the overall image) - essentially - the methodology of the YOLO algorithm?