I’ve been experimenting with using multimodal LLMs for bounding box detection, and ran into the same issue many people here have described: models often return arbitrary or inconsistent coordinates.
As a workaround, I tried a different approach: instead of asking the model for raw pixel coordinates, I divide the image into a grid and let the LLM reason in terms of grid cells (e.g. row/column indices). These grid indices are then mapped back into pixel coordinates.
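To make the idea concrete, here's a minimal sketch of the grid-to-pixel mapping step. The function name, the `{rowStart, colStart, rowEnd, colEnd}` shape, and the 8x8 default grid are just illustrative assumptions of mine, not necessarily how the repo's API looks:

```js
// Illustrative sketch only — names and grid size are assumptions, not GILM's actual API.
// The model answers in grid coordinates (inclusive cell indices), e.g.
// { rowStart: 2, colStart: 5, rowEnd: 4, colEnd: 7 } on an 8x8 grid,
// and we convert that back to pixel coordinates on the original image.
function gridToPixelBox(gridBox, imageWidth, imageHeight, rows = 8, cols = 8) {
  const cellW = imageWidth / cols;
  const cellH = imageHeight / rows;
  return {
    x: Math.round(gridBox.colStart * cellW),
    y: Math.round(gridBox.rowStart * cellH),
    // +1 because the end indices are inclusive cells
    width: Math.round((gridBox.colEnd - gridBox.colStart + 1) * cellW),
    height: Math.round((gridBox.rowEnd - gridBox.rowStart + 1) * cellH),
  };
}

// Example: an 8x8 grid over a 1024x768 image
const box = gridToPixelBox({ rowStart: 2, colStart: 5, rowEnd: 4, colEnd: 7 }, 1024, 768);
// -> { x: 640, y: 192, width: 384, height: 288 }
```

The upside of reasoning in cell indices is that the model only has to pick from a small, discrete set of values instead of guessing pixel offsets, which seems to be where most of the inconsistency comes from.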
This “grid-indexed” method doesn’t solve everything, but it seems to reduce randomness and makes outputs more stable across providers (OpenAI, Anthropic, Gemini, etc.). It’s lightweight — just a single JS file + example HTML demo.
Code and README are here: https://github.com/IntelligenzaArtificiale/GILM-Grid-Indexed...
I’d be curious if others have tried similar approaches, or if anyone has ideas on how to improve robustness of bounding box detection with LLMs.