I’ve been experimenting with using multimodal LLMs for bounding box detection, and ran into the same issue many people here have described: models often return arbitrary or inconsistent coordinates.
As a workaround, I tried a different approach: instead of asking the model for raw pixel coordinates, I divide the image into a grid and let the LLM reason in terms of grid cells (e.g. row/column indices). These grid indices are then mapped back into pixel coordinates.
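To make the idea concrete, here's a minimal sketch of the grid-to-pixel mapping step. The function name, the `{rowStart, colStart, rowEnd, colEnd}` shape, and the 8x8 default grid are just illustrative assumptions of mine, not necessarily how the repo's API looks:

```js
// Illustrative sketch only — names and grid size are assumptions, not GILM's actual API.
// The model answers in grid coordinates (inclusive cell indices), e.g.
// { rowStart: 2, colStart: 5, rowEnd: 4, colEnd: 7 } on an 8x8 grid,
// and we convert that back to pixel coordinates on the original image.
function gridToPixelBox(gridBox, imageWidth, imageHeight, rows = 8, cols = 8) {
  const cellW = imageWidth / cols;
  const cellH = imageHeight / rows;
  return {
    x: Math.round(gridBox.colStart * cellW),
    y: Math.round(gridBox.rowStart * cellH),
    // +1 because the end indices are inclusive cells
    width: Math.round((gridBox.colEnd - gridBox.colStart + 1) * cellW),
    height: Math.round((gridBox.rowEnd - gridBox.rowStart + 1) * cellH),
  };
}

// Example: an 8x8 grid over a 1024x768 image
const box = gridToPixelBox({ rowStart: 2, colStart: 5, rowEnd: 4, colEnd: 7 }, 1024, 768);
// -> { x: 640, y: 192, width: 384, height: 288 }
```

The upside of reasoning in cell indices is that the model only has to pick from a small, discrete set of values instead of guessing pixel offsets, which seems to be where most of the inconsistency comes from.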
This “grid-indexed” method doesn’t solve everything, but it seems to reduce randomness and makes outputs more stable across providers (OpenAI, Anthropic, Gemini, etc.). It’s lightweight — just a single JS file + example HTML demo.
Code and README are here: https://github.com/IntelligenzaArtificiale/GILM-Grid-Indexed...
I’d be curious if others have tried similar approaches, or if anyone has ideas on how to improve robustness of bounding box detection with LLMs.