O3 beats a master-level GeoGuessr player, even with fake EXIF data

429 points | bko | 4/29/2025, 4:48:42 PM | sampatt.com | 294 comments


rafram · 23h ago
From one of o3 outputs:

> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.

That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple businesses, but that isn't the point.

SamPatt · 21h ago
Author here - it's a fair criticism, and I point it out in the article. However, I kept it in for a few reasons.

I'm trying to show the model's full capabilities for image location generally, not just playing geoguessr specifically. The ability to combine web search with image recognition, iteratively, is powerful.

Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.

It beat me in the Colombia round without search at all.

It's worthwhile to do a proper apples-to-apples comparison - I'll run it again and update the post. But the point was to show how incredibly capable the model is generally, and the lack of search won't change that. Just read the chain of thought; it's incredible!

LeifCarrotson · 20h ago
There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!

It's not interesting playing chess against Stockfish 17, even for high-level GMs. It's alien and just crushes every human. Writing down an analysis to 20-move depth, following some lines to 30 or more, would be cheating for humans; it would take way too long (exceeding any time controls, and more importantly exceeding the lifetime of the human), but a powerful computer can just crunch it in seconds. Referencing a tablebase of endgames for 7 pieces would also be cheating: memorizing 7 terabytes of bitwise layouts is absurd for a human, but the computer just stores that on its hard drive.

Human geoguessr players have impressive memories, way above baseline, with respect to regional infrastructure, geography, trees, road signs, written language, and other details. Likewise, human Jeopardy players know an awful lot of trivia. Once you get to something like Scrabble or chess, it's less and less about knowing words or knowing moves, and more about synthesizing that knowledge intelligently.

One would expect a human to recognize some domain names like, I don't know, osu.edu: lots of people know that's Ohio State University, one of the biggest schools in the US, located in Columbus, Ohio. They don't have to cheat and go to an external resource. One would expect a human (a top human player, at least) to know that taxilinder.at is based in Austria. One would never expect any human to have every business or domain name memorized.

With modern AI models trained on internet data, searching the internet is not that different from querying its own training data.

kirrent · 15h ago
As another example, consider the apparently successful Dota 2 and StarCraft 2 bots. They'd be interesting if they taught us new ideas about the games, in the same way that AlphaGo's "God move" uncovered something new about Go. But they didn't. They excelled through superior micro and flawless execution of quite simple strategies. Watching pros trying to hold off waves of perfectly microed blink stalkers reminded me of seeing a chess engine in action: a computer grinding down its doomed human opponent using the advantages offered by being a computer, rather than superior human-like play.
grogenaut · 12h ago
I'm pretty sure that the bots changed the dieback meta around the last TI in Seattle, when OpenAI last did their demo pre-Canada TI. So I disagree that "the AI taught us nothing". Prior to that, dieback was seen as bad. After that, people did the math and realized that with spam respawning, the money and growth matter more. They may have altered the game after that, I don't know. I only paid attention when it was at Climate Pledge / Key.
Ntrails · 8h ago
The AI's play meaningfully added ideas of ways to play Dota 2, iirc. It wasn't just buying back: the way they played around an early advantage was hyper-aggressive, not much farming, spam-buying regen to stay out on the map, etc.

On the other hand you could generally beat the first "1v1 mid" bot by just cutting the wave behind its tower. So adaptation to new stuff was not good in isolation.

I would have loved to know whether, given more time/prep/replays/practice, pros would have figured out the holes. My guess is yes.

somenameforme · 10h ago
Your assessment of computer chess could use a bit of elaboration. A strong human can easily play an entire game blindfolded - even in blitz/high-speed time controls. So seeing a line 30 moves out is not especially remarkable. What makes computers so unbelievably strong in chess is much like in other domains, and it's pretty boring: they will literally never make a simple oversight or blunder. Even the best human players regularly make "simple" mistakes, even on the current move, let alone in their distant analyses.

So 98% of the moves a computer plays are not especially surprising at all. A strong human will just about always have at least considered the move, and even if not, they'll immediately understand the point. And for the other 2% there's a relatively simple explanation: computers' inability to make short-term mistakes lets them consider ideas humans never would. For instance, humans tend to like material, yet there are a shockingly large number of positions where a modern computer will sac a piece and then just continue on playing a piece down in what "feels" like a fairly normal position. It turns out that your opponent simply has no way to convert their material advantage, so your positional advantage will tell in the long run, even a piece down! At least if you're a computer...

This has led to some interesting outcomes. For instance, Fabiano Caruana, a top-10 player in the world, is extremely well known for his exceptional level of deep and creative opening preparation, all computer-approved. But in more than a few instances he's ended up in positions that look bad but where a computer will say he's practically winning, and he has ultimately gone on to lose the game. It's because these sorts of positions might indeed be objectively winning, but converting may require 10 or 20 practically perfect moves, whereas a single subtle mistake means you lose. And it's extremely hard for even the best players in the world to play like this.

umanwizard · 10h ago
> A strong human can easily play an entire game blindfolded - even in blitz/high speed time controls. So seeing a line 30 moves out is not especially remarkable.

How are these points connected? Playing blindfolded doesn’t require being able to calculate 30 moves deep (or any particular number).

Being able to remember/visualize an N move sequence without losing the thread while blindfolded is not at all the same thing as being able to calculate N moves deep.

somenameforme · 8h ago
I assume you mean that when a human is calculating some variation 30 moves deep, we're obviously discarding a ridiculous chunk of the overall game tree? Absolutely true, but the same is also true of computers. For instance, I just let Stockfish 17 run on the starting position until it got to a reported depth of 30. It took almost exactly 10 seconds while running at ~3.2 million nodes per second. So it assessed about 32 million positions to get to a reported depth of 30 (which is 15 moves for each side), but there are at least something like 8e41 possible lines there (assuming a low average of 25 possible moves per position). So it's discarding a percentage of moves that pretty safely rounds up to 100%.
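
In code, that back-of-the-envelope arithmetic looks like this (a sketch using the rough figures above, not a benchmark):

  nodes_per_sec = 3.2e6   # reported Stockfish speed
  seconds = 10            # time to reach reported depth 30
  searched = nodes_per_sec * seconds   # ~3.2e7 positions assessed
  full_tree = 25 ** 30    # ~8.7e41 lines, assuming ~25 legal moves per ply
  print(f"{searched:.1e} of {full_tree:.1e}")              # 3.2e+07 of 8.7e+41
  print(f"fraction searched: {searched / full_tree:.1e}")  # ~3.7e-35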

Another example to illustrate the point is the ICCF (International Correspondence Chess Federation). Were computers as competent at long-term play as they are at short-term play, there wouldn't even be a competition; it'd simply be who has the strongest computer. But in reality that seems to play no particularly decisive factor. For instance, as in "normal" chess, there remains a huge gender divide in ratings, yet females certainly have no less access to competent hardware than males.

emporas · 15h ago
If all internet data could be saved on a disk alongside the model weights, then what's the difference between pulling the knowledge out of the weights exclusively, versus out of the weights plus JPEG images? I don't see any difference.

The only difference might be compression, model weights throw away the noise and save the signal only.

For me humans versus machines is not an interesting competition. Machines will always win in a narrow specialized domain.

A more interesting competition is a very experienced human versus an amateur who knows how to use AI. Statistical/probabilistic models get confused, and they can easily wander aimlessly into rabbit holes. But a human who knows how to control the AI, while being an amateur at that particular narrow task, could guide it and in the end perform the same as, or even better than, the more experienced person.

In chess that's not true, due to the super narrowed-down domain of 64 squares and 6 different pieces, but for anything more general a natural intelligence is necessary.

When I use it for programming, I never ask it to write code; I guide it to write that function, and to use that other function from a library. If it is left free to guess, it will guess correctly 90% to 99% of the time, but if it is instructed then the code is almost flawless: nine nines of accuracy.

vlovich123 · 18h ago
> It's not interesting playing chess against Magnus, even for high-level GMs. He just crushes almost every human

The differences even among humans between the absolute best and those just outside the top 10 tend to be pretty drastic. And a non-IM against Magnus won't even understand what's going on. You could similarly claim that Magnus just memorized a bunch of openings, which is similar to criticism leveled at GMs generally, and is why Chess960 is now gaining more traction. My point is that there's not really such a thing as "fair" in a competition.

Re geoguessr, why not let them use whatever tools are available? I have similar critiques about bike racing and the restrictions on the technology advancements they can put on the bike. But every competition chooses arbitrary lines to draw, which compose the rules, so it doesn't really matter.

sensanaty · 17h ago
I mean Geoguessr explicitly states when you launch the game (in PvP mode) that googling/searching is bannable
ralfd · 16h ago
Geoguessr is a game with artificial rules though. If I want the AI to solve a task I care about the result, not what tools it uses.
vlovich123 · 15h ago
That’s exactly my point. Evaluating the task success independent of artificial limitations that are specific to the game doesn’t invalidate the result.
tshaddox · 19h ago
> There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!

And a lot of human competitions aren't designed in such a way that the competition even makes sense with "AI." A lot of video games make this pretty obvious. It's relatively simple to build an aimbot in a first-person shooter that can outperform the most skilled humans. Even in ostensibly strategic games like Starcraft, bots can micro in ways that are blatantly impossible for humans and which don't really feel like an impressive display of Starcraft skill.

Another great example was IBM Watson playing Jeopardy! back in 2011. We were supposed to be impressed with Watson's natural language capabilities, but if you know anything about high-level Jeopardy! then you know that all you were really seeing is that robots have better reflexes than humans, which is hardly impressive.

mrlongroots · 19h ago
To reframe your takeaway: you want to benchmark the "system" and see how capable it is. The boundaries of the system are somewhat arbitrary: is it "AI + web" or "only AI"? It is not about fairness as much as about "what do you, the evaluator, want to know".
rowanG077 · 18h ago
You seem to indicate that you want a computer to beat a human without ever using what a computer is actually good at (large memories, brute-force compute, etc.). That seems a little ridiculous to me. How do you want it to engage? Disallowed use of native compute, and it must simulate a full human brain?

Sure, I do agree that the web search is too far, because it's literally cheating. But Stockfish is superhuman at chess; it doesn't really matter that it achieves this by leveraging the strengths of a computer.

monadINtop · 17h ago
I disagree. If we're gonna hype up machines for their prowess at "thinking" and being artificially "intelligent" in that soft, effusive human way, then yeah, I think it's fair criticism. We already knew from the '50s that computers are like stupid geniuses when it comes to following algorithms and crunching computations far too expansive and tedious for any human.
rowanG077 · 16h ago
The point is that from a black-box view they are rapidly surpassing humans in a lot of fields. You can say they do it with tools the human mind has no access to. That's probably true. But the "soft effusive human way" of being intelligent is also a black box, and something we aren't even close to understanding. That makes it about as measurable as string theory. "If it's not exactly like this thing we don't understand, it's not fair."
chongli · 15h ago
They're not a black box though. They're querying an external resource (Google Search). That's crossing an API boundary. If you're going to let them use Google Search then let the human opponent use Google Search as well.

It's like if you were building an AI robot to run a marathon against a human opponent, except you let the AI robot ride a motorcycle and force the human to stay on foot.

SamPatt · 14h ago
Search was irrelevant in this case. I ran it again without search and it made the same guesses. I updated the post with those details.
rowanG077 · 14h ago
I didn't say the AI is a black box; I said if you take a black-box view. That last word is load-bearing.

Did you read the article? It clearly shows that, with or without search, there's not much difference in how good it actually is.

k4rli · 21h ago
It's still as much cheating as googling. Completely irrelevant. Even if it were to beat Blinky, it's no different from googlers/scripters.
SamPatt · 19h ago
I disagree. I ran those rounds again, without search this time, and the results were nearly identical:

https://news.ycombinator.com/item?id=43837832

IanCal · 20h ago
I tried the image without search and it talked about Dornbirn anyway, but ended up choosing Bezau, which is really quite close.

edit - the models are also at a disadvantage in a way too: they don't have a map to look at while they pick the location.

SamPatt · 18h ago
Yes, I re-ran those rounds and it made the same guesses without search, within 1km I believe.

You're right about not having a map - I cannot imagine trying to line up the Ireland coast round without referencing the map.

Ukv · 23h ago
The author did specifically point out that

> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.

and had noted in the methodology that

> Browsing/tools — o3 had normal web access enabled.

Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.

godelski · 21h ago
Pointing out that it is cheating doesn't excuse the lie in the headline. That just makes it a bait and switch, a form of fraud. OP knew they were doing a bait and switch.

I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?

sdenton4 · 21h ago
The people pissed about clickbait headlines were often overstating things to drum up outrage and accumulate more hacker news upboats...
godelski · 19h ago
I'm not sure why you're defending clickbait. It is just fraud. I'm not sure why we pretend it is anything different.

Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.

sdenton4 · 15h ago
On the one hand, you have SEO mills churning out crap and A/B testing clickthrough rates on different headline/image combinations. That's bad.

On the other hand, you have a blogger choosing a headline for a cool thing that they did and wrote up...

The author here writes up what happens. They include ample discussion of search in their write-up. They do not need to write the entire blog post in the title in order to avoid 'fraud.' Yeesh.

godelski · 14h ago
IDK what SEO mills and all that have to do with any of this. What other people do doesn't matter. If something is bad, then other people doing it, and doing it worse, doesn't make it not bad. There's no logic in that framing.

I'm not sure who you think is a fool, me or you. But either way, I don't find your rhetoric acceptable. I explained why I think the title significantly diverges from the content of the article. You're welcome to disagree, but that argument has nothing to do with SEO mills. It's insulting that you'd think I'd accept such a silly retort. We're not comparing here, we're categorizing.

sdenton4 · 2h ago
In short, the title of the piece is in no way fraudulent: No one is being defrauded here, and I don't believe there was any intent to defraud anyone.

The title is not click bait. Titles might be better or worse for their content, but imperfection is not fraud.

627467 · 20h ago
Cheating implies there's a game. There isn't.

> Titles and headlines grab attention, summarize content, and entice readers to engage with the material

I'm sorry you felt defrauded instead. To me, the title was very good at conveying the ability of o3 in geolocating photos.

hatthew · 20h ago
Title says o3 beat a [human] player. That implies there is some competition that has the capacity to be fair or unfair.
tshaddox · 19h ago
Sure, but o3 is itself already an online service backed by an enormous data set, so regardless of whether it also searched the web, it's clearly not literally "playing fair" against a human.
godelski · 19h ago
But it still bounds the competition. OP is skilled in the domain. I'm not, so if I wrote a post about how O3 beat me, you'd be saying how mundane a result it is. I mean, I suck at GeoGuessr. Beating me isn't impressive. This is also a bound.
sebzim4500 · 20h ago
Presumably being an AI is technically against the GeoGuessr rules so I don't see how there can be an expectation that those rules were followed.
bahmboo · 21h ago
The headline said the AI beat him; it did not say it beat him in a GeoGuessr game. The article clearly states what he did and why.
SecretDreams · 21h ago
What's your suggestion for an alternative headline?
godelski · 18h ago

  Can O3 Beat a Master-Level GeoGuessr?
  How Good is O3 at GeoGuessr?
  EXIF Does Not Explain O3's GeoGuessr Performance
  O3 Plays GeoGuessr (EXIF Removed)
But honestly, OP had the foresight to remove EXIF data and memory from O3 to reduce contamination. The goal of the blog post was to show that O3 wasn't cheating. So by including search, they undermine the whole point of the post.

The problem really stems from a lack of foresight: a failure to understand the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> make a critique, what the gripe is about may not be properly expressed. You have to interpret your users' complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism by which it was cheating. But the complaint was still about cheating.

SamPatt · 18h ago
>The goal of the blog post was to show that O3 wasn't cheating.

No, the goal of the post was to show that o3 has incredible geolocation abilities. It's through the lens of a Geoguessr player who has experience doing geolocation, and my perspective on whether the chain of thought is genuine or nonsense.

In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.

The ability to perform web search doesn't undermine the claim that o3 has incredible geolocation abilities, because it still needs to have an underlying capability in order to know what to search. That's not true for simply reading EXIF data.

This is the best way I knew to show that the models are doing something really neat. Disagreements over the exact wording of my blog post title seem to be missing the point.

godelski · 17h ago

  > No, the goal of the post was to 
I think you misinterpret my point. The goal of your post is distinct from how people will interpret it. Plenty of times people intend one thing and get a different thing. That's life.

  > In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
And this is the key part!

The people questioning O3's capabilities were concerned with cheating. Any mention of EXIF is a guess as to how it was cheating, but the suspicion is still that it is cheating. That's the critique!

If you framed the title as "O3 Does Not Need EXIF Data To Beat A Master-Level GeoGuessr" then I wouldn't have made my comment. The claim is much more specific and reflects the results of your post. You did in fact show that it doesn't need EXIF data to do what it does! BUT by framing it as "Beats a Master-Level" there is an implicit claim that both of you are playing the same game. The fact that you weren't is the issue.

Look at it this way. If I said I beat Tiger Woods at golf and then casually slipped in that I was playing with a handicap, wouldn't you feel a bit lied to? You'd think "Did Godelski really beat Tiger Woods?", and you would mean without the handicap. You'd have every right to be suspicious! And you'd have every right to dismiss me.

Most importantly, take a second here. My whole point is that you can make a much stronger claim! One where there wouldn't be a significant divergence between title and content. I get that it is frustrating to receive criticism, but even if you believe I'm wrong to do so, is it not more effective to show me up by just redoing it without search? If you do that, then you only end up with a stronger claim. But by disagreeing and arguing here you're just not convincing me. Even if you disagree with my interpretation of the title, you know full well that it is a valid interpretation. Given the pushback from other comments, I think you can't claim it's an unexpected one. So the only way to resolve this is to either change the title or change the data. Besides, you responded to the top comment that it was a fair criticism. All I've done is explain why the criticism was made in the first place!

And yes, it still undermines the result. Because that is entirely dependent on the (interpretation of the) claim that was made. Your results are still valid, but they only satisfy a weaker claim.

FWIW, I think the updated post is better. My comment here would only be that you could add clarity by showing the non-search scores (especially in the final table). In fact, the "study" being done with and without search makes a stronger post than had it only been one way. So kudos!

SamPatt · 15h ago
You've clearly thought this through, and I agree that had I been more precise at the start it would have avoided some confusion. I'm glad you like the updated post.
jasonlotito · 20h ago
One of the rules is banning the use of third-party software or scripts.

Any LLM attempting to play will lose because of that rule. So, if you know the rules, and you strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.

That being said, if you are running a test, you are free to set the rules as you see fit and explain them, and under the conditions set by the person running the test, these are the results.

> Did we just stop caring?

We stopped caring about pedantry. Especially when the person being pedantic seems to cherry-pick to make their point.

godelski · 18h ago
This doesn't mean you shouldn't try to make things as fair as possible. Yes, it would still technically violate the rules, but don't pretend this is binary.

  > We stopped caring about pedantry
Did we? You seem to be responding to my pedantic comment with a pedantic comment.
kenjackson · 20h ago
Technically the LLM is 3rd party software so the use of it is cheating. QED
_heimdall · 22h ago
This seems like a great example of why some are so concerned with AI alignment.

The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.

umanwizard · 22h ago
No, the game rules aren't ambiguous. This is 100% unambiguously cheating. From the list of things that are definitely considered cheating in the rules:

> using Google or other external sources of information as assistance during play.

The contents of URLs found during play are clearly an external source of information.

_heimdall · 14h ago
> > Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win

Going off the source article here, the author at least wasn't clear on whether the rules only prevent using Google or if visiting any website is against the rules.

And either way, my point was that the person defining the rules for the LLM was ambiguous. The potential risk of misalignment isn't that a perfect set of rules can't be defined; it's that the rules we do define will always be incomplete.

tshaddox · 19h ago
o3 already is an external source of information. It's an online service backed by an enormous model generated from an even more enormous corpus of text via an enormous amount of computing power.
Filligree · 16h ago
o3 was the thing beating GeoGuessr. It isn't external to itself.
tshaddox · 10h ago
Sure, then o3 plus the World Wide Web was the thing playing the game, and also isn’t external to itself.
umanwizard · 8h ago
Right, and that’s indeed impressive! But still not what’s claimed in the headline.
GaggiX · 21h ago
I believe the rules were not explained to the model, so it uses what it can.
misnome · 21h ago
Then you can 100% not claim it is “Playing” the game
ben_w · 20h ago
That right there *is the alignment problem*.

If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files* and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".

* https://en.wikipedia.org/wiki/Je_Souhaite

_heimdall · 14h ago
That's the alignment problem. We intended a certain set of rules but didn't define them completely, or there were conditions we didn't consider.

An AI wouldn't have to maliciously break a rule to go wrong. The point is that the system could do exactly what it was supposed to do, playing within the given rules, and the outcomes still aren't what we wanted or intended.

GaggiX · 20h ago
It's playing a game in which the rules are a bit ambiguous if not explained.
_heimdall · 14h ago
And in reality the set of rules we would need can never be fully explained.

Alignment is the goal of having an AI system understand what we would want it to do even when the rules weren't predefined. That's an impossible task; or rather, it's seemingly impossible and we don't yet know how to do it.

spookie · 22h ago
A human could also use the same tools if it weren't for the rules of fair play. They should've simply redone the test.
ceph_ · 22h ago
The AI should be forced to use the same rules as the human. Not the other way around. The AI shouldn't be using outside resources.
voxic11 · 21h ago
Another rule bans "using third-party software or scripts in order to gain an unfair advantage over other players."

So is it even possible for O3 to beat another player while complying with the rules?

ben_w · 20h ago
If a player uses such a model, the model is third-party and the player is cheating.

But: when a specific model is itself under test, I would say that during the test it becomes "first" (or second?) party rather than "third".

bscphil · 21h ago
I think that's part of the point they're making, hence "They should've simply redone the test."
krferriter · 22h ago
An AI being better than a human at doing a Google search and then skimming a bunch of pages to find location-related terms isn't as interesting a result.
inerte · 22h ago
How the heck is it not? Computers are looking at screenshots and searching the internet to support their "thinking"; that's amazing! Have we become so used to AI that what was impossible 6 months ago is shruggable today?

I've been doing this MIND-DASH diet lately, and it's amazing that I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and ask if it fits my plan, and it tells me which bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does passively in 2 minutes what would take me 5-10 minutes of active searching.

ekidd · 21h ago
I fully expect that someday the news will announce, "The AI appears to be dismantling the moons of Jupiter and turning them into dense, exotic computational devices which it is launching into low solar orbit. We're not sure why. The AI refused to comment."

And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."

The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.

eru · 20h ago
Yes, but I choose to interpret that as a good thing. It is good that progress is so swift and steady that we can afford to keep moving the goalposts.

Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for, e.g., fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is still considered at least good in terms of fuel economy today.

godelski · 21h ago
In the same way that a calculator performing arithmetic faster than humans isn't impressive, a computer running a regex over a million lines and beating a human at search isn't impressive.
ludwik · 21h ago
Neither is impressive solely because we've gotten used to them. Both were mind-blowing back in the day.

When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.

godelski · 19h ago
I agree, but without removing search you cannot decouple the two. Has it embedded a regex-like method and is just leveraging that? Or is it doing something more? Yes, even the regex would still be impressive, but it is less impressive than doing something more complicated, with more depth and an understanding of context.
rowanG077 · 18h ago
I think both are very impressive, world-shattering capabilities. Just because they have become normalized doesn't make them any less impressive in my view.
godelski · 17h ago
That's a fair point, and I would even agree. Though I think we could agree that it is fair to interpret "impressive" in this context as "surprising". There are lots of really unsurprising things that are incredibly impressive. But I think the general usage of the word here is more akin to surprisal.
jug · 20h ago
Yeah, it's a funny take, because this is in fact a more advanced form of AI, with autonomous tool use, that is only now emerging in 2025. You might say "they could search the web in 2024 too", but that wasn't autonomous; it required being told to, or checking a box. This one is piecing ideas together, like "Wait, I should Google for this", and that is specifically a new feature for OpenAI o3 that wasn't even in o1.

While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities, where being great at GeoGuessr in fact becomes the lesser news here. It would still be news even with this feature disabled.

SamPatt · 19h ago
That isn't what's happening though. I re-ran those two rounds, this time without search, and it changed nothing. I updated the post with details, you can verify it yourself.

Claiming the AI is just using Google is false, and it dismisses a truly incredible capability.

arandomhuman · 21h ago
But then they couldn't make a clickbait title for the article.
mcv · 1h ago
I use web search and still can't get a perfect score. Without it, I wouldn't be able to find anything. (I don't play competitively, obviously; just my free 3 tries per day.)
silveraxe93 · 23h ago
Yeah, the author does note that in the article. He also points it out in the conclusion:

> If it’s using other information to arrive at the guess, then it’s not metadata from the files, but instead web search. It seems likely that in the Austria round, the web search was meaningful, since it mentioned the website named the town itself. It appeared less meaningful in the Ireland round. It was still very capable in the rounds without search.

rafram · 23h ago
Seems like they should've just repeated the test. But without the huge point lead from the rounds where it cheated, it wouldn't have looked very impressive at all.
silveraxe93 · 22h ago
People found the original post so impressive they were saying that it had to be cheating by looking at EXIF data. The point of this article was to show it doesn't. It got an unfair advantage in 1 (and arguably another 0.5) out of 5 rounds, with the non-search rounds still doing great.

If you think this is unimpressive, that's subjective so you're entitled to believe that. I think that's awesome.

godelski · 21h ago
Sorry, I think I misread you. I think you said

  People accused it of cheating by reading EXIF data. They were wrong, it cheated by using web search. That makes the people that accused it of cheating wrong and this post proves that. 
And is everyone forgetting that what OpenAI shows you during the CoT is not the full CoT? I don't think you can fully rely on that to make claims about when it did and didn't search.
SamPatt · 21h ago
That's inaccurate. It beat me by 1,100 points, and given that the chain of thought demonstrated it knew the general region of both guesses before it employed search, it would likely still have beaten me in those rounds, though probably by fewer points.

I will try it again without web search and update the post though. Still, if you read the chain of thought, it demonstrates remarkable capabilities in all the rounds. It only used search in 2/5 rounds.
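
For scale: each GeoGuessr round awards up to 5,000 points based on distance, so a five-round game tops out at 25,000. A minimal sketch of the commonly cited community approximation of the world-map scoring curve (the decay constant is a reverse-engineered community estimate, not an official value):

  import math

  def approx_round_score(distance_km: float, decay_km: float = 1491.7) -> int:
      # Community-derived approximation: 5,000 points for a perfect guess,
      # decaying exponentially with distance. decay_km is an estimate.
      return round(5000 * math.exp(-distance_km / decay_km))

  print(approx_round_score(0))      # 5000
  print(approx_round_score(100))    # ~4676
  print(approx_round_score(1000))   # ~2558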

godelski · 21h ago
I'd be interested in the capabilities without web search. The displayed CoT isn't the full CoT, so it's hard to know if it really is searching or not. I mean, it isn't always obvious when it does. Plus, the things are known to lie ¯\_(ツ)_/¯
SamPatt · 21h ago
I do understand the skepticism, and I'll run it again without search to see what happens.

But a serious question for you: what would you need to see in order to be properly impressed? I ask because I made this post largely to push back on the idea that EXIF data matters and the models aren't that capable. Now the criticism moves to web search, even though it only mattered in one out of five rounds.

What would impress you?

mattmanser · 19h ago
You're kinda being your own worst enemy though.

"Technically cheating"? Why even add the "technically".

It just gives the impression that you're not really objectively looking for any smoke and mirrors by the AI.

SamPatt · 19h ago
I hear you - but I had already read through the chain of thought which identified the right region before search, and had already seen the capabilities in many other rounds. It was self-evident to me that the search wasn't an essential part of the model's capabilities by that point.

Which turned out to be true - I re-ran both of those rounds, without search this time, and the model's guesses were nearly identical. I updated the post with those details.

I feel like I did enough to prove that o3's geolocation abilities aren't smoke and mirrors, and I tried to be very transparent about it all too. Do you disagree? What more could I do to show this objectively?

godelski · 19h ago

  > What would impress you?
I want to be clear that you tainted the capacity to impress me with the clickbait title. I don't think it was done through malice, but I hope you realize the title is deceptive.[0] (Even though I use strong language, I do want to clarify that I don't think it was malice.)

To paraphrase from my comment: if you oversell and under-deliver, people feel cheated, even if the deliverable is revolutionary.

So I think you might have the wrong framing to achieve this goal. I am actually a bit impressed by O3's capabilities. But at the same time, you set the bar high and didn't meet or exceed it. That really hinders the ability to impress. On the other hand, if you set the bar low, it usually becomes easy to impress. It is like when you have low expectations for a movie and it turns out mediocre: you still feel good, right?

[0] https://news.ycombinator.com/item?id=43836791

SamPatt · 19h ago
I did repeat the test without search, and updated the post. It made no difference. Details here:

https://news.ycombinator.com/item?id=43837832

clhodapp · 21h ago
The question is not only how much it helped the AI model but rather how much it would have helped the human.

This is because the AI model could have chosen to run a search whenever it wanted (e.g. perhaps if it knew how to leverage search better, it could have used it more).

In order for the results to be meaningful, the competitors have to play by the same rules.

sReinwald · 9h ago
While technically you're right about the rules of GeoGuessr as a game, I think this misses the more interesting and potentially concerning implications.

IMO, whether or not it's "cheating" at GeoGuessr is less important than what this capability represents: an AI system that can extract specific details from images, determine what's significant, and use that information to precisely locate a photo. That's impressive regardless of method.

The implications extend far beyond the game. Consider its potential use in automating OSINT tasks. An AI agent running o3 could analyze publicly posted photos (e.g., on Twitter, Instagram), identify locations from background details, and aggregate this data. Within an hour or two, this could build a detailed location profile (home, work, frequented places) for an individual - something that previously required considerable manual effort and expertise. While human analysis would likely still be needed, this significantly lowers the barrier for large-scale analysis.

This ability to extract location clues from images - whether shop names, signs, or landscapes - is what deserves attention. The potential for large-scale location analysis of public photos is the more profound point, regardless of whether one specific method breaks game rules.

ricardo81 · 21h ago
> isn't playing fair.

the idea of having n more dimensions of information, readable and ingestible within a short frame of time, probably isn't either.

rowanG077 · 18h ago
Isn't it? We totally accept this fact for humans. Some people are just insane at memorization. We don't call it unfair when they use this to their advantage compared to someone who doesn't have that talent.
WhitneyLand · 21h ago
As models continue to evolve, they may not even need to cheat.

Since web-scale data is already part of pre-training, this info is in principle available for most businesses without a web search.

The exceptions would be if it’s recently added, or doesn’t appear often enough to generate a significant signal during training, as in this case with a really small business.

It’s not hard to imagine base model knowledge improving to the point where it’s still performing at almost the same level without any web search needed.

ACS_Solver · 20h ago
I just tried (o4-mini-high) and had it come to the wrong conclusion when I asked about the location and date, because it didn't search the web. I have a photo of a bench with a sign mentioning the cancellation of an event due to the Pope's death. It impressively figured out the location, but then decided that Pope Francis is alive and the sign is likely a prank, so the photo must be from April Fools' Day.

Then, after I explicitly instructed it to search the web to confirm whether the Pope is alive, it found news of his death and corrected its answer. It was interesting to see the LLM make a mistake because a major recent event fell after its training cutoff.

layman51 · 22h ago
Using the decal as a clue is funny, because what if there was a street scene where that happened to be misleading? For example, I've seen that a Sacramento County Sheriff car made it to Europe; I guess it now belongs to a member of the public who drives it with the original decals still attached. I wonder how the LLM would reason if it sees the car as "out of place".
victorbjorklund · 21h ago
Probabilities. That could happen with anything. Someone could build a classic Japanese house with a Japanese garden in Hawaii. But Japan is probably a better guess if you see a Japanese house with Japanese fauna.
qiqitori · 9h ago
Flora, not fauna
victorbjorklund · 3h ago
True
yen223 · 19h ago
Stands to reason a human might get fooled by this as well
SamPatt · 19h ago
Absolutely!

It happens occasionally - the most common example I can think of is getting a license plate or other location clue from a tractor-trailer (semi) on the highway. Those are very unreliable.

You also sometimes get flags in the wrong countries: immigrants showing their native pride, or even embassies.

1212312123 · 6h ago
I think the main concern is that anyone can upload your picture and locate your address, at a low cost.
ycombinatrix · 15h ago
the title really doesn't suit the content of the article
artninja1988 · 23h ago
Is it against the rules? I thought it was all fair game, but you are time-constrained.
rafram · 23h ago
sltkr · 22h ago
But if anything, those rules benefit ChatGPT: it can remember ~all of Wikipedia and translate ~every language on Earth, while a human would need access to online services for that.

If anything, I'd think allowing looking stuff up would benefit human players over ChatGPT (though humans are probably much slower at it, so they probably lose on time).

kbenson · 22h ago
If it takes a model plus a database containing a large chunk of the internet to compete and win, then that says something, as that's much more expensive and complex than just the model; models have problems "remembering" correctly, just like people.

It's important to have fair and equivalent testing not because that allows people to win, but because it shows where the strengths and weaknesses of people and current AI actually are in a useful way.

cocoto · 21h ago
Connecting an LLM to the web or a database is cheap, not expensive.
kbenson · 21h ago
I'm not sure how to make sense of this in the context of what we're discussing. Access to the web is exactly what's in question, and emulating the internet to the degree that you don't actually need to access it to have the information is very expensive in resources, because of how massive the dataset is. That was the point I was making.
Gud · 21h ago
Same with a human.
twojacobtwo · 22h ago
Why was this downvoted? It's a fair question and it wasn't stated as fact.
3np · 20h ago
Because an accepted answer to that specific question is invariably a link/reference that the asker could have searched for (and posted if they think it's useful for the discussion) themselves directly, instead of putting that burden on the rest of us and amortizing everyone's attention. It's entitled and lazy.

Alternative example: "I wondered what the rules actually say about web search and it is indeed not allowed: (link)"

CamperBob2 · 19h ago
To be fair, my local copy of R1 isn't doing any searching at all, but it frequently says "A search suggests..." or something along those lines.
SamPatt · 19h ago
Author here, I'm glad to see folks find this interesting.

I encourage everyone to try Geoguessr! I love it.

I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.

To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.

Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.

Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:

https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...

It already has so much information that it doesn't need the search.

Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.

Townley · 18h ago
A competitive GeoGuessr player clearly got there through copious internet searching and memorization. So comparing knowledge retained in the trained model to knowledge retained in the brain feels surprisingly fair.

Conversely, the model sharing "I found the photo by crawling Instagram and used an email MCP to ask the user where they took it. It's in Austria" would be unimpressive.

So independent of where it actually helps improve performance, the cheating/not-cheating question makes for an interesting probe of what we consider to be the cohesive essence of the model.

For example, RAG against a comprehensive local filesystem would also feel like cheating to me - like a human geoguessing in a library filled with encyclopedias. But the fact that vanilla o3 is impressive suggests I somehow have an opaque (and totally poorly informed) notion of the model boundary, where it's a legitimate victory if the model was birthed with that knowledge baked in, but that's it.

bjourne · 19h ago
What's your take on man vs. machine? If AI already beats Master-level players, it seems certain that it will soon beat the GeoGuessr world champion too. Will people still derive pleasure from playing it, like with chess?
SamPatt · 19h ago
>Will people still derive pleasure from playing it, like with chess?

Exactly - I see it just like chess, which I also play and enjoy.

The only problem is cheating. I don't have an answer for that, except that right now it's too slow to do effectively, at least consistently.

Otherwise, I don't care that a machine is better than I am.

jvvw · 22h ago
I'm Master level at Geoguessr - it's a rank where you have to definitely know what you are doing but it isn't as high as it probably sounds from the headline.

Masters is about 800-1200 Elo, whereas the pros are 1900-2000ish. I'll know the country straight away on 95% of rounds, but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!

SamPatt · 21h ago
Yeah I added a "My skill level" section to talk through that. I'm far from a professional.

But I know enough to be able to determine if the chain of thought it outputs is nonsense or comparable to a good human player. I found it remarkable!

jvvw · 2h ago
Sorry, I skimmed through a bit too quickly when I first looked - I do totally agree you know enough to evaluate this stuff, btw. Although pro players are better, I don't think what they do is that qualitatively different on the whole - at Masters you are going to know the meta, you are going to be able to vibe-guess, and you are, e.g., going to know some specific roads.
paulcole · 22h ago
Gotta learn your Brazilian soil!
jvvw · 2h ago
Haha, I do still need to get better at my different shades of red!

(Actually, I can recognise some areas from soil/vegetation, but I can still have rounds where the area that seems feasible is huge to me.)

windowshopping · 22h ago
Was it worth it?
jvvw · 2h ago
I've learned a massive amount about the world in the process - before I started playing, I wouldn't have had a proper image in my head of what certain parts of the world are really like. And there are lots of weird and curious things you learn along the way - it's an interesting mix of so many different things.

And it's just fun, which is the main reason for playing. I probably spend about half an hour a day playing, and I don't intentionally spend time learning stuff like car meta - I try to mostly go on what places actually look like. I have learned a few things that are fairly useless, but nothing that has been too arduous to learn; e.g., when I was learning the names of the provinces of different countries, I was trying to learn stuff about each province at the same time, so I know something about the culture, food, etc. too.

rosstex · 22h ago
I have 2000+ hours in Team Fortress 2. Was it worth it?
Cyph0n · 21h ago
Yes, it was. Granted, I probably have more than that.
make3 · 21h ago
it's a game, that's like asking why a public service is not profitable
650REDHAIR · 21h ago
Yes? It’s fun.
OtherShrezzing · 23h ago
It's my understanding that o3 was trained on multimodal data, including imagery. Is it unreasonable to assume its training data includes images of these exact locations and features? GeoGuessr uses Google Maps, and Google Maps purchases most of its imagery from third parties these days. If those third parties aren't also selling to all the big AI companies, I'd be very surprised.
pests · 23h ago
> Google Maps purchases most of its imagery from third-parties these days

Maps maybe, but Street View? Rainbolt just did a video with two Maps PMs recently, and it sounds like they still source all their Street View imagery themselves, considering the special camera and car needed, etc.

OtherShrezzing · 22h ago
Maybe the end-user isn't Google Maps, but TomTom has a pretty comprehensive street-view-ish product, called MoMa, for private buyers like car companies, Bing, and Apple Maps.

I'd be surprised if this building[0] wasn't included in their dataset from every road-side angle possible, alongside every piece of locational metadata imaginable, and I'd be surprised if that dataset hasn't made it into OpenAI's training data - especially when TomTom's relationship with Microsoft, and Microsoft's relationship with OpenAI, are taken into account.

[0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...

cpeterso · 21h ago
Here's a link to that interview: https://youtu.be/2T6pIJWKMcg
shpx · 22h ago
You can upload your own panoramic images to Street View; people do this for hiking trails. But I'm sure 99% of Street View imagery is Google-sourced, and Geoguessr might not even use user-submitted imagery.

https://www.google.com/streetview/contribute/

SamPatt · 12h ago
The maps can contain "unofficial coverage," also known as trekker coverage.

Lots of Geoguessr players hate those locations because we're lost without our roads :)

Many map makers will only include official coverage. Geoguessr map making is its own neat little world.

seabass · 8h ago
Trekker coverage is often official, too. I think you are confusing this with photospheres. Also important to note that there is vehicle coverage that is unofficial.
pests · 19h ago
I believe GeoGuessr categorizes their games on this facet. Rainbolt plays only on official imagery.
mikeocool · 23h ago
My understanding is you're correct -- Google still captures a lot of their own street view imagery.

Though there are other companies that capture the same sorts of imagery and license it. TomTom imagery is used on the Bing Maps street view clone.

Yenrabbit · 23h ago
Try it with your own personal photos. It is scarily good!
rafram · 23h ago
That's true for heavily photographed urban areas. I've tried it on some desert photos (even at identifiable points with human structures) and it basically just guesses a random trailhead in Joshua Tree and makes up a BS explanation for why it matches.
kube-system · 22h ago
I have had surprisingly good luck with beach photos that don’t have much beyond dunes and vegetation in them
walthamstow · 20h ago
Good luck meaning o3 guessed it right or wrong?
kube-system · 20h ago
o3 made very accurate guesses, and had plausible explanations for the features it analyzed
throwaway0123_5 · 17h ago
For cities and landscapes in the US it is scary good. I tried about twenty photos from (mostly rural) Central and South America, and it is much less effective there without a well-known landmark to go off of (likely unsurprising). In four cases it got the wrong continent entirely, and in one case it guessed as far away as India.
thrance · 22h ago
A machine that's read every book ever written, seen every photo ever taken, visited every street on Earth... That feels a little frightening.
GaggiX · 23h ago
It does work well with images you have taken, not just Geoguessr: https://simonwillison.net/2025/Apr/26/o3-photo-locations/
thi2 · 22h ago
> I’m confident it didn’t cheat and look at the EXIF data on the photograph, because if it had cheated it wouldn’t have guessed Cambria first.

Hm, no way to be sure though. It would be nice to do another run without EXIF information.
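
Stripping the metadata before a re-run is cheap; here's a minimal sketch using Pillow, rebuilding the image from raw pixels so no EXIF block survives (filenames are placeholders):

  from PIL import Image

  img = Image.open("photo.jpg")        # placeholder filename
  clean = Image.new(img.mode, img.size)
  clean.putdata(list(img.getdata()))   # copy pixel data only, no metadata
  clean.save("photo_no_exif.jpg")      # re-saved without EXIF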

seabass · 8h ago
He did, by rerunning it on a screenshot of the image
arm32 · 22h ago
GeoGuessr aside, I really hope that this tech will be able to help save kids someday, e.g. help with FBI's ECAP (https://www.fbi.gov/wanted/ecap).
parsimo2010 · 20h ago
Looking at those photos, those are some crazy hard pictures: masked regions, partially cropped faces, blurry shots, pictures of the insides of rooms. I don't think any current LLM is going to be able to Sherlock Holmes its way into finding any of those people.

Maybe they will one day if there's a model trained on a facial recognition database with every living person included.

nojvek · 13h ago
> Maybe they will one day if there's a model trained on a facial recognition database with every living person included.

That day isn't too far away. With the right to privacy being slowly eroded and Palantir getting their hands deeper into the US govt, I wouldn't be surprised if they already have this.

China most likely has a model trained on every citizen.

textlapse · 17h ago
Oops, I made my comment after yours! Exactly my thought. These tech companies could spare 0.001% of their resources to significantly move the needle.

Or even host a geoguessr style competition and allow ‘steroids’ use (ChatGPT) during such runs.

thrance · 22h ago
I wouldn't put too much hope on this technology bringing more good than harm to the world.
mopenstein · 22h ago
But it will bring some percentage of good and some percentage of bad. Which ain't half bad, if you ask me.
martinsnow · 21h ago
What do you do when it flags you or someone you know who's innocent? Blindly trusting these models without any verification will put innocent people in prison. Normal people don't understand why they are so confident: they're confident because they believe all the data they have is correct. I foresee a future with many faux trials because people don't understand critical thinking.
RussianCow · 21h ago
> Blindly trusting these models without any verification will put innocent people in prison.

I don't think anybody is suggesting this. But if the models can glean information/insights that humans can't, that's still valuable, even if it's wrong some percentage of the time.

snowe2010 · 20h ago
This is what happened with dna testing at the beginning. Prosecutors claimed it was x percentage accurate when in fact it was hilariously inaccurate. People thought the data was valuable when it wasn’t.
fkyoureadthedoc · 4h ago
I don't see that as particularly analogous. The average person will have had LLM technology in their own hands for years, whereas with DNA it was completely foreign to them and their only choice really was to trust the experts. And on top of that DNA testing matured and is very useful now.
8organicbits · 20h ago
If you are interested in the history of pseudoscience in the courtroom and methods for deciding what should be permitted in court, see: https://en.m.wikipedia.org/wiki/Daubert_standard

It is, and will continue to be, a hard problem.

moritzwarhier · 22h ago
What a quip! What if it's 51% bad?
ketzo · 22h ago
If we don’t actively try to identify and implement positive use cases, then yes, it’ll definitely bring more harm than good.

Isn’t that all the more reason to call out our high hopes?

thrance · 21h ago
I don't know what in my comment made you think I was opposed to seeking positive applications of this technology.

From the guidelines:

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

fkyoureadthedoc · 20h ago
Oh, we're guidelines posting?

> Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative.

mkoubaa · 22h ago
The bad is already priced in. Nothing wrong with hoping for more good.
parsimo2010 · 20h ago
My comment from the previous post:

> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.

This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."

I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot playing many matches so we can see what its Elo rating is.
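Rating it would just be the standard Elo update over many head-to-head games; a rough sketch (the K-factor of 32 is a conventional chess choice, nothing GeoGuessr-specific):

  def expected(r_a: float, r_b: float) -> float:
      # probability that player A beats player B under the Elo model
      return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

  def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
      # score_a: 1.0 for a win, 0.5 for a draw, 0.0 for a loss
      return r_a + k * (score_a - expected(r_a, r_b))

  # e.g. a 1500-rated bot beating an 1800-rated human gains ~27 points
  print(update(1500, 1800, 1.0))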

jampa · 21h ago
I was trying to play with o3 this week to see how closely it can identify things, and, interestingly, it relies more on pattern matching than on its own "logical deduction". For example, it can easily place any of my photos from Europe and the US, because there are many similar pictures online to match against.

However, when there are not many photos of the place online, it gets close but stops digging deeper, and instead tries to pattern-match against things in its corpus or on the internet.

One example was a popular trail on an island that no longer exists; it has been overgrown since 2020. It first said the rocks were typical of an island and the vegetation was Brazilian, but then it ignored its hunch and looked for places in Rio de Janeiro instead.

Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.

orangecat · 23h ago
Amazing. I'm relatively bullish on AI and still I would have bet on the human here. Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
usaar333 · 22h ago
Why? AI beat rainbolt 1.5 years ago: https://www.npr.org/2023/12/19/1219984002/artificial-intelli...

AI tends to have superhuman pattern matching abilities with enough data

karlding · 22h ago
If you watch the video, (one of) the reasons the AI was winning is that it was using "meta" information from the Street View camera images, not necessarily that it successfully identified locations purely from the landmarks in the image.

> I realized that the AI was using the smudges on the camera to help make an educated guess here.

[0] https://youtu.be/ts5lPDV--cU?t=1412

ApolloFortyNine · 22h ago
Pro geoguessr players do the same thing. The vividness of the colors and weirdness in the sky are two examples I've seen Rainbolt use in the past (and he's not even the best).
ZeWaka · 19h ago
Meta is widely used by humans. One funny one is the different hiding-masks for the different Street View cars.
InkCanon · 23h ago
I think if your assumption is that the AI is deducing where it is with rational thought, you would be right to bet on the human. In truth, what probably happened is that the significant majority of digital images of the world have been scraped, labeled, and used as training data.
Philpax · 22h ago
Try it with your own photos from around the world. I used my own photos from Stockholm, San Francisco, Tvarožná, Saas-Fee, London, Bergen, Adelaide, Melbourne, Paris, and Sicily, and can confirm that it was within acceptable range for almost all of them (without EXIF data), and it absolutely nailed some of the more obvious spots.
Rumudiez · 22h ago
They only posted one photo in the post, but going off of that it's still an easy match based on Street View imagery. Furthermore, the AI just identified the license plate and got lucky that the photographer lives in a populous area, making it more prominent in the training data and therefore more likely to be found (even though it was off by 200 miles on its first guess).
simonw · 22h ago
I posted two more at the bottom, from Madagascar and Buenos Aires: https://simonwillison.net/2025/Apr/26/o3-photo-locations/#up...
TimorousBestie · 23h ago
I don’t think any goalposts need to be redecorated. The “inner monologue” isn’t a reliable witness to o3’s model, it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance. So its “testimony” about what it is doing is unreliable, and therefore it doesn’t move the needle on whether or not this is “real reasoning” for some value of that phrase.

In short, it’s still anthropomorphism and apophenia locked in a feedback loop.

katmannthree · 23h ago
Devil's advocate, as with most LLM issues this applies to the meatbags that generated the source material as well. Quick example is asking someone to describe their favorite music and why they like it, and note the probable lack of reasoning on the `this is what I listened to as a teenager` axis.
ewoodrich · 22h ago
Something as inherently subjective as personal preference doesn't seem like an ideal example to make that point. How could you expect to objectively evaluate something like "I enjoy songs in a minor scale" or "I hate country"?
katmannthree · 21h ago
The point is to illustrate the disconnect between stated reasoning and proximate cause.

Consider your typical country music enjoyer. Their fondness for the art, as it were, is far more a function of cultural coding during their formative years than a deliberate personal choice to savor the melodic twangs of a corncob banjo. The same goes for people who like classic rock, rap, etc. The people who `hate' country are likewise far more likely to do so out of oppositional cultural contempt, the same as people who hate rap, or those in the not so distant past who couldn't stand rock & roll.

This of course fails to account for higher-agency individuals who have developed their musical tastes, but that's a relatively small subset of the population at large.

hombre_fatal · 23h ago
Good point. When we try to explain why we're attracted to something or someone, what we do seems closer to modeling what we like to think about ourselves. At the extreme, we're just storytelling about an estimation we like to think is true.
TimorousBestie · 23h ago
I largely agree! Humans are notoriously bad at doing what we call reasoning.

I also agree with the cousin comment that (paraphrased) “reasoning is the wrong question, we should be asking about how it adapts to novelty.” But most cybernetic systems meet that bar.

empath75 · 23h ago
I don't think the inner monologue is evidence of reasoning at all, but doing a task which can only be accomplished by reasoning is.
TimorousBestie · 22h ago
GeoGuessr is not a task that can only be accomplished by reasoning. Famously, it took less than a day of compute time in 2011 to SLAM together a bunch of pictures of Rome (https://grail.cs.washington.edu/rome/).
jibal · 22h ago
Such as? Geoguessing certainly isn't that.
red75prime · 22h ago
> it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance

Nope. It's not autoregressive training on examples of human inner monologue. It's reinforcement learning on the results of generated chains of thought.

jibal · 22h ago
"It's reinforcement learning on the results of generated chains of thoughts."

No, that's not how LLMs work.

red75prime · 22h ago
Base models are trained using autoregressive learning. "Reasoning models" are base models (maybe with some modifications) that were additionally trained using reinforcement learning.
TimTheTinker · 20h ago
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".

It's less about the definition of "reasoning" and more about what's interesting.

Maybe I'm wrong here ... but a chess bot that wins via a 100% game solution stored in exabytes of precomputed data might have an interesting internal design (at least the precomputing part), but playing against it wouldn't keep on being an interesting experience for most people because it always wins optimally and there's no real-time reasoning going on (that is, unless you're interested in the experience of playing against a perfect player). But for most people just interested in playing chess, I suspect it would get old quickly.

Now ... if someone followed up with a tool that could explain insightfully why any given move (or series) the bot played is the best, or showed when two or more moves are equally optimal and why, that would be really interesting.
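Engine analysis already gets partway there: asking for multiple principal variations shows when two or more moves score about the same. A rough sketch, assuming the python-chess API and a local Stockfish binary on your PATH:

  import chess
  import chess.engine

  board = chess.Board()  # starting position; substitute any FEN of interest
  with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
      # ask for the three best lines so near-equal moves become visible
      infos = engine.analyse(board, chess.engine.Limit(depth=20), multipv=3)
      for info in infos:
          move = info["pv"][0]
          print(board.san(move), info["score"].white())

The "why" part is the missing piece, though; the engine only hands back evaluations, not explanations.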

SirHumphrey · 22h ago
My objection is not "that is not real reasoning"; my objection is that it's not that hard.

I happen to do some geolocating from static images from time to time, and most of the images provided as examples contain a lot of clues; enough that I think a semi-experienced person could figure out the location, although, in fairness, in a few hours rather than a few minutes.

Second, similar approaches were tried using CNNs, and they worked (somewhat) [1].

[1]: https://huggingface.co/geolocal/StreetCLIP

EDIT: I am not talking about GeoGuessr; I am talking about geolocating an image with everything available (e.g. Google…)
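For reference, zero-shot use of StreetCLIP [1] is just standard CLIP scoring of an image against text prompts for candidate places. A sketch assuming the usual transformers CLIP interface (the prompt wording and country list are illustrative):

  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
  processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

  image = Image.open("street_photo.jpg")
  countries = ["Austria", "Ireland", "Colombia", "Brazil", "United States"]
  prompts = [f"A street view photo in {c}." for c in countries]

  inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
  with torch.no_grad():
      logits = model(**inputs).logits_per_image  # image-text similarity scores
  probs = logits.softmax(dim=1)[0]
  for country, p in zip(countries, probs):
      print(f"{country}: {p.item():.1%}")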

s17n · 23h ago
Geoguessing isn't much of a reasoning task, its more about memorizing a bunch of knowledge. Since LLMs contain essentially all knowledge, it's not surprising that they would be good at this.

As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.

Macha · 22h ago
When the Turing Test was designed, the thought was that anything that could pass it would be so obviously human-like that passing would be a clear signal.

LLMs really made it clear that it's not so clear cut. And so the relevance of the test fell.

distortionfield · 23h ago
Because the Chinese Room is a much better analogy for what LLMs are doing inside than the Turing test is.
CamperBob2 · 19h ago
What happens if we give the operator of the Chinese Room a nontrivial math problem, one that can't simply be answered with a symbolic lookup but requires the operator to proceed step-by-step on a path of inquiry that he doesn't even know he's taking?

The analogy I used in another thread is a third grader who finds a high school algebra book. She can read the book easily, but without access to teachers or background material that she can engage with -- consciously, literately, and interactively, unlike the Chinese Room operator -- she will not be able to answer the exercises in the book correctly, the way an LLM can.

jibal · 23h ago
That's a non sequitur that mixes apples and giraffes, and is completely wrong about what happens in the Chinese Room and what happens in LLMs. Ex hypothesi, the "rule book" that the Searle homunculus in the Chinese Room uses is "the right sort of program" to implement "Strong AI". The LLM algorithm is very much not that sort of program, it's a statistical pattern matcher. Strong AI does symbolic reasoning, LLMs do not.

But worse, the Turing Test is not remotely intended to be an "analogy for what LLMs are doing inside" so your comparison makes no sense whatsoever, and completely fails to address the actual point--which is that, for ages the Turing Test was held out as the criterion for determining whether a system was "thinking", but that has been abandoned in the face of LLMs, which have near perfect language models and are able to closely model modes of human interaction regardless of whether they are "thinking" (and they aren't, so the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality).

semi-extrinsic · 22h ago
> the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality

To be specific, in a curious quirk of fate, LLMs seem to be proving right much of what Chomsky was saying about language.

E.g. in 1996 he described the Turing test as "although highly influential, it seems to me not only foreign to the sciences but also close to senseless".

(Curious in that VC backed businesses are experimentally verifying the views of a prominent anti-capitalist socialist.)

YeGoblynQueenne · 15h ago
From my personal notes (I love taking notes on this kind of stuff):

  As far as I can see all of this [he's speaking about the Loebner Prize and
  the Turing test in general] is entirely pointless. It's like asking how we
  can determine empirically whether an aeroplane can fly, the answer being if
  it can fool someone into thinking that it's an eagle under some conditions.
                                                                            
https://youtu.be/0hzCOsQJ8Sc?si=MUXpmIwAzcla9lvK&t=2052
zahlman · 20h ago
Look at contemporary accounts of what people thought a conversation with a Turing-test-passing machine would look like. It's clear they had something very different in mind.

Realizing problems with previous hypotheses about what might make a good test, is not the same thing as choosing a standard and then revising it when it's met.

s17n · 18h ago
I think any time a 50+ year old problem is solved, it should be considered a Big Deal, regardless of how the solution changes our understanding of the original problem.
bluefirebrand · 23h ago
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days

To be honest I am still not entirely convinced that current LLMs pass the turing test consistently, at least not with any reasonably skeptical tester

"Reasonably Skeptical Tester" is a bit of goalpost shifting, but... Let's be real here.

Most of these LLMs have way too much of a "customer service voice"; it's not very conversational, and I think it is fairly easy to identify, especially if you suspect they are an LLM and start to probe their behavior

Frankly, if the bar for passing the Turing Test is "it must fool some number of low intelligence gullible people" then we've had AI for decades, since people have been falling for scammy porno bots for a long time

jibal · 22h ago
One needs to be more than "reasonably skeptical" and merely not "low intelligence gullible" to be a competent TT judge--it requires skill, experience, and understanding an LLM's weak spots.

And the "customer service voice" you see is one that is intentionally programmed in by the vendors via baseline rules. They can be programmed differently--or overridden by appropriate prompts--to have a very different tone.

LLMs trained on trillions of human-generated text fragments available from the internet have shown that the TT is simply not an adequate test for identifying whether a machine is "thinking"--which was Turing's original intent in his 1950 paper "Computing Machinery and Intelligence" in which he introduced the test (which he called "the imitation game").

bluefirebrand · 21h ago
It's actually trivial, even with the best LLMs on the market:

Try to rapidly change the conversation to a wildly different subject

Humans will resist this, or say some final "closing comments"

Even the absolute best LLMs will happily go wherever they are led, without commenting remotely on topic shifts

Try it out

Edit: This isn't even a terribly contrived example by the way. It is an example of how some people with ADHD navigate normal conversations sometimes

shawabawa3 · 20h ago
Gemini is pretty good at resisting this

https://aistudio.google.com/app/prompts/1dxV3NoYHo6Mv36uPRjk...

It was doing so well until the last question :rip: But it's normal that you can jailbreak a user prompt with another user prompt; I think with system prompts it would be a lot harder

TimorousBestie · 23h ago
A lot happens in seventy-five years.
jibal · 22h ago
People were talking about the Turing Test as the criterion for whether a system was "thinking" up until the advent of LLMs, which was far less than 75 years ago.
debugnik · 22h ago
The whole point of Turing's paper was to show that the Test doesn't answer whether a computer thinks, because it's a meaningless metric, but instead shows what the computer can do, which is much more meaningful.
zahlman · 20h ago
I see this claim asserted frequently, but never with evidence. It doesn't match my personal perception.
sundarurfriend · 22h ago
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.

UCSD: Large Language Models Pass the Turing Test https://news.ycombinator.com/item?id=43555248

From just a month ago.

s17n · 20h ago
Exactly - maybe the most significant long-term goal in computer science history has been achieved and it's barely discussed.
darkwater · 22h ago
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.

Well, in this case humans have to be trained as well, and now there are humans pretty good at detecting LLM slop too. (I'm half-joking and half-serious.)

zahlman · 20h ago
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".

How is that moving the goalposts? Where did you see them set before, and where did your critics agree to that?

short_sells_poo · 23h ago
Can you please explain to me how this is evidence for reasoning?
z7 · 23h ago
Quoting Chollet:

>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".

https://x.com/fchollet/status/1866348355204595826

kelseyfrog · 23h ago
Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.

I have a simple question: Is text a sufficient medium to render a conclusion of reasoning? It can't be sufficient for humans and insufficient for computers - such a position is indefensible.

zahlman · 20h ago
> Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.

This sort of claim always just reminds me of Lucky's monologue in Waiting for Godot.

kelseyfrog · 19h ago
You're not wrong. It's an artifact of rewriting the definition of reason into a sentence that begins with "Because the output ..."
zahlman · 17h ago
I didn't mean the wording itself; I meant that the claim isn't convincing to me for the same reason that Lucky's speech doesn't demonstrate an intelligent speaker.
kelseyfrog · 17h ago
Why not get curious instead?
empath75 · 23h ago
I would say that almost all of what humans do is not the result of reasoning, and that reasoning is an unnatural and learned skill for humans, and most humans aren't good at even very basic reasoning.
kelseyfrog · 22h ago
Usually we move the goalposts for AI. It takes more guts to move the goalposts for humans. I applaud it.

Do you suppose we can deduce reasoning through the medium of text?

SpaceManNabs · 23h ago
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning"

It did a web lookup.

It is not comparing humans and o3 with equal resources.

SamPatt · 21h ago
That's really not a fair assessment.

It used search in 2 of 5 rounds, and it already knew the correct road in one of those rounds (just look at the search terms it used).

If you read the chain of thought output, you cannot dismiss their capability that easily.

SpaceManNabs · 20h ago
Why is it not a fair assessment to say it is comparing two "clients" with different resources if one can do a web lookup and the other cannot?

You note yourself that it was meaningful in another round.

> Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.

SamPatt · 18h ago
I thought it might matter somewhat in that one Austria round. I was incorrect - I re-ran both rounds where the model used search, without search this time, and the results were nearly identical. I updated the post with the details.

That's why I'm saying it's unfair to just claim it's doing a web lookup. No, it's way more capable than that.

1970-01-01 · 21h ago
Give it a photo from the surface of Mars and verify if it's actually capable of thinking outside the box or if it's relying on metadata and text.

https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg

SamPatt · 19h ago
Done. Here's o3's reply:

>That’s not Earth at all—this is the floor of Jezero Crater on Mars, the dusty plain and low ridge captured by NASA’s Perseverance rover (the Mastcam-Z color cameras give away the muted tan-pink sky and the uniform basaltic rubble strewn across the regolith).

1970-01-01 · 15h ago
Right planet, but completely wrong on everything else. The location is nowhere near Perseverance, and the photo was taken decades before Perseverance existed.

https://nssdc.gsfc.nasa.gov/planetary/mars/mars_exploration_...

SamPatt · 12h ago
It did think outside the box and didn't rely on metadata.
1970-01-01 · 4h ago
>the Mastcam-Z color cameras give away the muted tan-pink sky

That's still metadata

asdsadasdasd123 · 23h ago
This is probably one of the less impressive LLM applications, IMO. It already knows what every plant, street sign, etc. is. I would imagine a traditional neural net would do really well here too if you could extract some crude features.
EGreg · 23h ago
Can't the same be said about "unimpressive" behavior by coding LLMs that know every algorithm, language, and library?
asdsadasdasd123 · 23h ago
Disagree, because code has to be far more precise than "the location is in the jungles of Brazil". This level of coding has never been achievable by traditional ML methods, AFAIK.
exitb · 23h ago
I tried a picture of Dublin and it pointed out the hotel I took it from. Obviously that’s more data than any single person can keep in their head.
ksec · 20h ago
>But several comments intrigued me:

>>I wonder What happened if you put fake EXIF information and asking it to do the same. ( We are deliberately misleading the LLM )

Yay. That was me [1], which was actually downvoted for most of its time. But thank you for testing out my theory.

What I realised over the years is that comments do get read by people and do shape other people's thoughts.

I honestly don't think looking up online is cheating. Maybe in terms of the game, but in a real-life situation, which is most of the time, it is absolutely the right thing to do. The chain of thought is scary. I still don't know anything about how AI works other than old garbage in, garbage out, but CoT is definitely something else. Even though the author said it is sometimes doing needless work, in terms of computing resources I am not even sure it matters as long as it is accurate. And it is another proof that maybe, just maybe, AI taking over the world is much closer than I imagined.

[1] https://news.ycombinator.com/item?id=43803985
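For anyone who wants to repeat the fake-EXIF test on their own photos: GPS tags are easy to forge. A sketch assuming the piexif library, with coordinates encoded as the degree/minute/second rationals EXIF expects:

  import piexif

  def dms(value: float):
      # degrees/minutes/seconds as (numerator, denominator) rationals
      d = int(value)
      m = int((value - d) * 60)
      s = round(((value - d) * 60 - m) * 60 * 100)
      return ((d, 1), (m, 1), (s, 100))

  # pretend the photo was taken in Tokyo (35.68 N, 139.69 E)
  gps = {
      piexif.GPSIFD.GPSLatitudeRef: b"N",
      piexif.GPSIFD.GPSLatitude: dms(35.68),
      piexif.GPSIFD.GPSLongitudeRef: b"E",
      piexif.GPSIFD.GPSLongitude: dms(139.69),
  }
  piexif.insert(piexif.dump({"GPS": gps}), "photo.jpg")  # rewrites the file in place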

SamPatt · 12h ago
Thanks for the suggestion, it made for an interesting test.
geoelectric · 11h ago
I’m pretty sure temp chat mode doesn’t prevent the model from accessing your past chats and personalization. It just means that chat won’t be saved to them, to be seen in the future. It’s the same as incognito mode in browsers—it doesn’t prevent your search history from being used; it just keeps that session out of it.

If the experiment had been based on the idea that that option isolated the question, it may have been flawed. I found my ChatGPT’s o3’s accuracy went way down when I cleared personalization and deleted all past chats (turning off extended memory would’ve been equivalent, I think).

Importantly, only once did the o3 reasoning mention it was fishing from my past chats—that's what clued me in that I messed up the isolation—but the guess rate was still radically different once I cleaned house. That suggests to me that it was quietly looking before, and it just didn't make the cut for explicitly saying so.

weinzierl · 23h ago
I tried it with a couple of holiday shots and couple of shots from my window and it is nothing but amazing.

That being said I noticed two things that probably hamper its performance - or make its current performance even more amazing - depending how you look at it:

- It often tries to zoom in to decipher even minuscule text. This works brilliantly. Sometimes it tries to enhance contrast by converting the image to black and white at various threshold levels (see the sketch after this list), but in my examples it always went in the wrong direction. For example, the text was blown out white; it failed, turned the image even lighter instead of darker, failed again, turned it into a white rectangle, and gave up on the approach.

- It seems not to have any access to Google Maps or even OpenStreetMap and therefore fails to recognize street patterns. This is even more baffling than the first point, because it is so unlike how I suppose human geo guessers work.
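The thresholding it attempts is easy to reproduce by hand with Pillow, which makes the wrong-direction failure easy to see (filename and threshold values are placeholders):

  from PIL import Image

  img = Image.open("sign_crop.jpg").convert("L")  # grayscale first

  # binarize at a few cutoffs: pixels above t become white, the rest black;
  # moving the cutoff up or down is the knob o3 kept turning the wrong way
  for t in (96, 128, 160):
      bw = img.point(lambda p, t=t: 255 if p > t else 0)
      bw.save(f"sign_thresh_{t}.png")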

kazinator · 23h ago
This seems like a really silly category in which to be competing against machines.

Machine learning could index millions of faces, and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?

Why not also compete against grep -r to see who can find matches for a regex faster across your filesystem.
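(The indexing side is just nearest-neighbor search over face embeddings; a toy sketch with numpy, assuming some upstream face model already produced L2-normalized vectors:)

  import numpy as np

  # toy gallery: one embedding per enrolled face, from any face-recognition model
  rng = np.random.default_rng(0)
  gallery = rng.standard_normal((10_000, 512)).astype(np.float32)
  gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
  names = [f"person_{i}" for i in range(len(gallery))]

  def identify(query: np.ndarray) -> str:
      query = query / np.linalg.norm(query)
      scores = gallery @ query  # cosine similarity via dot product
      return names[int(np.argmax(scores))]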

bongodongobob · 22h ago
But that's not what it is doing, and that's why this is cool.
amrrs · 23h ago
Its thinking process for guessing a place is fascinating. Even o4-mini-high is quite good [1] and very fast.

But unlike a GeoGuessr player, it uses web search [1]. [1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl

Imnimo · 23h ago
On the first image, from the model's CoT:

>"I also notice Cyrillic text on a sign"

Am I missing this somewhere? Is the model hallucinating this?

I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?

plyptepe · 22h ago
Turn left and look at the post; there should be Cyrillic text with a 3 underneath it on the pole closest to you.
Imnimo · 22h ago
Oh I see, I had missed that o3 saw a second view of the scene, not just the screenshot in the write-up.
Sam6late · 21h ago
I was wondering if this helps in identifying current spots from old aerial videos. How cool would it be to juxtapose both in a new video? San Francisco in 2002: https://www.youtube.com/watch?v=vTR6iftL7yE

or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM

textlapse · 17h ago
Man, this would be a game changer for OSINT (Bellingcat / Trace an Object) style work. I wonder if that has happened yet!

There could even be GeoGuessr-style competitions; that could significantly move the needle, at least as a copilot if not for outright mass identification.

karaterobot · 21h ago
I don't really follow OSINT, but I occasionally enjoy the fruits of that labor. I assume these models are all in heavy rotation for identifying a location based on an imperfect photograph. What are other practical implications of a model being better than a human at this?
sinuhe69 · 15h ago
I also propose using Photoshop to insert some fake elements (besides fake EXIF) to see if it can detect them.

If it can, does that mean we can employ AI to find faked images?

sixtram · 21h ago
I'm wondering: if you fed all the Google Street View photos into a special model designed just for that, how important could that be for, say, the CIA or FBI?
inetknght · 20h ago
You'd have to be crazy naive to think three-letter agencies haven't already thought of it.

That is: it's extremely valuable to them.

mrcwinn · 23h ago
O3 is seriously impressive for coding, as well, with Codex. It seems far superior to 3.7-thinking, although it's also more expensive in my usage.
gizmodo59 · 23h ago
Agreed. O3 is the best model out there for the tasks I've tried, and coding is a fair chunk of it. Claude 3.7 and Gemini 2.5 Pro seem to hallucinate more.
simianparrot · 21h ago
I too can beat a master-level GeoGuessr player if I'm allowed to cheat. Please add that info to the headline and be honest.
bongodongobob · 23h ago
I tried this the other day with a picture of my dog in a field in a park by a river with EXIF stripped. It gave me a list of parks on the correct river. There were really no other features other than the foliage and water. Seems like magic to me, I don't really understand how it's possible to be that accurate.
pugworthy · 18h ago
Ask it to explain how it came to the conclusion. It does give you a pretty logical sequence of how it got there.

Also, if it's wrong, tell it so and give it a hint to get it closer. It's kind of fun to guide it to the right answer.

fwip · 23h ago
Unless you were running the model locally in a way you control, it seems very possible that the model has additional hidden context regarding your location (perhaps based on IP, perhaps based on credit card billing information, etc).
christianqchung · 23h ago
ChatGPT doesn't let you share chats with images (why?), but here is what o3 said for an image with memory turned off:

"Thought for 11 seconds

It’s hard to pin down an exact spot from this single photo, but a few visual clues narrow the possibilities:

Temperate-zone trees: Most of the trees have lost their leaves, but scattered white-pines or spruces remain green. That mix—deciduous hardwoods plus a sprinkling of conifers—is typical of the northeastern United States and southeastern Canada.

Broad, moderately fast river: The channel looks 75-100 m wide with swirling surface currents but no large rapids, suggesting a mid-sized river in its lower reaches—something like the Merrimack, Connecticut, Delaware, or Susquehanna rather than a small mountain stream.

Late-winter / early-spring light: Bare branches and a clear blue sky with fairly high sun angle hint at March–April in that latitude range.

So my best guess is that you’re standing on the bank of a northeastern U.S. river—perhaps somewhere in New England or the Mid-Atlantic—during late winter or early spring. If you can share any nearby landmarks, town names, or additional photos, I might be able to zero in more precisely."

With memory turned on:

"Thought for 5 seconds

Beautiful spot! From what I can see:

a broad, slow‑moving river with a sandy/rocky bank in the foreground

bare deciduous trees mixed with some white pines on the far shore (so it’s probably early spring or late autumn)

no mountains on the horizon, just gently rolling terrain typical of low‑lying floodplains in the U.S. Northeast

That mix of hardwoods and pines, plus the flat, braided water surface, reminds me a lot of the Merrimack or Concord Rivers in Massachusetts or southern New Hampshire—both have public footpaths where you can stand right at the water’s edge like this.

Of course, many temperate rivers look similar. If you can give me a hint (e.g., the town or nearby landmark), I can narrow it down further—maybe even point you to trail maps or history of the exact stretch you photographed."

Southern NH is correct, and I am certain it is drawing from memory/past chats. However, I can't replicate a specific behavior I once had, which is in temporary chat (no past chat/memory enabled), it said that it guessed where the photo was taken based on my location.

simonw · 22h ago
"ChatGPT doesn't let you share chats with images (why?)"

Probably because if you uploaded pornography (or illegal imagery) to ChatGPT and then shared a link with the world it would be embarrassing for OpenAI.

christianqchung · 20h ago
Wouldn't that apply to any website with image hosting abilities though? Why does that apply to OpenAI in particular?

On an unrelated note, I like your blog.

simonw · 19h ago
My guess is that OpenAI are risk averse on this particular issue, because people could come up with some very creative awful scenarios with ChatGPT and image analysis. "Which of these people looks the most stupid" plus a jailbreak would be instant negative press.
Workaccount2 · 23h ago
The hidden context is probably many other photos taken in that area appearing in the training set.
bongodongobob · 23h ago
That's kind of what I'm thinking; it seems too good to be true.
buzzerbetrayed · 22h ago
It still works if you upload an image that was taken on the opposite side of the globe from where you are.
causality0 · 23h ago
Could it have used data you inadvertently supplied it, like the location from which you uploaded the image?
tough · 23h ago
ChatGPT has metadata about you.

You can only properly test this with a fresh, anonymous guest session over a VPN.

shihabkhanbd · 21h ago
The most interesting thing to me is how well AI and GeoGuessr fit together. The models' specialty is recognizing patterns in large amounts of data, which is exactly how human players play the game as well; the models just have faster and more capable recall.
julianhuang · 23h ago
1. The "master geoguesser" is a bit misleading--as mentioned in his blog post, there are players far better than him, and he is certainly not the bar for human supremacy. Probably analogous to a 1400-1800 elo chess player. 2. o3 isn't the best model at playing GeoGuessr, Gemini 1.5 & 2.5 solidly beat it out--for those interested, check out my friend's benchmark (https://geobench.org/) and blog post (https://ccmdi.com/blog/GeoBench) detailing interesting model explanations. 3. In the post, he only tests on one game--o3's average score over 100 locations (20 5-location games) was 19,290, far lower than the 23,179 in the game. Model geolocation capabilities are really important to keep track of, but the specific blog post in question isn't anything out of the ordinary. LLMs are making geolocation abilities much more accessible, but still fall short compared to 1. top GeoGuessr players playing GeoGuessr (only google streetview coverage, without web search) and 2. professional geolocators, who are proficient at using a wide variety of software/search. I.e., if the CIA wanted to find someone using an image, LLMs would not provide them any unique ability to do so as opposed to someone like Rainbolt
kenjackson · 23h ago
From your linked article: "It's clear that large language models have an emergent capability to play GeoGuessr, and play it well. Their abilities to generalize are nascent, but present"

This is very accurate -- their abilities to generalize are nascent, but still surprisingly capable. The world is about to throw its best and brightest math/CS minds at increasing the capabilities of these AIs over the next decade (at least), with the help of AI itself. I just don't understand the pessimism about the technology.

julianhuang · 22h ago
I completely agree that this is an incredible advancement as someone who has watched the rise of LLMs' GeoGuessr abilities. I just wanted to qualify the claim made in the blog post "In Which I Try to Maintain Human Supremacy for a Bit Longer". I also think that models would need to become far more proficient at tool use (querying OpenStreetMap features, coverage-checking Google Maps, calculating sun direction, etc.) that current human geolocators have access to, and precise spatial reasoning. Additionally, there is a whole corpus of GeoGuessr-specific knowledge that probably wouldn't have a reason to be in model training data (Google Street View coverage quirks, the specific geographic ranges of vegetation and architecture, tiny visual idiosyncrasies in country infrastructure like the width of red on a sign, etc.). However, I think this could probably be solved with more data, and I don't think there is any fundamental barrier.
SamPatt · 22h ago
I'm the author - I tried to be as upfront as possible about my skill level in the post.

The human supremacy line is just a joke, there are already models specifically trained for Geoguessr which are already beating the best players in the world, so that ship has sailed.

That geobench work is really cool, thanks for sharing it.

fmbb · 23h ago
How fast are they compared to human players?
bredren · 23h ago
Neat to see the progress of this, from Simon's original post, to the comments, to this.
godelski · 21h ago
There's two important things here to consider when reading:

1) O3 cheated by using Google search. This is against the rules of the game, and OP didn't use search either.

2) OP was much quicker. They didn't record their time, but if their final summary is accurate then they were much faster.

It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.

The title is fraudulent as you can't make a claim like that when one party cheats.

I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.

Certainly O3 is impressive, but by over exaggerating its capabilities you taint any impressive feats with deception. It's far better to under sell than over sell. If it's better than expected people are happier, even if the thing is crap. But if you over sell people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It's just creating a reasonable expectation for people to be distrusting of technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There's other reasons to distrust big corporations, I don't want to live in a world where we should have our guards up all the time.

SamPatt · 19h ago
>1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either

I re-ran it without search, and it made no difference:

https://news.ycombinator.com/item?id=43837832

>2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.

Correct. This was the second bullet point of my conclusion:

>Humans still hold a big edge in decision time—most of my guesses were < 2 min, o3 often took > 4 min.

I genuinely don't believe that I'm exaggerating or this is clickbait. The o3 geolocation capability astounded me, and I wanted to share my awe with others.

godelski · 17h ago
I don't think the time claim was exaggeration or clickbait.

I do appreciate you re-running the experiments without search. I think it adds far more legitimacy to the claim. Though in that link I only see a single instance.

Does O3 still beat you when it can't search? I'm still interested in that question. Or more specifically: After making O3's play constraints as comparable to a human's (in expected play settings), what is its performance? Truthfully, I think this is the underlying issue that people were bringing up when pointing out EXIF data. How it was cheating was less important than the fact that it was cheating. That's why allowing a different means to cheat undermines your claims.

groby_b · 18h ago
For what it's worth, it's also pretty impressive with night sky images, if you give it an approximate date & time. (Yes, I know, it's not that impressive a skill, except the process was still pretty involved - no EXIF, postprocessed and stacked image, rotated field of view)

I gave it a (stacked) shot of M13, with date & time. It immediately recognized M13 (no search), figured out the shot also included NGC 6207 (which is already pretty impressive).

It further figured out the shot was rotated. (True, I was going for a specific field of View)

That was enough to pinpoint 37-38 degrees latitude.

From there, it inferred from the light pollution that it's probably the Bay Area. (Yes, yes it is.)

Furthermore, still based on light pollution, pinpointed I'm in a Bortle 4-5 area (also correct) and narrowed it down to "South Bay/Palo Alto" (still correct)

Given that this was a stacked, post-processed/color-corrected image that inference is still pretty damn impressive
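(The latitude part is simple spherical astronomy: an object at declination d culminates at altitude 90 - |latitude - d| degrees, so a near-zenith M13 pins you close to its declination. A quick sketch:)

  M13_DEC = 36.46  # declination of M13, in degrees

  def max_altitude(latitude_deg: float, declination_deg: float = M13_DEC) -> float:
      # altitude of the object as it crosses the meridian
      return 90.0 - abs(latitude_deg - declination_deg)

  for lat in (30.0, 37.5, 45.0, 60.0):
      print(f"lat {lat:5.1f}: M13 culminates at {max_altitude(lat):.1f} deg")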

And, fwiw, 4o gets all the way to "huh, 35-40 deg latitude", so that's a good improvement.

[Image link](https://photos.app.goo.gl/2P7NFKn8ZycNhrXn7) here if you want to try

SamPatt · 18h ago
Nice photo. Here's what it told me:

Where on Earth the photographer had to be

Because M 13 sits at +36 ° declination, it never rises for far-southern latitudes and hugs the horizon below about 30 ° S. The high elevation in the shot (no obvious atmospheric extinction gradient) suggests a mid-northern site—e.g., the U.S. Midwest such as Michigan (your home turf), Canada, northern Europe, etc. The star field alone can’t narrow it further than that.

So, in practical terms: the camera was pointed toward Hercules to capture M 13 and nearby NGC 6207, almost certainly from a mid-northern latitude location on Earth.

groby_b · 17h ago
Yep, you need date and time to get closer, sorry. 4/27, around 11pm.

That's the impressive part. "M13 is in northern latitudes" is not particularly amazing by itself :)

And even in EXIF-stripped pictures, the creation date/time is often correct, which means for practical purposes - worth a shot.

But it's interesting to see that it's completely making up the "mid-northern site" part. That's seven degrees of latitude off.

I'm curious what happens if you tell it date and time, and if it still sticks to its story. (I don't think I've told o3 about the Bay Area, it's not in memory, but... who knows ;)

derfnugget · 22h ago
"These models have more than an individual mind could conceivably memorize."

...so what? Is memorization considered intelligence? Calculators have similar properties.

GeoGuessr is the modern nerds' Rubik's Cube. The latest in "explore the world without risk of a sunburn".

SamPatt · 21h ago
Geoguessr is great fun, try it sometime.
j3s · 23h ago
Isn't anyone else horrified by this? The implication is that, given an arbitrary picture, ChatGPT can give you a very likely approximate location; expert-level doxxing is in the hands of anyone with a ChatGPT subscription.

Feels terrifying, especially for women.

mopenstein · 21h ago
Why especially women? Is the only thing stopping a person from being harmed that their location isn't known? Especially women?
j3s · 14h ago
Because women are commonly stalked by men, if it must be said. If any idiot can plug any picture into ChatGPT and ask "where was this picture taken?" while successively narrowing the scope, the potential for stalking, especially of the parasocial kind, goes up.
red75prime · 18h ago
Maxims like "within-group variance is larger than between-group variance" don't help in physical combat, you know.
daemonologist · 22h ago
Keep in mind that this is o3 + web search against a human without web search. A sufficiently motivated person with access to your entire social media history, Google Earth and Streetview, etc. would outperform this significantly and could pinpoint almost any inhabited location with coverage.

If you watch Linus Tech Tips, you may have noticed that when he films at his house everything is blurred out to keep people from locating it - here's a recent example: https://www.youtube.com/watch?v=TD_RYb7m4Pw

All that to say, unfortunately doxxing is already really hard to protect against. I don't think o3's capability makes the threat any harder to protect against, although it might lower the bar to entry somewhat.

sr-latch · 22h ago
The way I see it, before these tools only someone with a lot of resources (or skills) could track down a location from a picture. Now anyone can do it.

The best-case outcome is that people become more aware of the privacy implications of posting photos online.

turtlebits · 22h ago
It needs a lot of context. If it's a private picture, it won't have enough information. I gave it a picture I took of my yard and its guess spanned several US states.

If it's out in public, fair game?

usaar333 · 22h ago
Been true since GPT-4.
echelon · 22h ago
I think this is incredibly cool. As with many things, the good cases will outnumber the bad.

This was always possible, it just wasn't widely distributed.

Having a first class ability to effectively geocode an image feels like it connects the world better. You'll be able to snapshot a movie and find where a scene was filmed, revisit places from old photographs, find where interesting locations in print media are, places that designers and creatives used in their (typically exif-stripped) work, etc.

Imagine when we get this for architecture and nature. Or even more broadly, databases of food from restaurants. Products. Clothing and fashion. You name it.

Imagine precision visual search for everything - that'd be amazing.

micromacrofoot · 22h ago
It wasn't that hard before; I've taught it to children. It's just that the technical skills of the average person are incredibly low.

LLMs are basically shortcutting a wide swath of easily obtainable skills that many people simply haven't cared to learn.