LLMs Aren't World Models

31 points · ingve · 15 comments · 8/10/2025, 11:40:14 AM · yosefk.com

Comments (15)

libraryofbabel · 45m ago
This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry-picking what it gets wrong yields any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the International Mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists, and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.

AyyEye · 24m ago
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
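The "few gates" point can be made concrete: a one-bit full adder is three gates' worth of logic, and chaining them handles arbitrary widths. A minimal ripple-carry sketch in Python, standing in for the gate-level circuit (illustrative only, not from the comment):

```python
# One-bit full adder: sum = a ^ b ^ cin, carry-out = (a & b) | (cin & (a ^ b)).
# Chaining `bits` of them gives a ripple-carry adder.
def ripple_add(a: int, b: int, bits: int = 8) -> int:
    carry, result = 0, 0
    for i in range(bits):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i          # sum bit
        carry = (x & y) | (carry & (x ^ y))     # carry into next bit
    return result

print(ripple_add(2, 2))  # 4
```

The same circuit that a network allegedly "overfits" into a few nodes is this small when written down directly.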
andyjohnson0 · 36s ago
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

I just asked GPT-5:

    How many "B"s in "blueberry"?
and it replied:

    There are 2 — the letter b appears twice in "blueberry".
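For what it's worth, the ground truth here is a character-level operation that is trivial in code but obscured from the model by tokenization (a quick Python check, not part of the comment):

```python
word = "blueberry"
# Case-insensitive count, since the question asks about "B"s:
print(word.lower().count("b"))  # 2
```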
yosefk · 19m ago
Actually, I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think
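Tokenization is why letter-counting is a poor probe: the model sees subword tokens, not characters. A toy greedy longest-match tokenizer over a made-up vocabulary (purely illustrative; real BPE vocabularies and merge rules differ) shows how "blueberry" can reach the model with the individual letters never appearing as tokens:

```python
# Hypothetical subword vocabulary; real tokenizers learn theirs from data.
VOCAB = {"blue", "berry", "ber", "ry", "b", "e", "r", "y"}

def tokenize(word: str) -> list[str]:
    """Greedy longest-match over VOCAB; unknown characters pass through."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:  # no vocab entry matched; emit the raw character
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("blueberry"))  # ['blue', 'berry']
```

From the model's side, counting the b's in `['blue', 'berry']` requires knowing the spelling of each token, which is not directly in its input.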
yosefk · 34m ago
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.

My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.

Let's see how my predictions hold up; I have made enough to look very wrong if they don't.

Regarding "failure disproving success": it can't, but it can disprove a theory of how that success is achieved. And I have much better examples than the 2+2=4 one, which I cite as something that sorta works these days.

armchairhacker · 34m ago
Any suggestions from this literature?
GaggiX · 1m ago
https://www.youtube.com/watch?v=LtG0ACIbmHw

SOTA LLMs do play legal moves in chess; I don't know why the article seems to say otherwise.

imenani · 13m ago
As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).
yosefk · 7m ago
ChatGPT, Claude, Grok and Google AI Overviews (whatever powers the latter) were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the first try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I don't think there ever will be - not until there's another major breakthrough
og_kalu · 8m ago
Yes LLMs can play chess and yes they can model it fine

https://arxiv.org/pdf/2403.15498v2

rishi_devan · 25m ago
Haha. I enjoyed that Soviet-era joke at the end.
svantana · 16m ago
Yes, I hadn't heard that before. It's similar in spirit to this Norwegian folk tale about a deaf man guessing what someone is saying to him:

https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...

deadbabe · 19m ago
Don’t: use LLMs to play chess against you

Do: use LLMs to talk shit to you while a real chess AI plays chess against you.

The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.

t0md4n · 1h ago
yosefk · 1h ago
This is interesting. The "professional level" rating of <1800 isn't, but still.

However:

"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches 99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves, reinforcing that continuous error correction and learning the correct moves significantly improve ELO"

You should be able to reach move legality of around 100% with few resources spent on it. Failing to do so means the model has not learned, at some basic level, what chess is. There is virtually no challenge in making legal moves.
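To illustrate how cheap legality is: full chess legality fits in a few hundred lines (or one call to a library such as python-chess), and a single piece's move generation is a handful of lines. A sketch for a lone knight on an empty board (hypothetical helper, not from the paper under discussion):

```python
# Legal knight destinations from an algebraic square like "g1",
# ignoring other pieces (empty board).
def knight_moves(square: str) -> list[str]:
    file, rank = ord(square[0]) - ord("a"), int(square[1]) - 1
    deltas = [(1, 2), (2, 1), (2, -1), (1, -2),
              (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    out = []
    for df, dr in deltas:
        f, r = file + df, rank + dr
        if 0 <= f < 8 and 0 <= r < 8:  # stay on the board
            out.append(chr(f + ord("a")) + str(r + 1))
    return sorted(out)

print(knight_moves("g1"))  # ['e2', 'f3', 'h3']
```

A model that has internalized even this much of the game should essentially never emit an illegal move, which is the point about the 99.8% figure.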