Language models aren't world models for the same reason languages aren't world models.
Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.
But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.
If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?
pron · 1m ago
> Symbols, by definition, only represent a thing. They are not the same as the thing
First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.
But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least the part of which we care about - why wouldn't they be representable with a finite set of symbols?
What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including that of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).
[1]: https://plato.stanford.edu/entries/church-turing/
> If we accept the incompleteness theorem
There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system.
Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.
Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.
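(Roughly, the statement being invoked; a sketch of the standard formulation, not the commenter's own wording:)

    \mathrm{Nat}\big(\mathrm{Hom}(A,-),\, F\big) \;\cong\; F(A),
    \quad\text{and in particular}\quad
    \mathrm{Hom}(A,-) \cong \mathrm{Hom}(B,-) \;\Longrightarrow\; A \cong B,

i.e. an object is pinned down, up to isomorphism, by its relationships to everything else.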
Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!
scarmig · 7m ago
> If we accept the incompleteness theorem
And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision. So if the incompleteness theorem means that neural nets can never find truth, it also means that the human brain can never find truth.
Human neuron firing patterns, after all, only represent a thing; they are not the same as the thing. Your experience of seeing something isn't recreating the physical universe in your head.
overgard · 5m ago
I don't think you can apply the incompleteness theorem like that, LLMs aren't constrained to formal systems
auggierose · 21m ago
First: true propositions (that are not provable) can definitely be expressed, if they couldn't, the incompleteness theorem would not be true ;-)
It would be interesting to know what percentage of the people who invoke the incompleteness theorem have no clue what it actually says.
Most people don't even know what a proof is, so that cannot be a hindrance on the path to AGI ...
Second: ANY world model that can be digitally represented would be subject to the same argument (if stated correctly), not only LLMs.
bithive123 · 15m ago
I knew someone would call me out on that. I used the wrong word; what I meant was "expressed in a way that would satisfy" which implies proof within the symbolic order being used. I don't claim to be a mathematician or philosopher.
auggierose · 9m ago
Well, you don't get it. The LLM definitely can state propositions "that satisfy", let's just call them true propositions, and that this is not the same as having a proof for it is what the incompleteness theorem says.
Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?
chamomeal · 15m ago
I’m not a math guy but the incompleteness theorem applies to formal systems, right? I’ve never thought about LLMs as formal systems, but I guess they are?
bithive123 · 7m ago
Nor am I. I'm not claiming an LLM is a formal system, but it is mechanical and operates on symbols. It can't deal in anything else. That should temper some of the enthusiasm going around.
exe34 · 35m ago
> Language models aren't world models for the same reason languages aren't world models.
> Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
There is a lot of negatives in there, but I feel like it boils down to a model of a thing is not the thing. Well duh. It's a model. A map is a model.
bithive123 · 9m ago
Right. It's a dead thing that has no independent meaning. It doesn't even exist as a thing except conceptually. The referent is not even another dead thing, but a reality that appears nowhere in the map itself. It may have certain limited usefulness in the practical realm, but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.
ameliaquining · 1h ago
One thing I appreciated about this post, unlike a lot of AI-skeptic posts, is that it actually makes a concrete falsifiable prediction; specifically, "LLMs will never manage to deal with large code bases 'autonomously'". So in the future we can look back and see whether it was right.
For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.
shinycode · 18m ago
« Autonomously »? What happens when subtle updates, which are not bugs but change the meaning of some features, break the workflow in some external part of a client's system? It happens all the time, and because it's really hard to keep the whole meaning and the business rules written down and up to date, an LLM might never be able to grasp some of that meaning.
Maybe if, instead of writing code and building infrastructure, the whole industry shifted toward writing impossibly precise spec sheets that make meaning and intent crystal clear, then « autonomously » might be possible to pull off.
moduspol · 57m ago
"Deal with" and "autonomously" are doing a lot of heavy lifting there. Cursor already does a pretty good job indexing all the files in a code base in a way that lets it ask questions and get answers pretty quickly. It's just a matter of where you set the goalposts.
ameliaquining · 46m ago
True, there'd be a need to operationalize these things a bit more than is done in the post to have a good advance prediction.
slt2021 · 4m ago
>LLMs will never manage to deal
time to prove hypothesis: infinity years
exe34 · 34m ago
How large? What does "deal" mean here? Autonomously - is that on its own whim, or at the behest of a user?
libraryofbabel · 2d ago
This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry picking what it gets wrong gets to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.
LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.
yosefk · 2d ago
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.
My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.
Let's see how my predictions hold up; I have made enough to look very wrong if they don't.
Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says
WillPostForFood · 1h ago
Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT and got a reasonable answer, nothing like what you paraphrased in your post.
This seems like a common theme with these types of articles
marcellus23 · 1h ago
I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.
p1esk · 1h ago
Yes, I’d be curious about his experience with GPT-5 Thinking model. So far I haven’t seen any blunders from it.
libraryofbabel · 2d ago
I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.
I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.
A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:
* LLMs have imperfect models of the world which is conditioned by how they’re trained on next token prediction.
* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. you kind of allude to this already by talking about how they’ve been “flogged” to be good at math.
* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)
* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.
I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.
Anyway thanks for writing this and responding!
yosefk · 2d ago
I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."
I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math)
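(For readers wondering what the blending question refers to: a sketch of the underlying fact, my reconstruction rather than the exact prompt, is that "over" compositing with premultiplied alpha is associative, which is what makes caching a composite of middle layers valid.)

    # Sketch (not the exact prompt from the post): premultiplied-alpha "over" is
    # associative, so a middle stack of layers can be composited once and cached.
    import numpy as np

    def over(top, bottom):
        """'Over' compositing; each layer is (premultiplied_color, alpha)."""
        c_t, a_t = top
        c_b, a_b = bottom
        return c_t + c_b * (1.0 - a_t), a_t + a_b * (1.0 - a_t)

    rng = np.random.default_rng(0)
    layers = [(rng.random(3) * a, a) for a in rng.random(4)]   # random premultiplied layers

    direct = over(layers[0], over(layers[1], over(layers[2], layers[3])))
    cached_mid = over(layers[1], layers[2])                    # cache the two middle layers
    with_cache = over(layers[0], over(cached_mid, layers[3]))

    print(np.allclose(direct[0], with_cache[0]), np.isclose(direct[1], with_cache[1]))  # True True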
calf · 12s ago
What is your reason that the training process does not cause world representations to form in an LLM? What about the well-known Othello paper from a year or two ago? They formed board game states or something. That's enough to be a very partial world model.
AyyEye · 2d ago
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
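(The "few gates" claim is literal; a minimal sketch in Python just to make it concrete:)

    # Addition really is a handful of gates: a 1-bit full adder is XOR/AND/OR,
    # and an n-bit adder is just that cell repeated (ripple carry).
    def full_adder(a, b, carry_in):
        s = a ^ b ^ carry_in
        carry_out = (a & b) | (carry_in & (a ^ b))
        return s, carry_out

    def add(x, y, bits=8):
        result, carry = 0, 0
        for i in range(bits):
            s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
            result |= s << i
        return result

    print(add(2, 2), add(57, 68))   # 4 125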
BobbyJo · 1d ago
The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.
It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.
In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
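(A toy sketch of that "proxy" problem, with a made-up vocabulary rather than any real tokenizer:)

    # Hypothetical merges; real BPE vocabularies differ, but the point stands:
    # the model receives opaque token IDs, not characters.
    vocab = {"blue": 1701, "berry": 2744, "b": 66, "l": 76, "u": 85,
             "e": 69, "r": 82, "y": 89}

    def tokenize(word):
        """Greedy longest-match over the toy vocab."""
        ids, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    ids.append(vocab[word[i:j]])
                    i = j
                    break
        return ids

    print(tokenize("blueberry"))    # [1701, 2744] -- no letter 'b' in sight
    print("blueberry".count("b"))   # 2 -- trivial once you can see the characters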
I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/
LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Two countries that are close together will be talked about together much more often than two countries far from each other.
xigoi · 8h ago
> It's like asking a blind person to count the number of colors on a car.
I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.
vrighter · 1d ago
That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.
Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.
williamcotton · 1h ago
I don’t solve math problems with my poetry writing skills:
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.
Is this a real defect, or some historical thing?
I just asked GPT-5:
How many "B"s in "blueberry"?
and it replied:
There are 2 — the letter b appears twice in "blueberry".
I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.
libraryofbabel · 2d ago
It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.
pydry · 2d ago
Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.
Perhaps they have a hot fix that special cases HN complaints?
AyyEye · 1d ago
They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.
nosioptar · 2d ago
Shouldn't the correct answer be that there is not a "B" in "blueberry"?
yosefk · 2d ago
Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think
libraryofbabel · 1d ago
> they clearly don't have any world model whatsoever
Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.
simiones · 12h ago
> where it certainly hadn’t seen the questions before?
What are you basing this certainty on?
And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.
It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.
The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.
frankfrank13 · 56m ago
Great quote at the end that I think I resonate a lot with:
> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.
o_nate · 1d ago
What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.
yosefk · 1d ago
thank you for your kind words!
keeda · 1d ago
That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.
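(Presumably something along these lines; a sketch of such a preprocessing step rather than the commenter's actual code, with "input.png" as a placeholder path:)

    # Flatten a BGRA PNG onto a white background before further processing.
    import cv2
    import numpy as np

    img = cv2.imread("input.png", cv2.IMREAD_UNCHANGED)    # placeholder path
    if img is not None and img.ndim == 3 and img.shape[2] == 4:
        alpha = img[:, :, 3:4].astype(np.float32) / 255.0
        bgr = img[:, :, :3].astype(np.float32)
        background = np.full_like(bgr, 255.0)               # assume a white backdrop
        img = (bgr * alpha + background * (1.0 - alpha)).astype(np.uint8)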
I even often describe the results e.g. "this fails when in X manner when the image has grainy regions" and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)
And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.
geraneum · 1h ago
> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency
Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what’s exactly in the training sets. I don’t know either. They don’t tell ;)
> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
We tend to overestimate the novelty of our own work and our methods and, at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn't mean that what you are doing has been done in this exact way before; rather, the patterns it adapts and the approach it takes may not be as unique as they seem.
> is a purely philosophical question
It is indeed. A question we need to ask ourselves.
ej88 · 2d ago
This article is interesting but pretty shallow.
0(?): there’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model, right?
1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board (or notation) and ask it about the position. It might not give you GM-level analysis, but it definitely has a model of what’s going on.
2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.
You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!
https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358
A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.
The examples are from all the major commercial American LLMs as listed in a sister comment.
You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.
deadbabe · 2d ago
Don’t: use LLMs to play chess against you
Do: use LLMs to talk shit to you while a real chess AI plays chess against you.
The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.
skeledrew · 1d ago
Agree in general with most of the points, except
> but because I know you and I get by with less.
Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.
helloplanets · 16m ago
Don't forget the millions of years of pre-training! ;)
lordnacho · 2d ago
Here's what LLMs remind me of.
When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.
Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.
But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.
In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.
In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.
This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, LLM is fantastic as it just glues together plausible related words for you to examine.
The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.
jonplackett · 2d ago
I just tried a few things that are simple and a world model would probably get right. Eg
Question to GPT5:
I am looking straight on to some objects. Looking parallel to the ground.
In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?
Answer (after choosing thinking mode)
No.
The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.
It doesn’t consider transparency until you mention it, then apologises and says it didn’t think of transparency
RugnirViking · 2d ago
this seems like a strange riddle. In my mind I was thinking that regardless of the glass, all of the objects can be seen (due to perspective, and also the fact you mentioned the locations, meaning you're aware of them).
It seems to me it would only actually work in an orthographic perspective, which is not how our reality works
jonplackett · 1d ago
You can tell from the response it does understand the riddle just fine, it just gets it wrong.
rpdillon · 1h ago
Have you asked five adults this riddle? I suspect at least two of them would get it wrong or have some uncertainty about whether or not the peanut was visible.
xg15 · 1h ago
This. Was also thinking "yes" first because of the glass of water, transparency, etc, but then got unsure: The objects might be spaced so widely that the milk or coke bottle would obscure the view due to perspective - or the peanut would simply end up outside the viewer's field of vision.
Shows that even if you have a world model, it might not be the right one.
optimalsolver · 8h ago
Gemini 2.5 Pro gets this correct on the first attempt, and specifically points out the transparency of the glass of water.
https://g.co/gemini/share/362506056ddb
Time to get the ol' goalpost-moving gloves out.
As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).
yosefk · 2d ago
ChatGPT, Claude, Grok and Google AI Overviews (whatever powers the latter) were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the first try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I think there never will be; the next real change will come from another major breakthrough.
imenani · 2d ago
Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.
red75prime · 2d ago
My hypothesis is that a model fails to switch into a deep thinking mode (if it has one) and blurts out whatever it got from all the internet data during autoregressive training. I tested it with the alpha-blending example: Gemini 2.5 Flash fails, Gemini 2.5 Pro succeeds.
How presence/absence of a world model, er, blends into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have it because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.
The current models lack "remember/use/update" parts.
red75prime · 1d ago
> I don't think there's any fundamental difference in the principle of their operation
Yeah, they seem to be subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).
That is, at a certain size they can do anything a human can do at a certain point in their life (that is, with no additional training), regardless of whether humans have world models and what those models are on the neuronal level.
But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.
lowsong · 2d ago
It doesn't matter. These limitations are fundamental to LLMs, so all of them that will ever be made suffer from these problems.
This is interesting. The "professional level" rating of <1800 isn't, but still.
However:
"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches
99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves,
reinforcing that continuous error correction and
learning the correct moves significantly improve ELO"
You should be able to reach the move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.
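(A sketch of how mechanically cheap legality is outside the model, using the python-chess library; my own example, not something taken from the paper:)

    # Legality is pure bookkeeping: track the position, enumerate legal moves.
    import chess

    board = chess.Board()
    for san in ["e4", "e5", "Nf3", "Nc6", "Bb5"]:
        board.push(board.parse_san(san))      # parse_san raises on an illegal move

    print(board.fen())                        # current position
    print(len(list(board.legal_moves)))       # number of legal replies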
rpdillon · 1h ago
> Failing to do so means that it has not learned a model of what chess is, at some basic level.
I'm not sure about this. Among a standard amateur set of chess players, how often when they lack any kind of guidance from a computer do they attempt to make a move that is illegal? I played chess for years throughout elementary, middle and high school, and I would easily say that even after hundreds of hours of playing, I might make two mistakes out of a thousand moves where the move was actually illegal, often because I had missed that moving that piece would continue to leave me in check due to a discovered check that I had missed.
It's hard to conclude from that experience that players that are amateurs lack even a basic model of chess.
lostmsu · 17h ago
> r4rk1/pp6/8/4p2Q/3n4/4N3/qP5P/2KRB3 w - - 3 27
Can you say 100% you can generate a good next move (example from the paper) without using tools, and will never accidentally make a mistake and give an illegal move?
SOTA LLMs do play legal moves in chess; I don't know why the article seems to say otherwise.
https://dynomight.net/more-chess/
1970-01-01 · 50m ago
I'm surprised the models haven't been enshittified by capitalism. I think in a few years we're going to see lightning-fast LLMs generating better output compared to what we're seeing today. But it won't be 1000x better, it will be 10x better, 10x faster, and completely enshittified with ads and clickbait links. Enjoy ChatGPT while it lasts.
This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...
og_kalu · 2d ago
Yes LLMs can play chess and yes they can model it fine
https://arxiv.org/pdf/2403.15498v2
King Frederick the Great of Prussia had a very fine army, and none of the soldiers in it were finer than the Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.
Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough German to be able to answer if the King questioned him.
Frederick, sometimes, used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?” “How long have you been in my army?” and “Are you satisfied with your food and your conditions?”
The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.
One day, however, the King asked a new soldier the questions in a different order. He began with, “How long have you been in my army?” The young soldier immediately answered, “Twenty-two years, Your Majesty”. Frederick was very surprised. “How old are you then?”, he asked the soldier. “Six months, Your Majesty”, came the answer. At this Frederick became angry. “Am I a fool, or are you one?” he asked. “Both, Your Majesty”, the soldier answered politely.
https://archive.org/details/advancedstoriesf0000hill
As in, an alien could teach one of our AIs their language faster than an alien could teach a human, and vice versa..
..though the potential for catastrophic disasters is also great there lol