LLMs' "simulated reasoning" abilities are a brittle mirage

93 blueridge 56 8/12/2025, 5:52:47 AM arstechnica.com ↗

Comments (56)

NitpickLawyer · 5h ago
> Without specification, we employ a decoder-only language model GPT2 (Radford et al., 2019) with a configuration of 4 layers, 32 hidden dimensions, and 4 attention heads.

Yeah, ok. The research is interesting and warranted, but writing an article about it, leading with conclusions gathered from toy models and implying this generalises to production LLMs, is useless.

We've been here before with small models. Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print: they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results.

Research in this area is good, and needed. Mainly to understand limitations, discover if there are any scale levels where "emergent" stuff appears and so on. But writing articles based on incipient research, based on tiny models is not worth the effort.
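
For a sense of scale, here is a minimal sketch of what that quoted toy configuration might look like if instantiated with Hugging Face's transformers library (the mapping onto GPT2Config's n_layer / n_embd / n_head fields is my assumption, not the paper's code):

```
from transformers import GPT2Config, GPT2LMHeadModel

# The toy setup quoted above: 4 layers, 32 hidden dimensions, 4 attention heads.
toy_config = GPT2Config(n_layer=4, n_embd=32, n_head=4)
toy_model = GPT2LMHeadModel(toy_config)

# Roughly a couple of million parameters, most of them in the token embeddings.
# For comparison, GPT-2 XL alone uses 48 layers and 1600 hidden dims (~1.5B params).
print(f"toy model parameters: {toy_model.num_parameters():,}")
```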

willvarfar · 4h ago
Doing analysis on small models or small data is perfectly valid if the results extrapolate to large models. Which is why right now we're looking at new research papers that are still listing the same small datasets and comparing to the same small models that papers five years ago did.
NitpickLawyer · 4h ago
I have nothing against researching this, I think it's important. My main issue is with articles choosing to grab a "conclusion" and imply it extrapolates to larger models, without any support for that. They are going for the catchy title first, fine-print be damned.
willvarfar · 4h ago
I was just at the KDD conference and the general consensus agreed with this paper. There was only one keynoter who just made the assumption that LLMs are associated with reasoning, which was jarring as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead.

The thing is, I think the current companies making LLMs are _not_ trying to be correct or right. They are just trying to hide being wrong better. In the business future for AI, the coding stuff that we focus on here on HN - how AI can help/impact us - is just a sideline.

The huge-money business future of LLMs is end consumers, not creators, and it is product and opinion placement; their path to that is friendship. They want their assistant to be your friend, then your best friend, then your only friend, then your lover. If the last 15 years of social media have been about discord and polarisation to get engagement, the next 15 will be about friendship and love, even though that leads to isolation.

None of this needs the model to grow strong reasoning skills. That's not where the real money is. And CoT - whilst super great - is just as effective if it hides better that it's giving you the wrong answer (by being more internally consistent) as it is if it's giving you a better answer.

mdp2021 · 4h ago
> None of this needs the model to grow strong reasoning skills. That's not where the real money is

"And the world is more and more complex, and the administrations are less and less prepared"

(~~ Henry Kissinger)

bsaul · 2h ago
"as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead"

Do you have a link to the video of that talk?

calf · 1h ago
As to general consensus, Hinton gave a recent talk, and he seemed adamant that neural networks (which LLMs are) really are doing reasoning. He gives his reasons for it. Is Hinton considered an outlier or?
refulgentis · 4h ago
Not sure what all this is about, I somewhat regret taking a break from coding with LLMs to have it explained to me that it's all a mirage and a secret and sloppy plan for getting me an automagic egirl or something. ;)
intended · 9m ago
The point being made doesn’t impact people who can find utility from LLM output.

It’s only when you need to apply it to domains outside of code, or a domain where it needs to actually reason, that it becomes an issue.

XenophileJKO · 3h ago
Right? Oh, this fairly novel solution to the problem I was having that works and is well tested. Oh, throw it away.. sorry, the model can't think of stuff..

Back to square one!!

kazinator · 3h ago
Because model size is a trivial parameter, and not a new paradigm.

What you're saying is like, you can't extrapolate that long division works on 100 digit numbers because you only worked through it using 7 digit numbers and a few small polynomials.

barrkel · 1h ago
Alas, not true. It would be easier to predict progress if so.
exe34 · 1h ago
This is 100% how it doesn't work with LLMs.

logicchains · 2h ago
The extrapolation doesn't work if the transformer is too shallow (too few layers) relative to sequence length, because of https://arxiv.org/abs/2503.03961 . A bunch of tasks become unfeasible when the layer count is too low, and 4 layers is way too low. I.e. linearly increasing the number of layers in a model can result in a superlinear increase in performance on tasks like reasoning.
OtherShrezzing · 3h ago
I think it is worth writing about simply because it might get the (cost constrained) researcher’s work in front of someone who has the near-unlimited research budgets at one of the big AI companies.
suddenlybananas · 4h ago
>Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no-one red the fine-print, they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results

You're conflating two very different things. Training on synthetic data one time is very different than cyclically training models on their own data. It has nothing to do with model size.

NitpickLawyer · 4h ago
Perhaps I worded it poorly. My main point was that articles focus on the wrong thing. Most coverage of that paper was "Using LLM generated data leads to CATASTROPHIC collapse", without reading the fine print.

> [...] cyclically training models on their own data. It has nothing to do with model size.

Of course it does. GRPO is basically "training models on their own data". You sample, you check against a known truth, you adapt the weights. Repeat. And before GRPO there was RLAIF, which showed improving scores over 3 "stages" of generate - select - re-train. With diminishing returns after 3 stages, but no catastrophic collapse.
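
To illustrate that sample-check-adapt loop, here is a self-contained toy sketch (a weighted choice over candidate answers stands in for the model; this is not the actual GRPO objective):

```
import random

# Toy version of "training on your own outputs" with a ground-truth check:
# sample from the current "policy", reward samples that match the known answer,
# and reinforce them. The truth check is what keeps this from collapsing.
candidates = ["4", "5", "22"]                 # candidate answers to "2 + 2 = ?"
weights = {c: 1.0 for c in candidates}        # stand-in for model weights
ground_truth = "4"

for step in range(500):
    sample = random.choices(candidates, weights=[weights[c] for c in candidates])[0]
    reward = 1.0 if sample == ground_truth else 0.0   # check against known truth
    weights[sample] *= 1.0 + 0.05 * reward            # adapt: reinforce success

print(weights)  # the correct answer's weight ends up dominating
```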

My main point was about articles cherry-picking catchy phrases, not criticising research. We need the research. But we also need good articles that aren't written just because negativity sells.

cheeky edit: see this thread [1]. I know slashdot has fallen a lot in recent years, but I skimmed the root comments. Not one addresses the "toy" model problem. Everyone reads the title and reinforces their own biases. That's the main problem I was trying to address.

1 - https://slashdot.org/story/25/08/11/2253229/llms-simulated-r...

suddenlybananas · 4h ago
If you have a ground truth that you're comparing to, that's not training on your own data.
tankenmate · 4h ago
"Training on synthetic data one time is very different than cyclically training models on their own data.", but every one with even a modicum of understanding of feedback knows that cyclic training on its own output will end in tears; it's bordering on a tautologic inverse.
kazinator · 3h ago
> conclusions gathered from toy models and implying this generalises to production LLMs is useless

You are just trotting out the tired argument that model size magically fixes the issues, rather than just improving the mirage, and so nothing can be known about models with M parameters by studying models with N < M parameters.

Given enough parameters, a miraculous threshold is reached whereby LLMs switch from interpolating to extrapolating.

Sure!

ricardobeat · 3h ago
That’s what has been seen in practice though. SOTA LLMs have been shown again and again to solve problems unseen in their data set; and despite their shortcomings they have become extremely useful for a wide variety of tasks.
loosetypes · 3h ago
Mind linking any examples (or categories) of problems that are definitively not in pre training data but can still be solved by LLMs? Preferably something factual rather than creative, genuinely curious.

Dumb question but anything like this that’s written about on the internet will ultimately end up as training fodder, no?

syllogism · 4h ago
It's interesting that there's still such a market for this sort of take.

> In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text."

What does this even mean? Let's veto the word "reasoning" here and reflect.

The LLM produces a series of outputs. Each output changes the likelihood of the next output. So it's transitioning in a very large state space.

Assume there exist some states that the activations could be in that would cause the correct output to be generated. Assume also that there is some possible path of text connecting the original input to such a success state.

The reinforcement learning objective reinforces pathways that were successful during training. If there's some intermediate calculation to do or 'inference' that could be drawn, writing out text that makes it explicit might be a useful step. The reinforcement learning objective is supposed to encourage the model to learn such patterns.

So what does "sophisticated simulators of reasoning-like text" even mean here? The mechanism that the model uses to transition towards the answer is to generate intermediate text. What's the complaint here?
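
To make that mechanism concrete, here is a toy sketch of "intermediate text as state transitions" (all probabilities invented for illustration): from the bare prompt the correct answer is unlikely, but emitting an intermediate step first moves the process into a state from which the answer is the most likely continuation.

```
# Invented next-token distributions; keys are "states" (the text so far).
next_probs = {
    "Q: 17*24=?": {
        "step: 17*24 = 17*20 + 17*4": 0.80,  # intermediate text
        "408": 0.05,                         # correct, but unlikely from here
        "300": 0.15,
    },
    "Q: 17*24=? step: 17*24 = 17*20 + 17*4": {
        "408": 0.90,                         # correct answer now dominates
        "300": 0.10,
    },
}

def greedy_generate(state: str) -> str:
    while True:
        token = max(next_probs[state], key=next_probs[state].get)
        if token.startswith("step:"):
            state = f"{state} {token}"  # intermediate output changes the state...
        else:
            return token                # ...from which the answer is reachable

print(greedy_generate("Q: 17*24=?"))  # -> "408", reached only via the intermediate step
```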

It makes the same sort of sense to talk about the model "reasoning" as it does to talk about AlphaZero "valuing material" or "fighting for the center". These are shorthands for describing patterns of behaviour, but of course the model doesn't "value" anything in a strictly human way. The chess engine usually doesn't see a full line to victory, but in the games it's played, paths which transition through states with material advantage are often good -- although it depends on other factors.

So of course the chain-of-thought transition process is brittle, and it's brittle in ways that don't match human mistakes. What does it prove that there are counter-examples with irrelevant text interposed that cause the model to produce the wrong output? It shows nothing --- it's a probabilistic process. Of course some different inputs lead to different paths being taken, which may be less successful.

wzdd · 1h ago
> The mechanism that the model uses to transition towards the answer is to generate intermediate text.

Yes, which makes sense, because if there's a landscape of states that the model is traversing, and there are probabilistically likely pathways between an initial state and the desired output, but there isn't a direct pathway, then training the model to generate intermediate text in order to move across that landscape so it can reach the desired output state is a good idea.

Presumably LLM companies are aware that there is (in general) no relationship between the generated intermediate text and the output, and the point of the article is that by calling it a "chain of thought" rather than "essentially-meaningless intermediate text which increases the number of potential states the model can reach" users are misled into thinking that the model is reasoning, and may then make unwarranted assumptions, such as that the model could in general apply the same reasoning to similar problems, which is in general not true.

skywhopper · 46m ago
So, you agree with the point that they’re making and you’re mad about it? It’s important to state that the models aren’t doing real reasoning because they are being marketed and sold as if they are.

As for your question: ‘So what does "sophisticated simulators of reasoning-like text" even mean here?’

It means CoT interstitial “reasoning” steps produce text that looks like reasoning, but is just a rough approximation, given that the reasoning often doesn’t line up with the conclusion, or the priors, or reality.

syllogism · 29m ago
What is "real reasoning"? The mechanism that the models use is well described. They do what they do. What is this article's complaint?
intended · 4m ago
“the mechanism the models use is well described”

Vs

Total AI capex in the past 6 months was greater than US consumer spending

Or

AGI is coming

Or

AI Agents will be able to do most white collar work

——

The paper is addressing parts of the conversation and expectations of AI that are in the HYPE quadrant. There’s money riding on the idea that AI is going to begin to reason reliably. That it will work as a ghost in the machine.

bubblyworld · 3h ago
Not sure why everyone is downvoting you as I think you raise a good point - these anthropomorphic words like "reasoning" are useful as shorthands for describing patterns of behaviour, and are generally not meant to be direct comparisons to human cognition. But it goes both ways. You can still criticise the model on the grounds that what we call "reasoning" in the context of LLMs doesn't match the patterns we associate with human "reasoning" very well (such as ability to generalise to novel situations), which is what I think the authors are doing.
drawfloat · 7m ago
""Sam Altman says the perfect AI is “a very tiny model with superhuman reasoning".""

It is being marketed as directly related to human reasoning.

zerof1l · 1h ago
> ... that these "reasoning" models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.

I have encountered this problem numerous times now. It really makes me believe that the models do not really understand the topic, not even the basics; they just try to predict the text.

One recent example was me asking the model to fix my docker-compose file. In it, there's `network: host` under the `build` section. The model kept assuming that the container would be running on the host network and kept asking me to remove it as a way to fix my issue, even though that wouldn't do anything for the running container, because the container runs on the `custom_net` network only. The model was obsessed with it and kept telling me to remove it until I explicitly told it that this is not, and cannot be, the issue.

```
services:
  app:
    build:
      network: host   # build-time only: the network used while building the image
    networks:
      custom_net:     # the runtime network the container actually attaches to
    ...
```
afro88 · 1h ago
I have a real world problem I gave o1 when it came out and it got it quite wrong. It's a scheduling problem with 4 different constraints that vary each day, and success criteria that need to be fulfilled over the whole week.

GPT-5 Thinking (Think Longer) and Opus 4.1 Extended Thinking both get it right.

Maybe this unique problem is somehow a part of the synthetic training data? Or maybe it's not and the paper is wrong? Either way, we have models that are much more capable of solving unique problems today.

mirekrusin · 5h ago
Hold on, their evaluation tasks are based on rotating letters in text? Isn't this a known weak area for token-based models?
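
For reference, a minimal sketch of the kind of letter-rotation transform being described (assuming a ROT-style cyclic alphabet shift; the paper's exact task setup may differ):

```
def rotate_letters(text: str, shift: int) -> str:
    """Cyclically shift each letter by `shift` positions (ROT-style)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Trivial at the character level, but awkward for models that only see subword tokens.
print(rotate_letters("hello", 2))  # -> "jgnnq"
```
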
Terr_ · 4h ago
I think that's the point, really: It's a reliable and reproducible weakness, but also one where the model can be trained to elicit impressive-looking "reasoning" about what the problem is and how it "plans" to overcome it.

Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.

Kind of like a Chinese Room scenario: If the other end appears to talk about algebra perfectly well, but just can't do it, that's evidence you might be talking to a language-lookup machine instead of one that can reason.

hooskerdu · 4h ago
Reminds me of a number of grad students I knew who could “talk circles” around all sorts of subjects but failed to ever be able to apply anything.
Terr_ · 4h ago
Heh, but just because a human can fail at something doesn't mean everything that fails at it is human. :p
boredhedgehog · 3h ago
> Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.

That doesn't follow if the weakness of the model manifests on a different level, one we wouldn't consider part of rationality in a human.

For example, a human might have dyslexia, a disorder on the perceptive level. A dyslexic can understand and explain his own limitation, but that doesn't help him overcome it.

moi2388 · 2h ago
“ the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data.”

Why? If it’s out of domain we know it’ll fail.

podgorniy · 35m ago
> Why? If it’s out of domain we know it’ll fail.

To see whether LLMs adhere to logic, or whether the observed "logical" responses are rather a reproduction of patterns.

I personally enjoy this idea of isolating "logic" from "pattern" and seeing whether "logic" will manifest in LLM "thinking" in a "non-patternized" domain.

--

Also, it never hurts to give the public proof that "thinking" (like "intelligence") in the AI context isn't the same thing we think of intuitively.

--

> If it’s out of domain we know it’ll fail.

Below is a question which is out of domain. Yet LLMs handle it in what appears to be a logical way.

``` Kookers are blight. And shmakers are sin. If peker is blight and sin who is he? ```

It is out of domain and it does not fail (I've put it through Gemini 2.5 with thinking). Now back to the article. Is the observed logic intrinsic to LLMs, or is it an elaborate form of a pattern? According to the article, it's a pattern.

Octoth0rpe · 1h ago
I don't think we know that it'll fail, or at least that is not universally accepted as true. Rather, there are claims that given a large enough model / context window, such capabilities emerge. I think skepticism of that claim is warranted. This research validates that skepticism, at least for certain parameters (model family/size, context size, etc).
Gusarich · 5h ago
The article already seems outdated on the first day. The key points about SFT are irrelevant in the era of RL.
acosmism · 5h ago
remind me in 2 days
Frieren · 5h ago
This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.

LLMs have a large knowledge base that can be spit out at a moment's notice. But they have zero insight into its contents, even when the information has just been asked about a few lines before.

Most of the "intelligence" that LLMs show is just the correct questions, asked in the correct way, mirrored back to the user. That is why there is so much advice on how to do "proper prompting".

That, and the fact that most questions have already been asked before, as anyone who spent some time on StackOverflow back in the day realized. And memory, not reasoning, is what is needed to answer them.

PeterStuer · 4h ago
Please don't tell me you were one of those marking every SO question as a duplicate, more often than not missing the entire nuance in the question that made it not a duplicate at all, with the answers to the so-called previously-asked question utterly unusable?

This was one of those infuriating things that drove so many away from SO, ready to jump ship the second there was an alternative.

antihipocrat · 4h ago
I'm not sure why duplicates were ever considered an issue. For certain subjects (like JS), things evolved so quickly during the height of SO that even a year-old answer was outdated.

That and search engines seemed to promote more recent content.. so an old answer sank under the ocean of blog spam

ceejayoz · 3h ago
SO wanted to avoid being a raw Q&A site in favor of something more like a wiki.

If a year-old answer on a canonical question is now incorrect, you edit it.

MoreQARespect · 2h ago
That's a valid goal, but they should have adapted the software to the community instead of trying to adapt the community to the software.

SO's biggest asset was its community and while they treated it with some respect in the beginning they took it for granted and trashed it later.

ceejayoz · 2h ago
I think this policy was, in large part, intended to respect the user base, who get exhausted answering the same question over and over.

I do agree they later trashed that relationship with the Monica incident and AI policies.

PeterStuer · 2h ago
But the answer has not become incorrect. It is still correct for that question in that specific context. More likely, the 'canonicalization process' was overly coarse (for SEO?), inconsistent and confused.
moi2388 · 2h ago
Then they should have made a wiki instead of a Q&A site
ceejayoz · 1h ago
They did, really. That's why I can edit anyone else's questions and answers.
Frieren · 3h ago
I was "playing" the gamification part of StackOverflow. I wanted to ask a good question for points. But it was very difficult because any meaningful question had already been asked. It was way easier to find questions to answer.
ceejayoz · 4h ago
Every time I ask people for an example of this, and get one, I agree with the duplicate determination. Sometimes it requires a little skimming of the canonical answers past just the #1 accepted one; sometimes there's a heavily upvoted clarification in a top comment, but it's usually pretty reasonable.
jongjong · 2h ago
I've used LLMs to generate code for a custom serverless framework which I wrote from scratch and which they had never seen before. The framework follows some industry conventions but applies them in a distinct way, with some distinct features which I have not yet encountered in any other framework...

I'm willing to accept that maybe LLMs cannot invent entirely new concepts but I know for a fact that they can synthesize and merge different unfamiliar concepts in complex logical ways to deliver new capabilities. This is valuable on its own.

thisisauserid · 2h ago
(in mice)
Martin_Silenus · 4h ago
If only we could train people like that to see their reasoning output...
floppiplopp · 5h ago
'Chain-of-thought AI "degrades significantly" when asked to generalize beyond training.' - yeah thanks Captain Obvious.