The Illusion of the Illusion of Thinking – A Comment on Shojaee et al. (2025)

15 points by gfortaine | 11 comments | 6/16/2025, 6:46:47 AM | arxiv.org ↗

Comments (11)

dr_dshiv · 4h ago
Pretty serious flaws in the original paper.

1. Scoring unsolvable challenges as incorrect

2. Not accounting for output token limits

3. Not allowing LLMs to write code as part of the solution.

I tend to see Apple’s paper as an excuse for not having competitive products.

throwfaraway4 · 3h ago
Sounds like confirmation bias in action
TIcomPOCL · 1h ago
- Token claim: The limit was 64k, and you can see in Apple's paper (figure 6) that the models hit at most ~20k tokens before the decline.

- Impossible river claim: Again in figure 6, you can see that performance declines before we even reach 5 actors, where the puzzle is still solvable. So while it wasn't necessary to test all the way to 20 actors, the results still indicate that impossibility doesn't explain the decline. (Rough solvability checker below if you want to verify where the impossibility boundary actually sits.)
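
For anyone who wants to check that boundary themselves, here is a rough brute-force sketch. I'm assuming the benchmark puzzle is the classic "jealous husbands" variant: N actor/agent pairs, a boat holding up to 3 people, and no actor may be with another pair's agent unless their own agent is present. The paper's exact encoding may differ.

    -- Brute-force BFS solvability check for the river-crossing puzzle.
    -- Assumes the "jealous husbands" rules described above; requires
    -- Lua 5.3+ for the bitwise operators.
    local N, CAP = 6, 3            -- try N = 5 vs N = 6 with a capacity-3 boat
    local full = (1 << N) - 1

    local function popcount(m)
      local c = 0
      while m > 0 do c = c + (m & 1); m = m >> 1 end
      return c
    end

    -- All submasks of mask, including 0 and mask itself.
    local function submasks(mask)
      local subs, s = {}, mask
      while true do
        subs[#subs + 1] = s
        if s == 0 then break end
        s = (s - 1) & mask
      end
      return subs
    end

    -- A bank (or the boat) is safe if no actor is present with a
    -- foreign agent while their own agent is absent.
    local function safe(actors, agents)
      if agents == 0 then return true end
      for i = 0, N - 1 do
        if (actors >> i) & 1 == 1 and (agents >> i) & 1 == 0 then
          return false
        end
      end
      return true
    end

    local function solvable()
      -- State: (actors on left bank, agents on left bank, boat side).
      local seen = { [full .. ":" .. full .. ":0"] = true }
      local queue, head = { { full, full, 0 } }, 1
      while head <= #queue do
        local a, g, b = table.unpack(queue[head]); head = head + 1
        if a == 0 and g == 0 then return true end   -- everyone crossed
        local availA = (b == 0) and a or (a ~ full) -- people on boat's bank
        local availG = (b == 0) and g or (g ~ full)
        for _, da in ipairs(submasks(availA)) do
          for _, dg in ipairs(submasks(availG)) do
            local n = popcount(da) + popcount(dg)
            if n >= 1 and n <= CAP and safe(da, dg) then
              local na = (b == 0) and (a ~ da) or (a | da)
              local ng = (b == 0) and (g ~ dg) or (g | dg)
              if safe(na, ng) and safe(na ~ full, ng ~ full) then
                local k = na .. ":" .. ng .. ":" .. (1 - b)
                if not seen[k] then
                  seen[k] = true
                  queue[#queue + 1] = { na, ng, 1 - b }
                end
              end
            end
          end
        end
      end
      return false
    end

    print(("N=%d, boat capacity %d: %s"):format(
      N, CAP, solvable() and "solvable" or "unsolvable"))

With those assumptions, N = 5 comes back solvable and N = 6 unsolvable, which matches the impossibility claim for N >= 6 while leaving the decline before 5 actors unexplained.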

MarkusQ · 2h ago
The people trying to show that LLMs don't think are working too hard. It's trivially easy, imho:

https://chatgpt.com/share/68504396-e300-800c-a7ff-dde5fe1572...

ForHackernews · 5h ago
Wait, is C. Opus just the Anthropic bot? Did I waste my time reading AI nonsense?
credit_guy · 4h ago
MarkusQ · 3h ago
Could be. Someone hallucinated the arXiv reference for the Apple paper.
mfro · 4h ago
> These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.

I would like to carefully design my response to this article with a downvote

ForHackernews · 5h ago
"5 Alternative Representations Restore Performance To test whether the failures reflect reasoning limitations or format constraints, we conducted preliminary testing of the same models on Tower of Hanoi N = 15 using a different representation: Prompt: "Solve Tower of Hanoi with 15 disks. Output a Lua function that prints the solution when called."

Results: Very high accuracy across tested models (Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5), completing in under 5,000 tokens.

The generated solutions correctly implement the recursive algorithm, demonstrating intact reasoning capabilities when freed from exhaustive enumeration requirement""

Is there something I'm missing here?

This seems to demonstrate the exact opposite of what the authors are claiming: yes, your bot is an effective parrot that can output a correct Lua program that exists somewhere in its training data. No, your bot is not "thinking" and cannot effectively reason through the algorithm itself.
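
For reference, here's roughly the sort of Lua program that prompt elicits; the models only have to reproduce the textbook recursion, not walk through any of the moves themselves (function and peg names are mine, not from the paper):

    -- Classic recursive Tower of Hanoi: move n disks from one peg to
    -- another, printing one move per line.
    local function hanoi(n, from, to, via)
      if n == 0 then return end
      hanoi(n - 1, from, via, to)   -- park n-1 disks on the spare peg
      print(("move disk %d: %s -> %s"):format(n, from, to))
      hanoi(n - 1, via, to, from)   -- bring them onto the target peg
    end

    hanoi(15, "A", "C", "B")        -- prints all 2^15 - 1 = 32767 moves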

TIcomPOCL · 4h ago
It seems to just re-illustrate the point that the model cannot follow algorithmic steps once it is out of distribution: writing the short recursive program is in-distribution, but actually executing its 2^15 - 1 = 32,767 moves step by step is not.
ForHackernews · 5h ago
> Recent reports have claimed that most 7th graders are unable to independently derive the Pythagorean Theorem; however, our analysis reveals that these apparent failures stem from experimental design choices rather than inherent student limitations.

> When given access to Google and prompted to "tell me how to find the length of the hypotenuse of a right triangle", a majority of middle-schoolers produced the correct Pythagorean Theorem, demonstrating intact reasoning capabilities when freed from the exhaustive comprehension requirement.