Seven replies to the viral Apple reasoning paper and why they fall short

300 points by spwestwood | 227 comments | 6/14/2025, 7:52:38 PM | garymarcus.substack.com


thomasahle · 14h ago
> 1. Humans have trouble with complex problems and memory demands. True! But incomplete. We have every right to expect machines to do things we can't. [...] If we want to get to AGI, we will have to do better.

I don't get this argument. The paper is about "whether RLLMs can think". If we grant "humans make these mistakes too", but also "we still require this ability in our definition of thinking", aren't we saying "thinking in humans is an illusion" too?

briandw · 4h ago
Humans use tools to extend their abilities. LLMs can do the same. In this paper they didn't allow tool use. When others gave the Tower of Hanoi task to LLMs with tool use, like a Python env, they were able to complete the task.
xienze · 2h ago
But the Tower of Hanoi can be solved without "tools" by humans, simply by understanding the problem, thinking about the solution, and writing it out. Having the LLM shell out to a Python program that it "wrote" (or rather, "pasted", since surely a Python solution to the Tower of Hanoi was part of its training set) is akin to a human Googling "program to solve Tower of Hanoi", copy-pasting and running the solution. Yes, the LLM has "reasoned" that the solution to the problem is to call out to a solution that it "knows" is out there, but that's not really "thinking" about how to solve a problem in the human sense.

What happens when some novel Tower of Hanoi-esque puzzle is presented and there's nothing available in its training set to reference as an executable solution? A human can reason about and present a solution, but an LLM? Ehh...

DiogenesKynikos · 2h ago
LLMs are perfectly capable of writing code to solve problems that are not in their training set. I regularly ask LLMs to write code for niche problems whose answers you won't find just by Googling. The LLMs usually get it right.
xienze · 2h ago
> LLMs are perfectly capable of writing code to solve problems that are not in their training set.

Examples of these problems? You'll probably find that they're simply compositions of things already in the training set. For example, you might think that "here's a class containing an ID field and foobar field. Make a linked list class that stores inserted items in reverse foobar order with the ID field breaking ties" is something "not in" the training set, but it's really just a composition of the "make a linked list class" and "sort these things based on a field" problems.
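
For concreteness, a rough sketch of that composed task (purely illustrative; names like Item and ReverseFoobarList are made up here, not from the thread): a singly linked list that keeps items in descending foobar order, with id breaking ties. Each piece (node insertion, keyed ordering) is a standard pattern; only the combination is "new".

    from dataclasses import dataclass

    @dataclass
    class Item:
        id: int
        foobar: float

    class ReverseFoobarList:
        class _Node:
            def __init__(self, item, nxt=None):
                self.item, self.next = item, nxt

        def __init__(self):
            self.head = None

        def insert(self, item):
            # Keep the list ordered by (-foobar, id): higher foobar first, lower id breaks ties.
            key = (-item.foobar, item.id)
            prev, cur = None, self.head
            while cur and (-cur.item.foobar, cur.item.id) <= key:
                prev, cur = cur, cur.next
            node = self._Node(item, cur)
            if prev is None:
                self.head = node
            else:
                prev.next = node

        def __iter__(self):
            cur = self.head
            while cur:
                yield cur.item
                cur = cur.next

    lst = ReverseFoobarList()
    for it in (Item(1, 2.0), Item(2, 5.0), Item(3, 5.0)):
        lst.insert(it)
    print([i.id for i in lst])  # [2, 3, 1]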

FINDarkside · 13h ago
Agreed. But his point about AGI is also incorrect. AI that performs at the level of an average human in every task is AGI by definition.
dvfjsdhgfv · 1h ago
The Hanoi Towers example demonstrates that SOTA RLMs struggle with tasks a pre-schooler solves.

The implication here is that they excel at things that occur very often and are bad at novelty. This is good for individuals (by using RLMs I can quickly learn about many other areas of the human body of knowledge in a way that would be impossible or inefficient with traditional methods), but they are bad at innovation. Which, honestly, is not necessarily bad: we can offload lower-level tasks[0] to RLMs and pursue innovation as humans.

[0] Usual caveats apply: with time, the population of people actually good at these low-level tasks will diminish, just as we now have very few assembly programmers for Intel/AMD processors.

pzo · 6h ago
Why does AGI even need to be as good as an average human? Someone with an 80 IQ is still smart enough to reason and do plenty of menial tasks. Also, I'm not sure why AGI needs to be as good in every task. An average human will outdo others at a few tasks and be terrible at many others.
Someone · 6h ago
Because that’s how AGI is defined. https://en.wikipedia.org/wiki/Artificial_general_intelligenc...: “Artificial general intelligence (AGI)—sometimes called human‑level intelligence AI—is a type of artificial intelligence that would match or surpass human capabilities across virtually all cognitive tasks”

But yes, you're right that software need not be AGI to be useful. Artificial narrow intelligence or weak AI (https://en.wikipedia.org/wiki/Weak_artificial_intelligence) can be extremely useful, even something as narrow as a service that transcribes speech and can't do anything else.

simonw · 13h ago
That very much depends on which AGI definition you are using. I imagine there are a dozen or so variants out there. See also "AI" and "agents" and (apparently) "vibe coding" and pretty much every other piece of jargon in this field.
FINDarkside · 13h ago
I think it's a very widely accepted definition, and there are really no competing definitions either as far as I know. While some people might think AGI means superintelligence, it's only because they've heard the term but never bothered to look up what it means.
simonw · 11h ago
OpenAI: https://openai.com/index/how-should-ai-systems-behave/#citat...

"By AGI, we mean highly autonomous systems that outperform humans at most economically valuable work."

AWS: https://aws.amazon.com/what-is/artificial-general-intelligen...

"Artificial general intelligence (AGI) is a field of theoretical AI research that attempts to create software with human-like intelligence and the ability to self-teach. The aim is for the software to be able to perform tasks that it is not necessarily trained or developed for."

DeepMind: https://arxiv.org/abs/2311.02462

"Artificial General Intelligence (AGI) is an important and sometimes controversial concept in computing research, used to describe an AI system that is at least as capable as a human at most tasks. [...] We argue that any definition of AGI should meet the following six criteria: We emphasize the importance of metacognition, and suggest that an AGI benchmark should include metacognitive tasks such as (1) the ability to learn new skills, (2) the ability to know when to ask for help, and (3) social metacognitive abilities such as those relating to theory of mind. The ability to learn new skills (Chollet, 2019) is essential to generality, since it is infeasible for a system to be optimized for all possible use cases a priori [...]"

The key difference appears to be around self-teaching and meta-cognition. The OpenAI one shortcuts that by focusing on "outperform humans at most economically valuable work", but others make that ability to self-improve key to their definitions.

Note that you said "AI that will perform on the level of average human in every task" - which disagrees very slightly with the OpenAI one (they went with "outperform humans at most economically valuable work"). If you read more of the DeepMind paper it mentions "this definition notably focuses on non-physical tasks", so their version of AGI does not incorporate full robotics.

bluefirebrand · 10h ago
Doesn't the "G" in AGI stand for "General" as in "Generally Good at everything"?
neom · 8h ago
I think the G is what really screws things up. I thought it meant "as good as the general human", but upon googling, it has a defined meaning among researchers. There appears to be confusion all over the place tho.

General-Purpose (Wide Scope): It can do many types of things.

Generally as Capable as a Human (Performance Level): It can do what we do.

Possessing General Intelligence (Cognitive Mechanism): It thinks and learns the way a general intelligence does.

So, for researchers, general intelligence is characterized by: applying knowledge from one domain to solve problems in another, adapting to novel situations without being explicitly programmed for them, and having a broad base of understanding that can be applied across many different areas.

adastra22 · 6h ago
Yes, but “good at” here has a very limited, technical meaning, which can be oversimplified as “better than random chance.”

If something can be better than random chance in any arbitrary problem domain it was not trained on, that is AGI.

math_dandy · 10h ago
I was hoping the accepted definition would not use humans as a baseline, but rather that humans would be an (or the) example of AGI.
thomasahle · 5h ago
The argument of (1) doesn't really have anything to do with humans or anthropomorphising. We're not even discussing AGI, we're just talking about the property of "thinking".

If somebody claims "computers can't do X, hence they can't think", a valid counterargument is "humans can't do X either, but they can think."

It's not important for the rebuttal that we used humans. Just that there exist entities that don't have property X but are able to think. This shows X is not required for our definition of "thinking".

bastawhiz · 10h ago
The A in AGI is "artificial" which sort of precludes humans from being AGI (unless you have a very unconventional belief about the origin of humans).

Since there's not really a whole lot of unique examples of general intelligence out there, humans become a pretty straightforward way to compare.

xeonmc · 8h ago
> unless you have a very unconventional belief about the origin of humans

Not so unconventional in many cultures.

bastawhiz · 8h ago
Certainly many cultures and religions believe in some flavor of intelligent design, but you could argue that if the natural world (for what we generally regard as "the natural world") is created by the same entity or entities that created humans, that doesn't make humans artificial. Ignoring the metaphysical (souls and such) I'm struggling to think of a culture that believes the origin of humans isn't shared by the world.

In this case, I was thinking of unusual beliefs like aliens creating humans or humans appearing abruptly from an external source such as through panspermia.

usef- · 6h ago
Yes. I wonder if he was thinking of ASI, not AGI
gylterud · 4h ago
ASI meaning Artificial Super Intelligence, I guess.
adastra22 · 6h ago
Most people are. One of my pet peeves is that people falsely equate AGI with ASI, constantly. We have had full AGI for years now. It is a powerful tool, but not what people tend to think of as god-like “AGI.”
mathgradthrow · 11h ago
The average human is good at something and sucks at almost everything else. Top human performance at chess and average performance at chess differ by 7 orders of magnitude.
datadrivenangel · 9h ago
Your standard model of human needs a little bit of fine tuning for most games.
jltsiren · 9h ago
AGI should perform on the level of an experienced professional in every task. The average human is useless for pretty much everything but capable of learning to perform almost any task, given enough motivation and effort.

Or perhaps AGI should be able to reach the level of an experienced professional in any task. Maybe a single system can't be good at everything, if there are inherent trade-offs in learning to perform different tasks well.

godelski · 9h ago
For comparison, the average person can't print Hello World in Python. Your average programmer (probably) can.

It's surprisingly simple to be above average at most tasks, which people often confuse with having expertise. It's probably pretty easy to get into the 80th percentile of most subjects. That won't make you the 80th percentile of people who actually do the thing, because most people don't do it at all. I'd wager the 80th percentile is still amateur.

MoonGhost · 6h ago
> The average human is useless for pretty much everything but capable of learning to perform almost any task

But only a limited number of tasks per human.

> Or perhaps AGI should be able to reach the level of an experienced professional in any task.

Even if it performs only a bit better than an untrained human, doing so on any task would already be a superhuman level, as no human can do that.

jltsiren · 6h ago
The G in AGI stands for "general", not for "superhuman". An intelligence that can't learn to perform information processing and decision-making tasks people routinely do does not seem very general to me.
whatagreatboy · 6h ago
The real mark of intelligence is the ability to correct mistakes in a gradual and consistent way.
autobodie · 14h ago
Agree. Both sides of the argument are unsatisfying. They seem like quantitative answers to a qualitative question.
serbuvlad · 13h ago
"Have we created machines that can do something qualitatevely similar to that part of us that can correlate known information and pattern recognition to produce new ideas and solutions to problems -- that part we call thinking?"

I think the answer to this question is certainly "Yes". I think the reason people deny this is because it was just laughably easy in retrospect.

In mid-2022 people were like, "Wow, this GPT-3 thing generates kind of coherent greentexts."

Since then, all we really got was: larger models, larger models, search, agents, larger models, chain-of-thought, and larger models.

And from a novelty toy we got a set of tools that at the very least massively increase human productivity in a wide range of tasks and certainly pass any Turing test.

Attention really was all you needed.

But of course, if you ask a buddhist monk, he'll tell you we are attention machines, not computation machines.

He'll also tell you, should you listen, that we have a monkey in our mind that is constantly producing new thoughts. This monkey is not who we are; it's an organ. Its thoughts are not our thoughts. It's something we perceive, and something we shouldn't identify with.

Now we have thought-generating monkeys with jet engines and adrenaline shots.

This can be good. Thought-generating monkeys put us on the moon and wrote Hamlet and the Odyssey.

The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

viccis · 13h ago
>I think the answer to this question is certainly "Yes".

It is unequivocally "No". A good joint distribution estimator is always by definition a posteriori and completely incapable of synthetic a priori thought.

serbuvlad · 9h ago
The human mind is an estimator too.

The fact that the human mind can think in concepts, images AND words, and then compresses that into words for transmission, whereas LLMs think directly in words, is no object.

If you watch someone reach a ledge, your mind will generate, based on past experience, a probabilistic image of that person falling. Then it will tie that to the concept of problem (self-attention) and start generating solutions, such as warning them or pulling them back etc.

LLMs can do all this too, but only in words.

corimaith · 7m ago
Do you think language is sufficient to model reality (not just physical, but abstract) here?

I think not. We can get close, but there exist problems and situations beyond that, especially in mathematics and philosophy. And I don't think a visual medium, or a combination of the two, is sufficient either; there's a more fundamental, underlying abstract structure that we use to model reality.

viccis · 6h ago
>LLMs think

Quick aside here: They do not think. They estimate generative probability distributions over the token space. If there's one thing I do agree with Dijkstra on, it's that it's important not to anthropomorphize mathematical or computing concepts.

As far as the rest of your comment, I generally agree. It sort of fits a Kantian view of epistemology, in which we have sensibility giving way to semiotics (we'll say words and images for simplicity) and we have concepts that we understand by a process of reasoning about a manifold of things we have sensed.

That's not probabilistic though. If we see someone reach a ledge and take a step over it, then we are making a synthetic a priori assumption that they will fall. It's synthetic because there's nothing about a ledge that means the person must fall. It's possible that there's another ledge right under we can't see. Or that they're in zero gravity (in a scifi movie maybe). Etc. It's a priori because we're making this statement not based on what already happened but rather what we know will happen.

We accomplish this by forming concepts such as "ledge", "step", "person", "gravity", etc., as we experience them until they exist in our mind as purely rational concepts we can use to reason about new experiences. We might end up being wrong, we might be right, we might be right despite having made the wrong claims (maybe we knew he'd fall because of gravity, however there was no gravity but he ended up being pushed by someone and "falling" because of it, this is called a "Gettier problem"). But our correctness is not a matter of probability but rather one of how much of the situation we understand and how well we reason about it.

Either way, there is nothing to suggest that we are working from a probability model. If that were the case, you wind up in what's called philosophical skepticism [1], in which, if all we are are estimation machines based on our observances, how can we justify any statement? If every statement must have been trained by a corresponding observation, then how do we probabilistically model things like causality that we would turn to to justify claims?

Kant's not the only person to address this skepticism, but he's probably the most notable to do so, and so I would challenge you to justify whether the "thinking" done by LLMs has any analogue to the "thinking" done using the process described in my second paragraph.

[1] https://en.wikipedia.org/wiki/Philosophical_skepticism#David...

mofeien · 4h ago
> We accomplish this by forming concepts such as "ledge", "step", "person", "gravity", etc., as we experience them until they exist in our mind as purely rational concepts we can use to reason about new experiences.

So we receive inputs from the environment and cluster them into observations about concepts, and form a collection of truth statements about them. Some of them may be wrong, or apply conditionally. These are probabilistic beliefs learned a posteriori from our experiences. Then we can do some a priori thinking about them with our eyes and ears closed with minimal further input from the environment. We may generate some new truth statements that we have not thought about before (e. g. "stepping over the ledge might not cause us to fall because gravity might stop at the ledge") and assign subjective probabilities to them.

This makes the a priori seem to always depend on previous a posterioris, and simply mark the cutoff from when you stop taking environmental input into account for your reasoning within a "thinking session". Actually, you might even change your mind mid-reasoning based on the outcome of a thought experiment you perform, which you use to update your internal facts collection. This would give the a priori reasoning you're currently doing an even stronger a posteriori character. To me, these observations basically dissolve the concept of a priori thinking.

And this makes it seem like we are very much working from probabilistic models, all the time. To answer how we can know anything: If a statement's subjective probability becomes high enough, we qualify it as a fact (and may be wrong about it sometimes). But this allows us to justify other statements (validly, in ~ 1-sometimes of cases). Hopefully our world model map converges towards a useful part of the territory!
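
A toy sketch of that "high enough subjective probability becomes a fact" framing (the numbers are made up; the update rule is just Bayes' theorem):

    def update(prior, p_obs_if_true, p_obs_if_false):
        # Bayes' rule: posterior belief after one observation.
        num = p_obs_if_true * prior
        return num / (num + p_obs_if_false * (1 - prior))

    belief = 0.5  # "stepping over the ledge makes you fall"
    for _ in range(3):  # watch three people step off and fall
        belief = update(belief, 0.99, 0.05)
    print(round(belief, 6))  # close to 1, i.e. now treated as a "fact"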

serbuvlad · 5h ago
But I do not think humans think like that by default.

When I spill a drink, I don't think "gravity". That's too slow.

And I don't think humans are particularly good at that kind of rational thinking.

viccis · 5h ago
>When I spill a drink, I don't think "gravity". That's too slow.

I think you do, you just don't need to notice it. If you spilled it in the International Space Station, you'd probably respond differently even if you didn't have to stop and contemplate the physics of the situation.

nerdponx · 12h ago
That doesn't seem true to me at all. Let's say you fit y = c + bx + ax^2 on the domain [-10, 10] with 1000 data points uniformly distributed along x and with no more than 1% noise in the observed y. Your model will be pretty damn good and absolutely will be able to generate "synthetic a priori" y outputs for any given x within the domain.

Now let's say you didn't know the true function and had to use a neural network instead. You would probably still get a great result in the sense of generating "new" outputs that are not observed in the training data, as long as they are within or reasonably close to the original domain.
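
A minimal numpy sketch of the quadratic example above (coefficients and noise level are illustrative): the fit produces sensible outputs for inputs it never saw, as long as they stay inside the training domain.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c = 2.0, -3.0, 5.0                      # "true" function y = c + b*x + a*x^2
    x = rng.uniform(-10, 10, 1000)
    y = (c + b * x + a * x**2) * (1 + rng.uniform(-0.01, 0.01, x.size))  # <=1% noise

    coeffs = np.polyfit(x, y, deg=2)              # fit the quadratic

    x_new = np.array([-7.3, 0.42, 9.99])          # inputs not in the training data
    print(np.polyval(coeffs, x_new))              # close to the true values below
    print(c + b * x_new + a * x_new**2)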

LLMs are that. With enough data, enough parameters, the right inductive bias, the right RLHF procedure, etc., they are getting increasingly good at estimating a conditional next-token distribution given the context. If by "synthetic" you mean that an LLM can never generate a truly new idea that was not in its training data, then that becomes the question of what the "domain" of the data really is.

I'm not convinced that LLMs are strictly limited to ideas that they have "learned" in their data. Before LLMs, I don't think people realized just how much pattern and structure there was in human thought, and how exposed it was through text. Given the advances of the last couple of years, I'm starting to come around to the idea that text contains enough instances of reasoning and thinking that these models might develop some kind of ability to do something like reasoning and thinking simply because they would have to in order to continue decreasing validation loss.

I want to be clear that I am not at all an AI maximalist, and the fact that these things are built largely on copyright infringement continues to disgust me, as do the growing economic and environmental externalities and other problems surrounding their use and abuse. But I don't think it does any good to pretend these things are dumber than they are, or to assume that the next AI winter is right around the corner.

viccis · 7h ago
>Your model will be pretty damn good and absolutely will be able to generate "synthetic a priori" y outputs for any given x within the domain.

You don't seem to understand what synthetic a priori means. The fact that you're asking a model to generate outputs based on inputs means it's by definition a posteriori.

>You would probably still get a great result in the sense of generating "new" outputs that are not observed in the training data, as long as they are within or reasonably close to the original domain.

That's not cognition and has no epistemological grounds. You're making the assumption that better prediction of semiotic structure (of language, images, etc.) results in better ability to produce knowledge. You can't model knowledge with language alone, the logical positivists found that out to their disappointment a century or so ago.

For example, I don't think you adequately proved this statement to be true:

>they would have to in order to continue decreasing validation loss

This works if and only if the structure of knowledge lies latently beneath the structure of semiotics. In other words, if you can start identifying the "shape" of the distribution of language, you can perturb it slightly to get a new question and expect to get a new correct answer.

autobodie · 13h ago
> The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

I cannot afford to consider whether you are right because I am a slave to capital, and therefore may as well be a slave to capital's LLMs. The same goes for you.

serbuvlad · 9h ago
I am not a slave to capital. I am a slave to the harsh nature of the world.

I get too hot in summer and too cold in winter. I die of hunger. I am harassed by critters of all sorts.

And when my bed breaks, to keep my fragile spine from straining at night, I _want_ some trees to be cut, some mattresses to be provisioned, some designers to be provisioned etc. And capital is what gets me that, from people I will never meet, who wouldn't blink once if I died tomorrow.

LinXitoW · 4h ago
Considering capitalism is a very new phenomenon in human history, how do you think people survived and thrived for the other 248000 years? It's as ludicrous to believe that capitalism is some kind of force of nature as it is to believe kings were chosen by god.
serbuvlad · 3h ago
That depends on how you define your terms. A pro-capital laissez-faire policy is new, sure.

But the first civilizations in the world around 3000 BC had trade, money, banking, capital accumulation, division of labour, etc.

jes5199 · 10h ago
I think the Apple paper is practically a hack job - the problem was set up in such a way that the reasoning models must do all of their reasoning before outputting any of their results. Imagine a human trying to solve something this way: you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower - and past a certain size/complexity, it would be impossible.

And this isn’t how LLMs are used in practice! Actual agents do a thinking/reasoning cycle after each tool-use call. And I guarantee even these 6-month-old models could do significantly better if a researcher followed best practices.

Brystephor · 10h ago
Forcing reasoning is analogous to requiring a student to show their work when solving a problem, if I'm understanding the paper correctly.

> you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower

This part I don't understand. Why would coming up with an algorithm (e.g. a simple pattern) and reciting it be impossible? The paper doesn't mention the models coming up with the algorithm at all AFAIK. If the model were able to come up with the pattern required to solve the puzzles and then also execute (e.g. recite) the pattern, then that'd show understanding. However, the models didn't. So if the model can answer the same question for small inputs but not for big inputs, doesn't that imply the model is not finding a pattern for solving the answer but is more likely pulling from memory? Like, if the model could tell you Fibonacci numbers when n=5 but not when n=10, that'd imply the numbers are memorized and the pattern for generating them is not understood.
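
To make the Fibonacci analogy concrete: once you know the generation pattern, n=5 and n=10 (or n=50) are equally easy, whereas pure memorization stops wherever the memorized list ends. A minimal sketch:

    def fib(n):
        # Executing the pattern works for any n; no lookup table needed.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    print([fib(n) for n in (5, 10, 50)])  # [5, 55, 12586269025]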

qarl · 9h ago
> The paper doesnt mention the models coming up with the algorithm at all AFAIK.

And that's because they specifically hamstrung their tests so that the LLMs were not "allowed" to generate algorithms.

If you simply type "Give me the solution for Towers of Hanoi for 12 disks" into ChatGPT, it will happily give you the answer. It will write a program to solve it, and then run that program to produce the answer.

But according to the skeptical community - that is "cheating" because it's using tools. Nevermind that it is the most effective way to solve the problem.

https://chatgpt.com/share/6845f0f2-ea14-800d-9f30-115a3b644e...
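
For reference, the kind of program such a prompt elicits is the textbook recursion (a minimal sketch, not ChatGPT's actual output):

    def hanoi(n, source="A", target="C", aux="B"):
        # Moving n disks takes 2**n - 1 moves, which is why explicitly listing
        # every move blows up quickly (15 disks is already 32,767 moves).
        if n == 0:
            return []
        return (hanoi(n - 1, source, aux, target)
                + [(source, target)]
                + hanoi(n - 1, aux, target, source))

    moves = hanoi(12)
    print(len(moves))  # 4095 == 2**12 - 1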

zoul · 6h ago
This is not about finding the most effective solution, it’s about showing that they “understand” the problem. Could they write the algorithm if it were not in their training set?
boredhedgehog · 6h ago
If that's the point, shouldn't they ask the model to explain the principle for any number of discs? What's the benefit of a concrete application?
johnecheck · 4h ago
Because that would prove absolutely nothing. There are numerous examples of tower of Hanoi explanations in the training set.
elbear · 1h ago
How do you check that a human understood it and not simply memorised different approaches?
Too · 6h ago
How can one know that's not coming from the pre-training data? The paper is trying to evaluate whether the LLM has general problem-solving ability.
jsnell · 6h ago
The paper doesn't mention it because either the researchers did not care to check the outputs manually, or reporting what was in the outputs would have made it obvious what their motives were.

When this research has been reproduced, the "failures" on the Tower of Hanoi turn out to be the model printing out a bunch of steps and then saying there is no point in doing it thousands of times more. It would then output the algorithm for producing the rest, either in words or as code.

xtracto · 6h ago
I think the paper got unwanted attention... for a scientific paper. It's like that old "gravity shielding" paper about Podkletnov's rings experiment, which got publicized by a UK newspaper as "scientists find antigravity" and ended up destroying the Russian author's career.

By the way, it seems the Apple researchers got their title from this [1] older Chinese paper. The Chinese authors made a very similar argument, without the experiments. I myself believe Apple's experiments are good curiosities, but don't drive home as much of a point as they believe.

[1] https://arxiv.org/abs/2506.02878

wohoef · 15h ago
Good article offering some critique of Apple's paper and of Gary Marcus specifically.

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...

godelski · 9h ago

  > this is a preprint that has not been peer reviewed.
This conversation is peer review...

You don't need a conference for something to be peer reviewed, you only need... peers...

In fact, this paper is getting more peer review than most works. Conferences are notoriously noisy as reviewers often don't care and are happy to point out criticisms. All works have valid criticisms... Finding criticisms is the easy part. The hard part is figuring out if these invalidate the claims or not.

hintymad · 14h ago
Honest question: does the opinion of Gary Marcus still count? His criticism seems more philosophical than scientific. It's hard for me to see what he builds, or how he reasons, to get to his conclusions.
zer00eyz · 13h ago
> seems more philosophical than scientific

I think this is a fair assessment, but reason and intelligence don't really have an established control or control group. If you build a test and say "It's not intelligent because it can't..." and someone goes out and adds that feature in, is it suddenly now intelligent?

If we make a physics breakthrough tomorrow, is there any LLM that is going to retain that knowledge permanently as part of its core, or will they all need to be re-trained? Can we make a model that is as smart as a 5th grader without shoving the whole corpus of human knowledge into it, folding it over twice, and then training it back out?

The current crop of tech doesn't get us to AGI. And the focus on making it "better" is for the most part a fool's errand. The real winners in this race are going to be those who hold the keys to optimization: short retraining times, smaller models (with less upfront data), optimized for lower-performance systems.

hintymad · 10h ago
> The current crop of tech doesn't get us to AGI

I actually agree with this. Time and again, I can see that LLMs do not really understand my questions, let alone being able to perform logical deductions beyond in-distribution answers. What I’m really wondering is whether Marcus’s way of criticizing LLMs is valid.

Workaccount2 · 11h ago
What gets me, and the author talks about it in the post, is that people will readily attribute correct answers to "it's in the training set", but nobody says anything about incorrect answers that are in the training set. LLMs get stuff from the training set wrong all the time, but nobody uses that as evidence that they probably can't lean too hard on memorization for the complex questions they do get right.

It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

thrwaway55 · 8h ago
Do you hypothesize that they see more wrong examples than right ones? If they are reasoning and can sort it out, why is there concern about model collapse, and why does the data even need to be scrubbed before training?

How many r's really are in Strawberry?

Jensson · 10h ago
> It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

Both of those can be true at the same time though. They memorize a lot of things, but it's fuzzy, and when they remember wrong they cannot fix it via reasoning.

Workaccount2 · 10h ago
It's more than fuzzy: they are packing exabytes, perhaps zettabytes, of training data into a few terabytes. Without any reasoning ability, it must be divine intervention that they ever get anything right...
chongli · 1h ago
It is divine intervention if you believe human minds are the product of a divine creator. Most of the attribution of miraculous reasoning ability on the part of LLMs I would attribute to pareidolia on the part of their human evaluators. I don’t think we’re much closer at all to having an AI which can replace an average minimum wage full-time worker, who will work largely unsupervised but ask their manager for help when needed, without screwing anything up.

We have LLMs that can produce copious text but cannot stop themselves from attempting to solve a problem they have no idea how to solve and making a mess of things as a result. This puts an LLM on the level of an overly enthusiastic toddler at best.

labrador · 15h ago
The key insight is that LLMs can 'reason' when they've seen similar solutions in training data, but this breaks down on truly novel problems. This isn't reasoning exactly, but close enough to be useful in many circumstances. Repeating solutions on demand can be handy, just like repeating facts on demand is handy. Marcus gets this right technically but focuses too much on emotional arguments rather than clear explanation.
swat535 · 15h ago
If that were the case, it would have been great already, but these tools can't even do that. They frequently make mistakes repeating the same solutions available everywhere during their "reasoning" process and fabricate plausible hallucinations which you then have to inspect carefully to catch.
woopsn · 13h ago
That alone would be revolutionary - but still aspirational for now. The other day Gemini mixed up left and right on me in response to a basic textbook problem.
Jabrov · 15h ago
I’m so tired of hearing this be repeated, like the whole “LLMs are _just_ parrots” thing.

It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there’s so many examples out there.

______________

Edit for responders, instead of replying to each:

We obviously have to define what we mean by "reasoning" and "solving novel problems". From my point of view, reasoning != general intelligence. I also consider reasoning to be a spectrum. Just because it cannot solve the hardest problem you can think of does not mean it cannot reason at all. Do note, I think LLMs are generally pretty bad at reasoning. But I disagree with the point that LLMs cannot reason at all or never solve any novel problems.

In terms of some backing points/examples:

1) Next token prediction can itself be argued to be a task that requires reasoning

2) You can construct a variety of language translation tasks, with completely made up languages, that LLMs can complete successfully. There's tons of research about in-context learning and zero-shot performance.

3) Tons of people have created all kinds of challenges/games/puzzles to prove that LLMs can't reason. One by one, they invariably get solved (eg. https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224..., https://ahmorse.medium.com/llms-and-reasoning-part-i-the-mon...) -- sometimes even when the cutoff date for the LLM is before the puzzle was published.

4) Lots of examples of research about out-of-context reasoning (eg. https://arxiv.org/abs/2406.14546)

In terms of specific rebuttals to the post:

1) Even though they start to fail at some complexity threshold, it's incredibly impressive that LLMs can solve any of these difficult puzzles at all! GPT3.5 couldn't do that. We're making incremental progress in terms of reasoning. Bigger, smarter models get better at zero-shot tasks, and I think that correlates with reasoning.

2) Regarding point 4 ("Bigger models might do better"): I think this is very dismissive. The paper itself shows a huge variance in the performance of different models. For example, in figure 8, we see Claude 3.7 significantly outperforming DeepSeek and maintaining stable solutions for a much longer sequence length. Figure 5 also shows that better models and more tokens improve performance on "medium" difficulty problems. Just because it cannot solve the "hard" problems does not mean it cannot reason at all, nor does it necessarily mean it will never get there. Many people were saying a few years ago that we'd never be able to solve problems like the medium ones, but now the goalposts have just shifted.

aucisson_masque · 14h ago
> It’s patently obvious that LLMs can reason and solve novel problems not in their training data.

Would you care to tell us more?

« It’s patently obvious » is not really an argument; I could say just as well that everyone knows LLMs can't reason or think (in the way we living beings do).

socalgal2 · 6h ago
I'm working on a new API. I asked the LLM to read the spec and write tests for it. It did. I don't know if that's "reasoning". I know that no tests exist for this API. I know that the internet is not full of training data for this API, because it's a new API. It's also not a CRUD API or some other API with a common pattern. And yet, with a very short prompt, Gemini Code Assist wrote valid tests for a new feature.

It certainly feels like more than fancy auto-complete. That is not to say I haven't run into issue but I'm still often shocked at how far it gets. And that's today. I have no idea what to expect in 6 months, 12, 2 years, 4, etc.

travisjungroth · 10h ago
Copied from a past comment of mine:

I just made up this scenario and these words, so I'm sure it wasn't in the training data.

Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.

Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier. I have an erork that needs to be plimfed. Choose one group and one method to do it.

> Use Plyzers and do a Quoning procedure on your erork.

If that doesn't count as reasoning or generalization, I don't know what does.

firesteelrain · 10h ago
It’s just a truth table. I had a hunch that it was a truth table, and when I asked the AI how it figured it out, it confirmed it built a truth table. Still impressive either way.

* Goal: Pick (Group ∧ Method) such that Group can plimf ∧ Method is a type of plimfing

* Only one group (Plyzers) passes the "can plimf" test

* Only one method (Quoning) is definitely plimfing

Therefore, the only valid (Group ∧ Method) combo is: → (Plyzer ∧ Quoning)

Source: ChatGPT
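
A minimal encoding of that truth table (the boolean values are read off the puzzle statement; the dict layout is just an illustration):

    groups = {"Kwomps": False, "Ghirns": False, "Plyzers": True}  # can plimf?
    methods = {"Quoning": True, "Zhuning": False}                 # definitely a type of plimfing?

    # The only valid (group, method) pair is one where both conditions hold.
    valid = [(g, m) for g, can_plimf in groups.items() if can_plimf
                    for m, is_plimfing in methods.items() if is_plimfing]
    print(valid)  # [('Plyzers', 'Quoning')]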

travisjungroth · 10h ago
So? Is the standard now that reasoning using truth tables or reasoning that can be expressed as truth tables doesn’t count?
krackers · 9h ago
If anything you'd think that the neurosymbolic people would be pleased that the LLMs do in fact reason by learning circuits representing boolean logic and truth tables. In a way they were right, it's just that starting with logic and then feeding in knowledge grounded in that logic (like Cyc) seems less scalable than feeding in knowledge and letting the model infer the underlying logic.
firesteelrain · 8h ago
Right, that’s my point. LLMs are doing pattern abstraction and in this way can mimic logic. They are not trained explicitly to do just truth tables, even though truth tables are fundamental.
goalieca · 14h ago
So far they cannot even answer questions which are straight-up fact-checking, search-engine-like queries. Reasoning means they would be able to work through a problem and generate a proof the way a student might.
Workaccount2 · 11h ago
So if they have bad memory, then they must be reasoning to get the correct answer for the problems they do solve?
Jensson · 10h ago
A clock that is right twice a day is still broken.
Workaccount2 · 9h ago
I think it's more fair to say a clock that is wrong twice a day is still broken...
astrange · 6h ago
> It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data.

So can real parrots. Parrots are pretty smart creatures.

bfung · 15h ago
Any links or examples available? Curious to try it out
labrador · 15h ago
I've done this exercise dozens of times because people keep saying it, but I can't find an example where this is true. I wish it were. I'd be solving world problems with novel solutions right now.

People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.

jjaksic · 13h ago
"Solving novel problems" does not mean "solving world problems that even humans are unable to solve", it simply means solving problems that are "novel" compared to what's in the training data.

Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.

jhanschoo · 13h ago
I think that "solving world problems with novel solutions" is a strawman for an ability to reason well. We cannot solve world problems with reasoning, because pure reasoning has no relation to reality. We lack data and models about the world to confirm and deny our hypotheses about the world. That is why the empirical sciences do experiments instead of sit in an armchair and mull all day.
andrewmcwatters · 14h ago
It's definitely not true in any meaningful sense. There are plenty of us practitioners in software engineering wishing it was true, because if it was, we'd all have genius interns working for us on Mac Studios at home.

It's not true. It's plainly not true. Go have any of these models, paid or local, try to build you novel solutions to hard, existing problems, despite being, in some cases, trained on literally the entire compendium of open knowledge in not just one but multiple adjacent fields. Not to mention the fact that being able to abstract general knowledge would mean it would be able to reason.

They. Cannot. Do it.

I have no idea what you people are talking about because you cannot be working on anything with real substance that hasn't been perfectly line fit to your abundantly worked on problems, but no, these models are obviously not reasoning.

I built a digital employee and gave it menial tasks comparable to what current cloud solutions (which also claim to provide paid cloud AI employees) handle, and these things are stupider than fresh college grads.

multjoy · 15h ago
Lol, no.
lossolo · 15h ago
They can't create anything novel and it's patently obvious if you understand how they're implemented. But I'm just some anonymous guy on HN, so maybe this time I will just cite the opinion of the DeepMind CEO, who said in a recent interview with The Verge (available on YouTube) that LLMs based on transformers can't create anything truly novel.
jjaksic · 12h ago
Since when is reasoning synonymous with invention? All humans with a functioning brain can reason, but only a tiny fraction have or will ever invent anything.
labrador · 15h ago
"I don't think today's systems can invent, you know, do true invention, true creativity, hypothesize new scientific theories. They're extremely useful, they're impressive, but they have holes."

Demis Hassabis On The Future of Work in the Age of AI (@ 2:30 mark)

https://www.youtube.com/watch?v=CRraHg4Ks_g

lossolo · 15h ago
Yes, this one. Thanks
gjm11 · 13h ago
He doesn't say "that LLMs based on transformers can't create anything truly novel". Maybe he thinks that, maybe not, but what he says is that "today's systems" can't do that. He doesn't make any general statement about what transformer-based LLMs can or can't do; he's saying: we've interacted with these specific systems we have right now and they aren't creating genuinely novel things. That's a very different claim, with very different implications.

Again, for all I know maybe he does believe that transformer-based LLMs as such can't be truly creative. Maybe it's true, whether he believes it or not. But that interview doesn't say it.

aucisson_masque · 15h ago
That’s the opposite of reasoning though. AI bros want to make people believe LLMs are smart, but they're not capable of intelligence and reasoning.

Reasoning mean you can take on a problem you’ve never seen before and think of innovative ways to solve it.

An LLM can only replicate what is in its data. It can in no way think or guess or estimate what will likely be the best solution; it can only output a solution based on a probability calculation over how frequently it has seen that solution linked to that problem.

labrador · 14h ago
You're assuming we're saying LLMs can't reason. That's not what we're saying. They can execute reasoning-like processes when they've seen similar patterns, but this breaks down when truly novel reasoning is required. Most people do the same thing. Some people can come up with novel solutions to new problems, but LLMs will choke. Here's an example:

Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."

I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:

    Roughly 3 million shipwrecks on ocean floors globally
    Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
    So ~3,000 ships with pianos sunk
    Average maybe 0.5 pianos per ship (not all passenger areas had them)
    Estimate: ~1,500 pianos
*Claude Sonnet 4, Google Gemini 2.5 and GPT 4o
kgeist · 13h ago
GPT4o isn't considered an "advanced" LLM at this point. It doesn't use reasoning.

I gave your prompt to o3 pro, and this is what I got without any hints:

  Historic shipwrecks (1850 → 1970)
  • ~20 000 deep water wrecks recorded since the age of steam and steel  
  • 10 % were passenger or mail ships likely to carry a cabin class or saloon piano   
  • 1 piano per such vessel 20 000 × 10 % × 1 ≈ 2 000

  Modern container losses (1970 → today)
  • ~1 500 shipping containers lost at sea each year  
  • 1 in 2 000 containers carries a piano or electric piano   
  • Each piano container holds ≈ 5 units   
  • 50 year window 1 500 × 50 / 2 000 × 5 ≈ 190

  Coastal disasters (hurricanes, tsunamis, floods)
  • Major coastal disasters each decade destroy ~50 000 houses  
  • 1 house in 50 owns a piano   
  • 25 % of those pianos are swept far enough offshore to sink and remain (50 000 / 50) × 25 % × 5 decades ≈ 1 250

  Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300

  Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.
yen223 · 12h ago
The difference between o3 and o4-mini is so substantial I think this is the reason why people can't agree on how capable LLMs are nowadays.
theendisney · 8h ago
The correct answer is: I'm sorry, I don't have time for this.
FINDarkside · 13h ago
What does "choked on it" mean for you? Gemini 2.5 pro gives this, even estimating what amouns of those 3m ships that sank after pianos became common item. Not pasting the full reasoning here since it's rather long.

Combining our estimates:

From Shipwrecks: 12,500
From Dumping: 1,000
From Catastrophes: 500
Total Estimated Pianos at the Bottom of the Sea ≈ 14,000

Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.

Jabrov · 14h ago
That seems like a totally reasonable response ... ?
labrador · 14h ago
I think you missed the part where I had to give them hints to solve it. All three initially couldn't, or refused, saying it was not a real problem on their first try.
ej88 · 13h ago
Can you share the chats? I tried with o3 and it gave a pretty reasonable answer on the first try.

https://chatgpt.com/share/684e02de-03f0-800a-bfd6-cbf9341f71...

Jabrov · 12h ago
You must be on the wrong side of an A/B test or very unlucky.

Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.

gjm11 · 13h ago
FWIW I just gave a similar question to Claude Sonnet 4 (I asked about something other than pianos, just in case they're doing some sort of constant fine-tuning on user interactions[1] and to make it less likely that the exact same question is somewhere in its training data[2]) and it gave a very reasonable-looking answer. I haven't tried to double-check any of its specific numbers, some of which don't match my immediate prejudices, but it did the right sort of thing and considered more ways for things to end up on the ocean floor than I instantly thought of. No hints needed or given.

[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.

[2] I picked something a bit more obscure than pianos.

dialup_sounds · 13h ago
How much of that is inability to reason vs. being trained to avoid making things up?
Dzugaru · 2h ago
> just as humans shouldn’t serve as calculators

But they definitely could, and were [0]. You just employ multiple people and cross-check, with the ability of every single one to also double-check and correct errors.

LLMs cannot double check, and multiples won't really help (I suspect ultimately for the same reason - exponential multiplication of errors [1])

[0] https://en.wikipedia.org/wiki/Computer_(occupation)

[1] https://www.tobyord.com/writing/half-life
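
A toy illustration of that exponential error multiplication (the per-step success probability is made up): if each step of a long chain is right independently with probability p, the whole chain is right with probability p^n.

    def chain_success(p, n):
        # Probability that all n independent steps are correct.
        return p ** n

    for n in (10, 100, 1000, 32767):   # 32,767 moves = a 15-disk Tower of Hanoi
        print(n, chain_success(0.999, n))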

ummonk · 15h ago
Most of the objections and their counterarguments either seem like poor objections (e.g. the ad hominem against the first listed author) or seem to be subsumed under point 5. It's annoying that this post spends so much effort discussing the other objections when the important discussion is the one to be had in point 5:

I.e. to what extent are LLMs able to reliably make use of writing code or using logic systems, and to what extent does hallucinating / providing faulty answers in the absence of such tool access demonstrate an inability to truly reason (I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

thomasahle · 14h ago
> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

That's what the models did. They gave the first 100 steps, then explained how it was too much to output all of it, and gave the steps one would follow to complete it.

They were graded as "wrong answer" for this.

---

Source: https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...

> If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"

> At least for Sonnet, it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem is and the algorithm to solve it, and then output its solution without even thinking about individual steps.

sponnath · 10h ago
Didn't they start failing well before they hit token limits? I'm not sure what point the source you linked to is trying to make.
thomasahle · 5h ago
OP said:

> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

And that's what the models did.

This is a good answer from the model. Has nothing to do with token limits.

emp17344 · 13h ago
Why should we trust a guy with the following twitter bio to accurately replicate a scientific finding?

>lead them to paradise

>intelligence is inherently about scaling

>be kind to us AGI

Who even is this guy? He seems like just another r/singularity-style tech bro.

FINDarkside · 14h ago
I don't think most of the objections are poor at all, apart from 3; it's this article that makes lots of strawmen. The first objection especially is often heard because people claim "this paper proves LLMs don't reason". The author moves the goalposts and argues instead about whether LLMs lead to AGI, which is itself a strawman of those arguments. In addition, he even seems to misunderstand AGI, thinking it's some sort of superintelligence ("We have every right to expect machines to do things we can't"). AI that can do everything at least as well as an average human is AGI by definition.

It's an especially weird argument considering that LLMs are already ahead of humans on the Tower of Hanoi. I bet the average person would not be able to "one-shot" the moves for an 8-disk Tower of Hanoi without writing anything down or tracking the state with actual disks. LLMs have far bigger obstacles to reaching AGI, though.

5 is also a massive strawman, with the "not see how well it could use preexisting code retrieved from the web" bit, given that these models will write code to solve these kinds of problems even if you come up with some new problem that wouldn't exist in their training data.

Most of these are just valid issues with the paper. They're not supposed to be arguments that make everything the paper said invalid. The paper didn't really even make any bold claims; it only concluded that LLMs have limitations in their reasoning. It had a catchy title and many people didn't read past that.

chongli · 11h ago
> It's an especially weird argument considering that LLMs are already ahead of humans on the Tower of Hanoi

No one cares about Towers of Hanoi. Nor do they care about any other logic puzzles like this. People want AIs that solve novel problems for their businesses. The kind of problems regular business employees solve every single day yet LLMs make a mess of.

The purpose of the Apple paper is not to reveal the fact that LLMs routinely fail to solve these problems. Everyone who uses them already knows this. The paper is an argument for why this happens (lack of reasoning skills).

No number of demonstrations of LLMs solving well-known logic puzzles (or other problems humans have already solved) will prove reasoning. It's not interesting at all to solve a problem that humans have already solved (with working software to solve every instance of the problem).

ummonk · 13h ago
I'm more saying that points 1 and 2 get subsumed under point 5 - to the extent that existing algorithms / logical systems for solving such problems are written by humans, an AGI wouldn't need to match the performance of those algorithms / logical systems - it would merely need to be able to create / use such algorithms and systems itself.

You make a good point though that the question of whether LLMs reason or not should not be conflated with the question of whether they're on the pathway to AGI or not.

FINDarkside · 12h ago
Right, I agree there. Also, that's something LLMs can already do. If you give the problem to ChatGPT's o3 model, it will actually write Python code, run it, and give you the solution. But I think points 1 and 2 are still very valid things to talk about, because while the Tower of Hanoi can be solved by writing code, that doesn't apply to every problem that would require extensive reasoning.
neoden · 12h ago
> Puzzles a child can do

Certainly, I couldn't solve the Tower of Hanoi with 8 disks purely in my mind without being able to write down the state at every step or having the physical state in front of me. Are we comparing apples to apples?

bluefirebrand · 16h ago
I'm glad to read articles like this one, because I think it is important that we pour some water on the hype cycle

If we want to get serious about using these new AI tools then we need to come out of the clouds and get real about their capabilities

Are they impressive? Sure. Useful? Yes probably in a lot of cases

But we cannot continue the hype this way, it doesn't serve anyone except the people who are financially invested in these tools.

senko · 15h ago
Gary Marcus isn't about "getting real"; he's making a name for himself as a contrarian to the popular AI narrative.

This article may seem reasonable, but here he's defending a paper that in his previous article he called "A knockout blow for LLMs".

Many of his articles seem reasonable (if a bit off) until you read a couple dozen and spot a trend.

steamrolled · 15h ago
> Gary Marcus isn't about "getting real", it's making a name for himself as a contrarian to the popular AI narrative.

That's an odd standard. Not wanting to be wrong is a universal human instinct. By that logic, every person who ever took any position on LLMs is automatically untrustworthy. After all, they made a name for themselves by being pro- or con-. Or maybe a centrist - that's a position too.

Either he makes good points or he doesn't. Unless he has a track record of distorting facts, his ideological leanings should be irrelevant.

senko · 14h ago
He makes many very good points:

For example, he continuously calls out AGI hype for what it is, and also showcases the dangers of naive use of LLMs (e.g. lawyers copy-pasting hallucinated cases into their documents). For this, he has plenty of material!

He also makes some very bad points and worse inferences: that LLMs as a technology are useless because they can't lead to AGI; that hallucination makes LLMs useless (but then he contradicts himself in another article, conceding they "may have some use"); that because they can't follow an algorithm they're useless; that scaling laws are over and therefore LLMs won't advance (a claim he's been making for a couple of years); that the AI bubble will collapse in a few months (also a few years of that); etc.

Read any of his articles (I've read too many, sadly) and you'll never come to the conclusion that LLMs might be a useful technology, or be "a good thing" even in some limited way. This just doesn't fit with the reality I can observe with my own eyes.

To me, this shows he's incredibly biased. That's okay if he wants to be a pundit - I couldn't blame Gruber for being biased about Apple! But Marcus presents himself as the authority on AI, a scientist offering a real and unbiased view of the field. In fact, he's as full of hype as Sam Altman is, just in the other direction.

Imagine he was talking about aviation, not AI. A 787 Dreamliner crashes? "I've been saying for 10 years that airplanes are unsafe, they can fall from the sky!" Boeing the company does stupid shit? "Blown door shows why airplane makers can't be trusted" An airline goes bankrupt? "Air travel winter is here"

I've spoken to too many intelligent people who read Marcus, take him at his words and have incredibly warped views on the actual potential and dangers of AI (and send me links to his latest piece with "so this sounds pretty damning, what's your take?"). He does real damage.

Compare him with Simon Willison, who also writes about AI a lot, and is vocal about its shortcomings and dangers. Reading Simon, I never get the feeling I'm being sold on a story (either positive or negative), but that I learned something.

Perhaps a Marcus is inevitable as a symptom of the Internet's immune system to the huge amount of AI hype and bullshit being thrown around. Perhaps Gary is just fed up with everything and comes out guns blazing, science be damned. I don't know.

But in my mind, he's as much of a BSer as the AGI singularity hypers.

ImageDeeply · 13h ago
> Compare him with Simon Willison, who also writes about AI a lot, and is vocal about its shortcomings and dangers. Reading Simon, I never get the feeling I'm being sold on a story (either positive or negative), but that I learned something.

Very true!

sinenomine · 14h ago
Marcus' points routinely fail to pass scrutiny; nobody in the field takes him seriously. If you seek real, scientifically interesting LLM criticism, read François Chollet and his ARC-AGI series of evals.
adamgordonbell · 15h ago
This!

For all his complaints about llms, his writing could be generated by an llm with a prompt saying: 'write an article responding to this news with an essay saying that you are once again right that this AI stuff is overblown and will never amount to anything.'

woopsn · 13h ago
Given that the links work, the quotes were actually said, the numbers are correct, the cited research actually exists, etc., we can immediately rule that out.
2muchcoffeeman · 15h ago
What’s the argument here that he’s not considering all the information regarding GenAI?

That there’s a trend to his opinion?

If I consider all the evidence regarding gravity, all my papers will be “gravity is real”.

In what ways is he only choosing what he wants to hear?

senko · 14h ago
Replied elsewhere in the thread: https://news.ycombinator.com/item?id=44279283

To your example about gravity, I argue that he goes from "gravity is real" to "therefore we can't fly", and "yeah maybe some people can but that's not really solving gravity and they need to go down eventually!"

2muchcoffeeman · 3h ago
If your argument about my gravity example holds, that's not really a good argument. Between Newton's death and the first powered flight was almost 200 years. Being all negative about gravity would have been reasonable, since a bunch of stuff had to happen first.

I’m not sure I buy your longer argument either.

I have a feeling the naysayers are right on this. The next leap in AI isn't something we're going to recognise. (Obviously it's possible - humans exist.)

ramchip · 11h ago
I was very put off by his article "A knockout blow for LLMs?", especially all the fuss he was making about using his own name as a verb to mean debunking AI hype...
ninjin · 10h ago
Marcus comes with a very standard cognitive science criticism of statistical approaches to artificial intelligence, many parts of which date back to the late 50s, when the field was born and moved to distance itself from behaviourism. The worst part to me is not that his criticism is entirely wrong, but rather that it is obvious and yet peddled as something that those of us who develop statistical approaches are completely ignorant of. To make matters worse, instead of developing alternative approaches (like plenty of my colleagues in cognitive science do!), he simply reiterates pretty much the same points over and over and has done so at least for the last twenty or so years. He and others paint themselves as sceptics and bulwarks against the current hype (which I can assure you, I hate at least as much as they do). But, to me, they are cynics, not sceptics.

I try to maintain a positive and open mind about other researchers, but Marcus lost me pretty much at "first contact" when a student in the group who leaned towards cognitive science had us read "Deep Learning: A Critical Appraisal" by Marcus (2018) [1] back around when it was published. Finally I could get into the mind of this guy so many people were talking about! 27 pages and yet I learned next to nothing new, as the criticism was just the same one we have heard for decades: "Statistical learning has limits! It may not lead to 'truly' intelligent machines!". Not only that, the whole piece consistently conflates deep learning and statistical learning for no reason at all, reads as if it was rushed (and not proofed), emphasises the author's research strongly rather than giving a broad overview, etc. In short, it is bad, very bad as a scientific piece. At times, I read short excerpts of an article Marcus has written and yet sadly it is pretty much the same thing all over again.

[1]: https://arxiv.org/abs/1801.00631

There is a horrible market to "sell" hype when it comes to artificial intelligence, but there is also a horrible market to "sell" anti-hype. Sadly, both bring traffic, attention, talk invitations, etc. Two largely unscientific tribes that I personally would rather do without, each with its own profiting gurus.

newswasboring · 14h ago
What exactly is your objection here? That the guy has an opinion and is writing about it?
senko · 14h ago
Replied elsewere in the thread: https://news.ycombinator.com/item?id=44279283
fhd2 · 16h ago
Even among the people invested in these tools, hype only benefits those attempting a pump and dump scheme, or those selling training, consulting, or similar services around AI.

People who try to make genuine progress, while there's more money in it now, might just have to deal with another AI winter soon at this rate.

bluefirebrand · 16h ago
> hype only benefits those attempting a pump and dump scheme

I read some posts the other day saying Sam Altman sold off a ton of his OpenAI shares. Not sure if it's true and I can't find a good source, but if it is true then "pump and dump" does look close to the mark

aeronaut80 · 16h ago
You probably can’t find a good source because sources say he has a negligible stake in OpenAI. https://www.cnbc.com/amp/2024/12/10/billionaire-sam-altman-d...
bluefirebrand · 15h ago
Interesting

When I did a cursory search, this information didn't turn up either

Thanks for correcting me. I suppose the stuff I saw the other day was just BS then

aeronaut80 · 14h ago
To be fair I struggle to believe he’s doing it out of the goodness of his heart.
spookie · 15h ago
I think the same thing: we need more breakthroughs. Until then, it is still risky to rely on AI for most applications.

The sad thing is that most would take this comment the wrong way, assuming it is just another doomer take. No, there is still a lot to do, and promising the world too soon will only lead to disappointment.

Zigurd · 14h ago
This is the thing of it: "for most applications."

LLMs are not thinking. The way they fail, which is confidently and articulately, is one way they reveal there is no mind behind the bland but well-structured text.

But if I was tasked with finding 500 patents with weak claims, or claims that have been litigated and knocked down, I would turn to LLMs to help automate that. One or two "nines" of reliability is fine, and LLMs would turn this previously impossible task into something plausible to take on.

mountainriver · 16h ago
I’ll take critiques from someone who knows what a test train split is.

The idea that a guy so removed from machine learning has something relevant to say about its capabilities really speaks to the state of AI fear

Spooky23 · 15h ago
The idea that practitioners would try to discredit research to protect the golden goose from critique speaks to human nature.
mountainriver · 14h ago
No one is discrediting research from valid places, this is the victim alt-right style narrative that seems to follow Gary Marcus around. Somehow the mainstream is "suppressing" the real knowledge
devwastaken · 15h ago
Experts are often too blinded by their paychecks to see how much nonsense their expertise is
mountainriver · 14h ago
Not knowing the most basic things about the subject you are critiquing is utter nonsense. Defending someone who does this is even worse
bluefirebrand · 7h ago
I think it's pretty fair to be critical of what LLMs are producing and how they fit into the tools without necessarily understanding how they work

If you bought a chainsaw that broke when you tried to cut down a tree, then you can criticize the chainsaw without knowing how the motor on it works, right?

soulofmischief · 15h ago
[citation needed]
Spooky23 · 15h ago
Remember Web 3.0? Lol
Zigurd · 14h ago
It's unfortunate that a discussion about LLM weaknesses is giving crypto bro. But telling. There are a lot of bubble valuations out there.
bandrami · 15h ago
How actually useful are they though? We've had more than a year now of saying these things 10X knowledge workers and creatives, so.... where is the output? Is there a new office suite I can try? 10 times as many mobile apps? A huge new library of ebooks? Is this actually in practice producing things beyond Ghibli memes and RETVRN nostalgia slop?
2muchcoffeeman · 15h ago
I think it largely depends on what you’re writing. I’ve had it reply to corporate emails which is good since I need to sound professional not human.

If I’m coding it still needs a lot of baby sitting and sometimes I’m much faster than it.

Gigachad · 15h ago
And then the person on the other end is using AI to summarise the email back to normal English. To what end?
js8 · 14h ago
But look the GDP has increased!
bandrami · 14h ago
But that's what I don't get: it hasn't in that scenario because that doesn't lead to a greater circulation of money at any point. And that's the big thing I'm looking for: something AI has created that consumers are willing to pay for. Because if that doesn't end up happening no amount of sunk investment is going to save the ecosystem.
bandrami · 14h ago
So this would be an interesting output to measure but I have no idea how we would do that: has the volume of corporate email gone up? Or the time spent creating it gone down?
bigyabai · 15h ago
There's something innately funny about "HN's undying optimism" and "bad-news paper from Apple" reaching a head like this. An unstoppable object is careening towards an impervious wall, anything could happen.
DiogenesKynikos · 15h ago
I don't understand what people mean when they say that AI is being hyped.

AI is at the point where you can have a conversation with it about almost anything, and it will answer more intelligently than 90% of people. That's incredibly impressive, and normal people don't need to be sold on it. They're just naturally impressed by it.

woopsn · 14h ago
If the claims about AI were that it is a great or even incredible chat app, there would be no mismatch.

I think normal people understand curing all disease, replacing all value, generating 100x stock market returns, uploading our minds etc to be hype.

I said a few days ago that the LLM is an amazing product. It's sad that these people ruin their credibility immediately upon success.

FranzFerdiNaN · 15h ago
I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic). It needs to be right all the time or at least tell you when it doesn’t know for sure, instead of just making up something. Comparing it to going out in the streets and asking random people random questions is not a good comparison.
amohn9 · 13h ago
It might not fit your work, but there are tons of areas where “good enough” can still provide a lot of value. I’m sure you’d be thrilled with a tool that could correctly tell you if Apple’s stock was going up or down tomorrow 70% of the time.
chongli · 1h ago
I work in a mail room sending hard copy letters to customers. If I got my job right only 70% of the time then I’d be causing massive privacy breaches daily by sending the wrong personal information to the wrong customers.

Would you trust an AI that gets your banking transactions right only 70% of the time?

newswasboring · 14h ago
> I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic).

Where are you getting this from? 70%?

travisgriggs · 15h ago
I get even better results talking to myself.
georgemcbay · 14h ago
AI, in the form of LLMs, can be a useful tool.

It is still being vastly overhyped, though, by people attempting to sell the idea that we are actually close to an AGI "singularity".

Such overhype is usually easy to handwave away as like not my problem. Like, if investors get fooled into thinking this is anything like AGI, well, a fool and his money and all that. But investors aside this AI hype is likely to have some very bad real world consequences based on the same hype-men selling people on the idea that we need to generate 2-4 times more power than we currently do to power this godlike AI they are claiming is imminent.

And even right now there's massive real-world impact in the form of, say, how much Grok's data center is polluting Memphis.

hellohello2 · 15h ago
It's quite simple: people upvote content that makes them feel good. Most of us here are programmers, and the idea that many of our skills are becoming replaceable feels quite bad. Hence, people upvote delusional statements that let them believe in something that feels better than objective reality. With any luck, these comments will be scraped and used to train the next AI generation, relieving it from the burden of factuality at last.
hrldcpr · 16h ago
In case anyone else missed the original paper (and discussion):

https://news.ycombinator.com/item?id=44203562

dang · 14h ago
Thanks! Macroexpanded:

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] - https://news.ycombinator.com/item?id=44203562 - June 2025 (269 comments)

Also this: A Knockout Blow for LLMs? - https://news.ycombinator.com/item?id=44215131 - June 2025 (48 comments)

Were there others?

thomasahle · 14h ago
> 5. A student might complain about a math exam requiring integration or differentiation by hand, even though math software can produce the correct answer instantly. The teacher’s goal in assigning the problem, though, isn’t finding the answer to that question (presumably the teacher already know the answer), but to assess the student’s conceptual understanding. Do LLM’s conceptually understand Hanoi? That’s what the Apple team was getting at. (Can LLMs download the right code? Sure. But downloading code without conceptual understanding is of less help in the case of new problems, dynamically changing environments, and so on.)

Why is he talking about "downloading" code? The LLMs can easily write out the code themselves.

If the student wrote a software program for general differentiation during the exam, they obviously would have a great conceptual understanding.
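
And "a program for general differentiation" doesn't have to be large. A minimal sketch (my own toy expression encoding, not anything from the article or an actual exam) of the sum, product, and power rules:

    # Toy symbolic differentiator over nested tuples:
    # ("const", c), ("x",), ("add", a, b), ("mul", a, b), ("pow", n) meaning x**n.
    def diff(expr):
        kind = expr[0]
        if kind == "const":
            return ("const", 0)
        if kind == "x":
            return ("const", 1)
        if kind == "add":                              # (a + b)' = a' + b'
            return ("add", diff(expr[1]), diff(expr[2]))
        if kind == "mul":                              # product rule: (ab)' = a'b + ab'
            a, b = expr[1], expr[2]
            return ("add", ("mul", diff(a), b), ("mul", a, diff(b)))
        if kind == "pow":                              # power rule: (x**n)' = n * x**(n-1)
            n = expr[1]
            return ("mul", ("const", n), ("pow", n - 1))
        raise ValueError(f"unknown expression: {expr!r}")

    # d/dx (3*x**2 + x) -> an unsimplified tree equivalent to 6*x + 1
    print(diff(("add", ("mul", ("const", 3), ("pow", 2)), ("x",))))

The rules encoded in those branches are exactly the conceptual understanding the teacher is probing for.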

autobodie · 14h ago
If the student could reference notes a fraction of the size of the LLM then I would not be convinced.
Workaccount2 · 11h ago
LLMs are (it's suspected) a few TB in size.

Gemma 2 27B, one of the top-ranked open-weight models, is ~60GB in size. Llama 405B is about 1TB.

Mind you, they train on likely petabytes of data. That alone should be a strong indication that there is a lot more than memorization going on here.
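
As a rough sanity check on those figures, on-disk size is roughly parameter count times bytes per weight (assuming 16-bit weights; quantized releases are smaller):

    # Back-of-the-envelope: weights on disk ~= parameter count * bytes per parameter.
    for name, params in [("Gemma 2 27B", 27e9), ("Llama 405B", 405e9)]:
        size_gb = params * 2 / 1e9           # 2 bytes per parameter (bf16/fp16)
        print(f"{name}: ~{size_gb:.0f} GB")  # ~54 GB and ~810 GB respectively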

sigotirandolas · 22m ago
I'm not convinced by this argument. You can fit a bunch of books covering up to MSc level maths on less than 100MB. After that point, more books will mostly be redundant information so it doesn't need much more space for maths beyond that.

Similarly TBs of Twitter/Reddit/HN add near zero new information per comment.

If anything you can fit an enormous amount of information in 1MB - we just don't need to do it because storage is cheap.

exe34 · 13h ago
I suspect human memory consists of a lot more bits than LLMs encode.
autobodie · 13h ago
I rest my case — the question concerns a quality, not a quantity. These juvenile comparisons are mere excuses.
exe34 · 3h ago
Oh we've shifted the goal post to quality now, very good! That does rest the case.
starchild3001 · 13h ago
We built planes—critics said they weren't birds. We built submarines—critics said they weren't fish. Progress moves forward regardless.

You have a choice: master these transformative tools and harness their potential, or risk being left behind by those who do.

Pro tip: Endless negativity from the same voices won't help you adapt to what's coming—learning will.

sponnath · 10h ago
Toxic positivity is also not good.
hellojimbo · 6h ago
The only real point is number 5.

> Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations

This is basically agents which is literally what everyone has been talking about for the past year lol.

> (Importantly, the point of the Apple paper goal was to see how LRM’s unaided explore a space of solutions via reasoning and backtracking, not see how well it could use preexisting code retrieved from the web.

This is a false dichotomy. The thing that Apple tested was dumb, and downloading code from the internet is also dumb. What would've been interesting is: given the problem, would a reasoning agent know how to solve it with access to a coding env?

> Do LLM’s conceptually understand Hanoi?

Yes, and the paper didn't test for this. The paper basically tested the equivalent of asking whether a human can do Hanoi in their head.

I feel like what the author is advocating for is basically a neural net that can send instructions to an ALU/CPU, but I haven't seen anything promising that shows it's better than just giving an agent access to a terminal.

skywhopper · 16h ago
The quote from the Salesforce paper is important: “agents displayed near-zero confidentiality awareness”.
Illniyar · 11h ago
I find it weird that people are taking the original paper to be some kind of indictment of LLMs. It's not like LLMs failing at the Tower of Hanoi problem at higher disk counts is new; the paper reused an evaluation method that had been done before.

It was simply comparing the effectiveness of reasoning and non reasoning models on the same problem.

hiddencost · 16h ago
Why do we keep posting stuff from Gary? He's been wrong for decades but he keeps writing this stuff.

As far as I can tell he's the person that people reach for when they want to justify their beliefs. But surely being this wrong for this long should eventually lead to losing one's status as an expert.

jakewins · 16h ago
I thought this article seemed like well articulated criticism of the hype cycle - can you be more specific what you mean? Are the results in the Apple paper incorrect?
astrange · 16h ago
Gary Marcus always, always says AI doesn't actually work - it's his whole thing. If he's posted a correct argument it's a coincidence. I remember seeing him claim real long-time AI researchers like David Chapman (who's a critic himself) were wrong anytime they say anything positive.

(em-dash avoided to look less AI)

Of course, the main issue with the field is the critics /should/ be correct. Like, LLMs shouldn't work and nobody knows why they work. But they do anyway.

So you end up with critics complaining it's "just a parrot" and then patting themselves on the back, as if inventing a parrot isn't supposed to be impressive somehow.

foldr · 15h ago
I don’t read GM as saying that LLMs “don’t work” in a practical sense. He acknowledges that they have useful applications. Indeed, if they didn’t work at all, why would he be advocating for regulating their use? He just doesn’t think they’re close to AGI.
kadushka · 15h ago
The funny thing is, if you asked “what is AGI” 5 years ago, most people would describe something like o3.
foldr · 15h ago
Even Sam Altman thinks we’re not at AGI yet (although of course it’s coming “soon”).
kadushka · 13h ago
Marcus has been consistently wrong over the many years predicting the (lack of) progress of current deep learning methods. Altman has been correct so far.
foldr · 5h ago
Marcus has made some good predictions and some bad ones. That’s usually the way with people who make specific predictions — there are no prophets.

Not sure I’d agree that SA has been any more consistently right. You can easily find examples of overconfidence from him (though he rarely says anything specific enough to count as a prediction).

barrkel · 16h ago
You need to read everything that Gary writes with his particular axe to grind in mind: neurosymbolic AI. That's his specialism, and he essentially has a chip on his shoulder about the attention probabilistic approaches like LLMs are getting, and their relative success.

You can see this in this article too.

The real question you should be asking is whether there is a practical limitation in LLMs and LRMs revealed by the Tower of Hanoi problem, given that any SOTA model can write code to solve the problem and thereby solve it with tool use. Gary frames this as neurosymbolic, but I think it's a bit of a fudge.

krackers · 15h ago
Hasn't the symbolic vs statistical split in AI existed for a long time? With things like Cyc growing out of the former. I'm not too familiar with linguistics but maybe this extends there too, since I think Chomsky was heavy on formal grammars over probabilistic models [1].

Must be some sort of cognitive sunk cost fallacy, after dedicating your life to one sect, it must be emotionally hard to see the other "keep winning". Of course you'd root for them to fall.

[1] https://norvig.com/chomsky.html

charcircuit · 13h ago
> with tool use

An LLM with tool use can solve anything. It is interesting to try to measure its capabilities without tools.

barrkel · 2h ago
I don't think the first is true at all, unless you imagine some powerful oracle tools.

I think the second is interesting for comparing models, but not interesting for determining the limits of what models can automate in practice.

It's the prospect of automating labour which makes AI exciting and revolutionary, not their ability when arbitrarily restricted.

NoahZuniga · 16h ago
None of the arguments presented in this piece depend on his authority as an expert, so this is largely irrelevant.
mountainriver · 15h ago
It’s insane, he doesn’t know what a test train split is but he’s an AI expert? Is this where we are?
marvinborner · 15h ago
Is this supposed to be a joke reflecting point (3)?
brcmthrowaway · 14h ago
In classic ML, you never evaluate against data that was in the training set. With LLMs, everything is in the training set. Doesn't this seem wrong?
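
For anyone unfamiliar with the term, a minimal sketch of the classic-ML practice being referenced, using scikit-learn's train_test_split so the model is only scored on data it never saw during training:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    # Hold out 20% of the data; the model is never fit on it.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy measured only on held-out data
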
avsteele · 16h ago
This doesn't rebut anything from the best critique of the Apple paper.

https://arxiv.org/abs/2506.09250

Jabbles · 16h ago
Those are points (2) and (5).
foldr · 16h ago
It does rebut point (1) of the abstract. Perhaps not convincingly, in your view, but it does directly address this kind of response.
avsteele · 15h ago
Papers make specific conclusions based on specific data. The paper I linked specifically rebuts the conclusions of the paper. Gary makes vague statements that could be interpreted as being related.

It is scientific malpractice to write a post supposedly rebutting responses to a paper and not directly address the most salient one.

foldr · 15h ago
This sort of omission would not be considered scientific malpractice even in a journal article, let alone a blog post. A rebuttal of a position that fails to address the strongest arguments for it is a bad rebuttal, but it’s not scientific malpractice to write a bad paper — let alone a bad blog post.

I don’t think I agree with you that GM isn’t addressing the points in the paper you link. But in any case, you’re not doing your argument any favors by throwing in wild accusations of malpractice.

avsteele · 15h ago
Malpractice is slightly hyperbolic, granted.

But anybody relying on Gary's posts in order to be informed on this subject is being misled. This isn't an isolated incident either.

People need to be made aware that when you read him it is mere punditry, not substantive engagement with the literature.

spookie · 15h ago
A paper citing arxiv papers and x.com doesn't pass my smell test tbh
revskill · 8h ago
I'm shorting Apple.
akomtu · 9h ago
It's easy to check if a black-box AI can reason: give it a checkerboard pattern, or something more complex, and see if it can come up with a compact formula that generates the pattern. You can't bullshit your way through this problem, and it's easy to verify the answer, yet none of these so-called researchers attempt to do this.
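
A toy version of the proposed test, just to make it concrete (the pattern and candidate formula below are placeholder examples, not from any paper): show the model a grid, ask for a compact rule, and mechanically verify the rule against the grid.

    import numpy as np

    size = 8
    grid = np.indices((size, size)).sum(axis=0) % 2   # the checkerboard to be explained

    candidate_rule = lambda x, y: (x + y) % 2         # the "compact formula" under test

    xs, ys = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    print(bool(np.all(candidate_rule(xs, ys) == grid)))  # True: the rule reproduces the pattern
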
baxtr · 13h ago
The last paragraph:

>Talk about convergence evidence. Taking the SalesForce report together with the Apple paper, it’s clear the current tech is not to be trusted.

bowsamic · 15h ago
This doesn't address the primary issue: they had no methodology for choosing puzzles that weren't in the training set. Indeed, while they claimed to have chosen puzzles that aren't, they didn't explain why they think that. The whole point of the paper was to test LLM reasoning on untrained cases, but there's no reason to expect such puzzles not to be part of the training set, and if you have no way of telling whether they are or not, your paper is not going to work out.
roywiggins · 15h ago
Isn't it worse for LLMs if an LLM that has been trained on the Towers of Hanoi still can't solve it reliably?
bowsamic · 7h ago
Yes
anonthrowawy · 15h ago
how could you prove that?
bowsamic · 7h ago
You couldn’t, so such a paper cannot be scientific

(Or it should not be based on that claim as a central point, which apples paper was)

mentalgear · 15h ago
AI hype-bros like to complain that real AI experts are too concerned with debunking current AI rather than improving it - but the truth is that debunking bad AI IS improving AI. Science is a process of trial and error which only works by continuously questioning the current state.
dang · 14h ago
Can you please make your substantive points without name-calling or swipes? This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

No comments yet

neepi · 15h ago
Indeed. I completely agree with this.

My objection to the whole thing is that the AI hype-bro stuff - really a funding-solicitation facade laid over everything, rather than the truth - only has one outcome: it cannot be sustained. At that point all investor confidence disappears, the money is gone, and everyone loses access to the tools they suddenly built all their dependencies on, because it's all based on a proprietary service model.

Which is why I am not poking it with a 10-foot-long shitty stick any time in the near future. The failure mode scares me, not the technology, which arguably does have some use in non-idiot hands.

wongarsu · 14h ago
A lot of the best internet services came around in the decade after the dot-com crash. There is a chance Anthropic or OpenAI may not survive when funding suddenly dries up, but existing open weight models won't be majorly impacted. There will always be someone willing to host DeepSeek for you if you're willing to pay.

And while it will be sad to see model improvements slow down when the bubble bursts, there is a lot of untapped potential in the models we already have, especially as they become cheaper and easier to run.

neepi · 14h ago
Someone might host DeepSeek for you but you'll pay through the nose for it and it'll be frozen in time because the training cost doesn't have the revenue to keep the ball rolling.

I'm not sure the GPU market won't collapse with it either. Possibly taking out a chunk of TSMC in the process, which will then have knock on effects across the whole industry.

wongarsu · 14h ago
There are already inference providers like DeepInfra or inference.net whose entire business model is hosted inference of open-source models. They promise not to keep or use any of the data and their business model has no scaling effects, so I assume they are already charging a fair market rate where the price covers the costs and returns a profit.

The GPU market will probably take a hit. But the flip side of that is that the market will be flooded with second-hand enterprise-grade GPUs. And if Nvidia needs sales from consumer GPUs again we might see more attractive prices and configurations there too. In the short term a market shock might be great for hobby-scale inference, and maybe even training (at the 7B scale). In the long term it will hurt, but if all else fails we still have AMD who are somehow barely invested in this AI boom

xoac · 15h ago
Yeah this is history repeating. See for example the less-known “Dreyfus affair” at MIT and the brilliantly titled book “What Computers Can't Do” and its sequel “What Computers Still Can't Do”.
3abiton · 14h ago
To hammer one point though: you have to understand that researchers are desensitized to minor novel improvements that translate into great value as products. While obviously studying and assessing the limitations of AI is crucial, to the general public its capabilities are just so amazing that they can't fathom why we should think about limitations. Optimizing what we have is better than rethinking the whole process.
bobxmax · 15h ago
> AI hype-bros like to complain that real AI experts are too concerned with debunking current AI rather than improving it

You're acting like this is a common occurrence lol