The latest AI scaling graph – and why it hardly makes sense

30 nsoonhui 17 5/4/2025, 7:01:29 AM garymarcus.substack.com ↗

Comments (17)

Sharlin · 7h ago
> Unfortunately, literally none of the tweets we saw even considered the possibility that a problematic graph specific to software tasks might not generalize to literally all other aspects of cognition.

Why am I not surprised?

yorwba · 7h ago
> you could probably put together one reasonable collection of word counting and question answering tasks with average human time of 30 seconds and another collection with an average human time of 20 minutes where GPT-4 would hit 50% accuracy on each.

So do this and pick the one where humans do best. I doubt that doing so would show all progress to be illusory.

But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.
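
A minimal sketch of the selection effect described above, using hypothetical task records rather than any real benchmark data:

```python
# Hypothetical (minutes, solved) records: estimated human completion time and
# whether the model solved the task. Illustrative only, not METR's data.
from statistics import mean

tasks = [
    (0.5, 1), (0.5, 0), (1.0, 1),                 # "30-second-ish" tasks
    (15.0, 1), (20.0, 0), (25.0, 1), (30.0, 0),   # "~20-minute" tasks
]

def success_rate(tasks, lo, hi):
    """Model success rate on tasks whose human time falls in [lo, hi) minutes."""
    bucket = [solved for minutes, solved in tasks if lo <= minutes < hi]
    return mean(bucket) if bucket else None

print("short tasks:", success_rate(tasks, 0, 2))    # ~0.67 on this toy data
print("long tasks:", success_rate(tasks, 10, 60))   # 0.5 on this toy data
```

By curating which tasks land in each collection, both buckets can be pushed toward 50%, which is the point: the implied "time horizon" depends heavily on the task set.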

K0balt · 1h ago
The problem, really, is human cognitive dissonance. We draw the false conclusion that competence at some tasks implies competence at others. It's not a universal human problem: we intuit that a front-end loader, just because it can dig really well, is not therefore good at all other tasks. But when it comes to cognition, our models break down quickly.

I suspect this is because our proxies are predicated on a task set that inherently includes the physical world, which at some level connects all tasks and creates links between capabilities that generally pervade our environment. LLMs do not exist in this physical world, and are therefore not within the set of things that can be reasoned about with those proxies.

This will probably gradually change with robotics, as the competencies required to exist and function in the physical world will (I postulate) generalize to other tasks in such a way that it more closely matches the pattern that our assumptions are based on.

Of course, if we segregate intelligence into isolated modules for motility and cognition, this will not be the case, as we won't be taking advantage of that generalization. I think that would be a big mistake, especially in light of the hypothesis that the massive leap in LLM capabilities came more from training on things we weren't specifically trying to achieve: the bulk of seemingly irrelevant data that turned simple language processing into reasoning and world modeling.

xg15 · 6h ago
> But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.

Still "Count the R's" apparently.

hatefulmoron · 8h ago
I had assumed that the Y axis corresponded to some measurement of the LLM's ability to actually work on/mull over a task in a loop while making progress. In other words, I thought it meant something like "you can leave Sonnet 3.7 alone for a whole hour and it will meaningfully progress on a problem", but the reality is less impressive. Serves me right for not looking at the fine print.
ReptileMan · 8h ago
The demand among a fraction of Bay Area intellectuals for AI disasters and the doom of humanity far outstrips supply. The recent fanfic by Scott Alexander and other similar "thinkers" is also worth checking out for a chuckle: https://ai-2027.com/
tomhow · 3h ago
Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

Please don't fulminate. Please don't sneer...

Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

Eschew flamebait. Avoid generic tangents. Omit internet tropes.

https://news.ycombinator.com/newsguidelines.html

ben_w · 7h ago
AI is software.

As software gets more reliable, people come to trust it.

Software still has bugs, and that trust means those bugs still get people killed.

That was true of things we wouldn't call AI anymore, and it's still true of things we do.

It doesn't need to take over or anything when humans are literally asleep at the wheel because they mistakenly think the AI can drive the car for them.

Heck, even building codes and health & safety rules are written in blood. Why would AI be the exception?

clauderoux · 6h ago
As Linus Torvalds said in an interview recently, humans don't need AI to make bugs.
okthrowman283 · 8h ago
To be fair, though, the author of AI 2027 has been prescient in his previous predictions.
dist-epoch · 7h ago
Turkey fallacy.

The apocalypse will only happen once. Just like global nuclear war.

The fact that there hasn't been a global nuclear war so far doesn't mean that everyone fearing nuclear war is crazy and irrational.

pvg · 1h ago
Entire cities have been destroyed by nuclear bombs, and the fallout from nuclear weapons testing is measurable in everything around us. The risks are not even qualitatively comparable.
ReptileMan · 7h ago
No. It just means they are stupid in the way only extremely intelligent people could be.
Sharlin · 7h ago
People being afraid of a nuclear war are stupid in a way only extremely intelligent people can be? Was that just something that sounded witty in your mind?

Nivge · 8h ago
TL;DR: the benchmark depends on its specific dataset, and it isn't a perfect way to evaluate AI progress. That doesn't mean it doesn't make sense or doesn't have value.
dist-epoch · 7h ago
> Abject failure on a task that many adults could solve in a minute

Maybe the author should check whether the information in the post is already outdated before pressing "Publish".

ChatGPT passed the image generation test mentioned: https://chatgpt.com/share/68171e2a-5334-8006-8d6e-dd693f2cec...

frotaur · 7h ago
Even setting aside that this image is just an illustration and really not the main point of the article: in the chat you posted, ChatGPT actually failed again, because the r's are not circled.