Has anyone come up with a definition of AGI where humans are near-universally capable of GI? These articles seem to be slowly pushing the boundaries past the point where slower humans are disbarred from intelligence.
Many years ago I bumped into Towers of Hanoi in a computer game and failed to solve it algorithmically, so I suppose I'm lucky I only work a knowledge job rather than an intelligence-based one.
parodysbird · 6h ago
The original Turing Test was one of the more interesting standards... An expert judge talks with two subjects in order to determine which is the human: one is a human who knows the point of the test, and the other is a machine trying to leave the judge no better than a coin flip at correctly choosing who is human. Allow for many judges, experience with the test, and so on.
The brilliance of the test, which was strangely lost on Turing, is that the test is doubtful to be passed with any enduring consistency. Intelligence is actually more of a social description. Solving puzzles, playing tricky games, etc is only intelligent if we agree that the actor involved faces normal human constraints or more. We don't actually think machines fulfill that (they obviously do not, that's why we build them: to overcome our own constraints), and so this is why calculating logarithms or playing chess ultimately do not end up counting as actual intelligence when a machine does them.
cardanome · 5h ago
People confuse performance and internal representation.
A simple calculator is vastly better at adding numbers than any human. A chess engine will rival any human grandmaster. No one would say that this got us closer to AGI.
We could absolutely see LLMs that produce poetry that humans cannot tell apart from human-made poetry, or even prefer to it. We could have LLMs that are perfectly able to convince humans that they have consciousness and emotions.
Would we have achieved AGI then? Does that mean those LLMs have gotten consciousness and emotions? No.
The question of consciousness is based on what is going on on the inside, how the reasoning is happening, and not the output. In fact, the first AGI might perform significantly worse at most tasks than current LLMs.
LLMs are extremely impressive, but they are not thinking. They do not have consciousness. It might be technically impossible for them to develop anything like that, or at least it would require significantly bigger models.
> where slower humans are disbarred from intelligence
Humans have value for being humans, whether they are slow or fast at thinking, whether they are neurodivergent or neurotypical. We all have feelings, we are all capable of suffering, we are all alive.
See also the problems with AI Welfare research: https://substack.com/home/post/p-165615548
The best argument I've heard for why LLMs aren't there yet is that they don't have a real world model. They only interact with text and images, and not with the real world. They have no concept of the real world, and therefore also no real concept of truth. They learn by interacting with text, not with the world.
I don't know if that argument is true, but it does make some sense.
In fact, I think you might argue that modern chess engines might have more of a world model (although an extremely limited one): they interact with the chess game. They learn not merely by studying the rules, but by playing the game millions of times. Of course that's only the "world" of the chess game, but it's something, and as a result, they know what works in chess. They have a concept of truth within the chess rules. Which is super limited of course, but it might be more than what LLMs have.
lostmsu · 29m ago
It doesn't make any sense. You aren't interacting with neutrinos either. Nothing really beyond some local excitations of electric fields and EM waves in a certain frequency range.
saberience · 5h ago
The problem with your argument is the idea that there is this special thing called "consciousness" that humans have and AI "doesn't".
Philosophers, scientists, thinkers have been trying to define "consciousness" for 100+ years at this point and no one has managed to either a) define it, or b) find ways to test for it.
Saying we have "consciousness" and AI "doesn't" is like saying we have a soul, a ghost in the machine, and AI doesn't. Do we really have a ghost in the machine? Or are we really just a big deterministic machine that we don't fully understand yet, rather like AI?
So before you assert that we are "conscious", you should first define what you mean by that term and how we test for it conclusively.
staticman2 · 2h ago
Before you assert nobody has defined consciousness you should maybe consult the dictionary?
saberience · 1h ago
Are you trying to misunderstand me purposefully?
I'm talking about a precise, technical, scientific definition, that scientists all agree on, and which doesn't rely on the definitions of other words, and can also be reliably tested.
There has been constant debate about what consciousness means among scientists, philosophers, psychologists for as long as the word has existed. And there has never been any consistent and agreed upon test for consciousness.
The Google definition of consciousness is: "the state of being aware of and responsive to one's surroundings."
By that definition, a Tesla self driving car is conscious, it is aware of and responsive to its surroundings...
staticman2 · 35m ago
If you meant a scientific definition why do you keep mentioning philosophy?
An LLM tells me references such as Oxford Dictionary of Science probably include a definition of consciousness but I suppose that would be behind a pay wall so I can't verify it.
Of course you are demanding one that "scientists all agree on" which is an impossibly high bar so I don't think anyone is going to meet you there.
542354234235 · 4h ago
>The question of consciousness is based on what is going on on the inside, how the reasoning is happening, and not the output.
But we don’t really understand how the reasoning is happening in humans. Tests show that our subconscious, completely outside our conscious understanding, makes decisions before we perceive that we consciously decide something [1]. Our consciousness is the output, but we don’t really know what is running in the subconscious. If something looked at it from an outside perspective, would they say that it was just unconscious programming, giving the appearance of conscious reasoning?
I’m not saying LLMs are conscious. But since we don’t really know what gives us the feeling of consciousness, and we didn’t build and don’t understand the underlying “programming”, it is hard to actually judge a non-organic mind that claims the feeling of consciousness. If you found out today that you were actually a computer program, would you say you weren’t conscious? Would you be able to convince “real” people that you were conscious?
[1] https://qz.com/1569158/neuroscientists-read-unconscious-brai...
My point was that we can't prove that LLMs have consciousness. Yes, the reverse is also true. It is possible that we wouldn't really be able to tell if an AI gained consciousness, as that might look very different from what we expect.
An important standard for any scientific theory or hypothesis is that it be falsifiable. Good old Russell's teapot: we can't disprove that a teapot, too small to be seen by telescopes, orbits the Sun somewhere in space between the Earth and Mars. So should we assume it is true? No, the burden of proof lies on those who make the claim.
So yes, I could not 100 percent prove that certain LLM's don't show signs of consciousness, but that is reversing the burden of proof. Those that make the claims that LLM's are capable of suffering, that they show signs of consciousness need to deliver. If they can't, it is reasonable to assume they are full of shit.
People here accuse me of being scholastic and too philosophical, but the reverse is true. Yes, we barely know how human brains work and how consciousness evolved, but whoever doesn't see the qualitative difference between a human being and an LLM really needs to touch grass.
Tadpole9181 · 3h ago
In one breath: scientific rigor is required of your opposition.
In the next breath: "anyone who disagrees with me is a loser."
> Those that make the claims that LLM's are capable of suffering, that they show signs of consciousness need to deliver. If they can't, it is reasonable to assume they are full of shit.
Replace LLM with any marginalized group. Black people, Jews, etc. I can easily just use this to excuse any heinous crime I want - because you cannot prove that you aren't a philosophical zombie to me.
Defaulting to cruelty in the face of unfalsifiability is absurd.
suddenlybananas · 3h ago
>Replace LLM with any marginalized group. Black people, Jews, etc. I can easily just use this to excuse any heinous crime I want - because you cannot prove that you aren't a philosophical zombie to me.
This is so flatly ridiculous an analogy that it becomes racist itself. Maybe the bread I eat is conscious and feels pain (the ancient Manichaeans thought so!). Are you now going to refrain from eating bread in case it causes suffering? You can't prove bread doesn't feel pain, you might be "defaulting to cruelty"!
_aavaa_ · 5h ago
Consciousness is irrelevant to discussions of intelligence (much less AGI) unless you pick a circular definition for both.
This is “how many angels dance on the head of a pin” territory.
cwillu · 5h ago
I would absolutely say both the calculator and strong chess engines brought us closer.
virgilp · 5h ago
> Does that mean those LLMs have gotten consciousness and emotions? No.
Is this a belief statement, or a provable one?
Lerc · 4h ago
I think it is clearly true that it doesn't show that they have consciousness and emotions.
The problem is that people assume that failing to show that they do means that they don't.
It's very hard to show that something doesn't have consciousness. Try and conclusively prove that a rock does not have consciousness.
virgilp · 4h ago
The problem with consciousness is kinda the same as the problem with AGI. Trying to prove that someone/something has or does not have consciousness is largely, as a commenter said, the same as debating whether or not it has plipnikop, i.e. something that is not well defined or understood, and that may mean different things to different people.
I think it's even hard to conclusively prove that LLMs don't have any emotions. They can definitely express emotions (typically don't, but largely because they're trained/tuned to avoid the expression of emotions). Now, are those fake? Maybe, most likely even... but not "clearly" (or provably) so.
Asraelite · 4h ago
Years ago when online discussion around this topic was mostly done by small communities talking about the singularity and such, I felt like there was a pretty clear definition.
Humans are capable of consistently making scientific progress. That means being taught knowledge about the world by their ancestors, performing new experiments, and building upon that knowledge for future generations. Critically, there doesn't seem to be an end to this for the foreseeable future for any field of research. Nobody is predicting that all scientific progress will halt in a few decades because after a certain point it becomes too hard for humans to understand anything, although that probably would eventually become true.
So an AI with at least the same capabilities as a human would be able to do any type of scientific research, including research into AI itself. This is the "general" part: no matter where the research takes it, it must always be able to make progress, even if slowly. Once such an AI exists, the singularity begins.
I think the fact that AI is now a real thing with a tangible economic impact has drawn the attention of a lot of people who wouldn't have otherwise cared about the long-term implications for humanity of exponential intelligence growth. The question that's immediately important now is "will this replace my job?" and so the definitions of AGI that people choose to use are shifting more toward definitions that address those questions.
suddenlybananas · 6h ago
Someone who can reliably solve towers of Hanoi with n=4 and who has been told the algorithm should be able to do it with n=6,7,8. Don't forget, these models aren't learning how to do it from scratch the way a child might.
morsecodist · 5h ago
AGI is a marketing term. It has no consistent definition. It's not very useful when trying to reason about AI's capabilities.
badgersnake · 5h ago
Mass Effect?
littlestymaar · 5h ago
> These articles seem to be slowly pushing the boundaries past the point where slower humans are disbarred from intelligence.
It's not really pushing boundaries; a non-trivial number of humans has always been excluded from the definition of “human intelligence” (and with the ageing of the population, this number is only going up), and it makes sense, just like you don't consider blind individuals when you're comparing human sight to other animals'.
James_K · 6h ago
It may genuinely be the case that slower humans are not generally intelligent. But that sounds rather snobbish so it's not an opinion I'd like to express frequently.
I think the complaint made by apple is quite logical though and you mischaracterise it here. The question asked in the Apple study was "if I give you the algorithm that solves a puzzle, can you solve that puzzle?" The answer for most humans should be yes. Indeed, the answer is yes for computers which are not generally intelligent. Models failed to execute the algorithm. This suggests that the models are far inferior to the human mind in terms of their computational ability, which precedes general intelligence if you ask me. It seems to indicate that the models are using more of a "guess and check" approach than actually thinking. (A specifically interesting result was that model performance did not substantially improve between a puzzle with the solution algorithm given, and one where no algorithm was given.)
You can sort of imagine the human mind as the head of a Turing Machine which operates on language tokens, and the goal of an LLM is to imitate the internal logic of that head. This paper seems to demonstrate that they are not very good at doing that. It makes a lot of sense when you think about it, because the models work by consuming their entire input at once where the human mind operates with only a small working memory. A fundamental architectural difference which I suspect is the cause of the collapse noted in the Apple paper.
GrayShade · 6h ago
I think a human will struggle to solve Hanoi using the recursive algorithm for even 6 disks, even given pen and paper.
Does that change if you give them the algorithm description? No. Conversely, the LLMs already know the algorithm, so including it in the prompt makes no difference.
thaumasiotes · 5h ago
> I think a human will struggle to solve Hanoi using the recursive algorithm for even 6 disks, even given pen and paper.
Why? The whole point of the recursive algorithm is that it doesn't matter how many discs you're working with.
The ordinary children's toys that implement the puzzle are essentially always sold with more than 6 discs.
https://www.amazon.com/s?k=towers+of+hanoi
The recursive solution has a stack depth proportional to the number of disks. That's three pieces (two pegs and how many disks to move) of data for each recursive call, so for 6 disks the "stack" will contain up to around 15 values, which is generally higher than an unaided human will be able to track.
In addition, 64-256 moves is quite a lot and I suspect people will generally lose focus before completing them.
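For concreteness, here is the standard recursive algorithm being discussed (a textbook sketch, not the prompt used in the paper). The bookkeeping described above is the call stack: one frame of peg/count values per level, and 2^n - 1 moves in total.

    def hanoi(n, src, dst, via, moves):
        # Move n disks from peg src to peg dst, using via as the spare peg.
        if n == 0:
            return
        hanoi(n - 1, src, via, dst, moves)  # park the n-1 smaller disks on the spare peg
        moves.append((n, src, dst))         # move the largest remaining disk
        hanoi(n - 1, via, dst, src, moves)  # stack the smaller disks back on top

    moves = []
    hanoi(6, "A", "C", "B", moves)
    print(len(moves))  # 63 moves (2**6 - 1); recursion depth peaks at 6 frames

Executing this by hand is mechanical but tedious, which is roughly what the two sides of this subthread disagree about.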
James_K · 3h ago
It's still pretty simple. I think you are really underestimating the ability of humans to do rote activities. Especially since we could give the human a pen and paper in this case and let them write stuff down on it to give parity to the AI which can write words in its output.
It seems pretty clear to me that learning how to go about following an arbitrary set of rules is a part of general intelligence. There are lots of humans who have learned this skill, and many (mostly children) who have not. If the AI has failed to learn this ability during its extensive training, and critically if it cannot be taught this ability as a human could, then it's certainly not "generally intelligent" to anywhere near the human degree.
suddenlybananas · 3h ago
You have to understand these people are fundamentally misanthropic, and think everyone but them is an idiot. That's why they're so impressed by LLMs even when they fail miserably.
thaumasiotes · 4h ago
You should try playing with one of the toys. It's not at all difficult to move 7 of them.
It's not necessary to use a stack. If you have a goal, you can work "top down", with nothing held in memory. All you need to know to begin the move is whether you're moving an odd number of discs (in which case, the first move will be onto the target peg) or an even number (in which case it will be onto the third peg).
GrayShade · 4h ago
Yes, I'm aware of the iterative solution, which is why I explicitly mentioned the recursive one.
They tried to give the algorithm description to the LLMs, but they also used the recursive solution (see page 25 of the paper).
thaumasiotes · 3h ago
What do you think a human using the recursive solution looks like?
If you ask someone how the puzzle works, they're overwhelmingly likely to tell you:
"To move disc 7, first you move discs 1-6 to the third peg, then you move disc 7[, and then you can put discs 1-6 back on top of it]."
This is an explicitly recursive statement of the solution. But the implementation is just that you count down from 7 to 1 while toggling which peg is "the third peg". You can optimize that further by dividing 7 by 2 and taking the remainder, but even if you don't do that, you're using constant space.
What would a human be doing (or thinking) differently, if they were using the recursive algorithm?
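To make the constant-space point concrete, here is a sketch of the standard iterative solution (the smallest-disk-cycling formulation, not literally the mental procedure described above). It tracks only the peg contents and whose turn it is; nothing stack-like is kept.

    def hanoi_iterative(n, src="A", via="B", dst="C"):
        # Pegs hold lists of disks, largest at the bottom, smallest on top.
        pegs = {src: list(range(n, 0, -1)), via: [], dst: []}
        # The smallest disk always cycles through the pegs in one fixed direction,
        # chosen by the parity of n (the odd/even rule mentioned above).
        cycle = [src, dst, via] if n % 2 else [src, via, dst]
        moves, pos = [], 0
        while len(pegs[dst]) < n:
            # Move the smallest disk one step along its cycle.
            nxt = (pos + 1) % 3
            pegs[cycle[nxt]].append(pegs[cycle[pos]].pop())
            moves.append((cycle[pos], cycle[nxt]))
            pos = nxt
            if len(pegs[dst]) == n:
                break
            # Then make the only legal move that doesn't touch the smallest disk.
            a, b = [p for p in cycle if p != cycle[pos]]
            if pegs[a] and (not pegs[b] or pegs[a][-1] < pegs[b][-1]):
                pegs[b].append(pegs[a].pop()); moves.append((a, b))
            else:
                pegs[a].append(pegs[b].pop()); moves.append((b, a))
        return moves

    print(len(hanoi_iterative(7)))  # 127 moves, no recursion, O(1) extra bookkeeping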
robertk · 5h ago
The Apple paper does not look at its own data — the model outputs become short past some thresholds because the models reflectively realize they do not have the context to respond with the steps as requested, and suggest a Python program instead, just as a human would. One of the penalized environments is proven in the literature to be impossible to solve for n>6, something the authors seem unaware of. I consider this and more the definitive rebuttal of the sloppiness of the paper: https://www.alignmentforum.org/posts/5uw26uDdFbFQgKzih/bewar...
suddenlybananas · 3h ago
The n>6 result is to find the shortest solution not any solution. I don't get why people are so butthurt about this paper.
Herring · 5h ago
Apple's tune will completely change the second they get a leading LLM - Look at all the super important and useful things you can do with "Apple General Intelligence"!
jmsdnns · 5h ago
No, it won't. This comment essentially says science doesn't matter for anyone, only whether or not they're leading in marketing.
Lerc · 4h ago
I think it's closer to saying that those who are falling behind declare that the science doesn't matter.
jstanley · 4h ago
I think the comment says science doesn't matter for Apple, specifically.
throwaway287391 · 4h ago
As someone who used to write academic ML papers, it's funny to me that people are treating this academic style paper written by a few Apple researchers as Apple's official company-wide stance, especially given the first author was an intern.
I suppose it's "fair" since it's published on the Apple website with the authors' Apple affiliations, but historically speaking, at least in ML where publication is relatively fast-paced and low-overhead, academic papers by small teams of individual researchers have in no way reflected the opinions of e.g. the executives of a large company. I would not be particularly surprised to see another team of Apple researchers publishing a paper in the coming weeks with the opposite take, for example.
jmsdnns · 4h ago
The author is an intern, but they're also almost done with their PhD. They're not just any intern.
throwaway287391 · 4h ago
That's kind of expected for a research intern -- internships are most commonly done within 1-2 years before graduation. But in any case, the fact that the first author is an intern is just the cherry on top for me -- my comment would be the same modulo the "especially" remark if all the authors were full time research staff.
djoldman · 5h ago
> I would consider this a death blow paper to the current push for using LLMs and LRMs as the basis for AGI.
Anytime I see "Artificial General Intelligence," "AGI," "ASI," etc., I mentally replace it with "something no one has defined meaningfully."
Or the long version:
"something about which no conclusions can be drawn because the proposed definitions lack sufficient precision and completeness."
Or the short versions:
"Skippetyboop," "plipnikop," and "zingybang."
chrsw · 5h ago
One vague definition I see tossed around a lot is "something that can replace almost any human knowledge/white-collar worker".
What does that mean in concrete terms? I'm not sure. Many of these models can already pass bar exams but how many can be lawyers? Probably none. What's missing?
thaumasiotes · 5h ago
> Probably none.
The qualification is unnecessary; we know the answer is "none". There's a steady stream of lawyers getting penalized for submitting LLM output to judges.
chrsw · 4h ago
You're right. I should have said "can ever". Both in terms of permitted to and in terms of have the capacity to. And I'm only referring to current machine learning architectures.
yahoozoo · 5h ago
I know it when I see it
pmarreck · 5h ago
The only definition by which porn, beauty, intelligence, aliveness, and creativity are all known
(seriously, forget this stuff: consider how you would even come up with an algorithm for how creative or beautiful something is)
A line that used to work (back when I was in a part of my life where lines were a thing) was "I can tell you're smart, which means that you can tell I'm smart, because like sees like." Usually got a smile out of 'em.
pmarreck · 5h ago
I like Sundar Pichai's: "Artificial Jagged Intelligence" (AJI)
coffeefirst · 5h ago
“The Messiah.” The believers know it’s coming and will transform the world in ways that don’t even make sense to outsiders. They do not change their mind when it doesn’t happen as foreseen.
pu_pe · 5h ago
The author's main point is that output token constraints should not be the root cause for poor performance in reasoning tests, as in many cases the LLMs did not even come close to exceeding their token budgets before giving up.
While that may be true, do we understand how LLMs behave according to token budget constraints? This might impact much simpler tasks as well. If we give them a task to list the names of all cities in the world according to population, do they spit out a python script if we give them a 4k output token budget but a full list if we give them 100k?
gjm11 · 5h ago
This rebuttal-of-a-rebuttal looks to me as if it gets one (fairly important) thing right but pretty much everything else wrong. (Not all in the same direction; the rebuttal^2 fails to point out what seems to me to be the single biggest deficiency in the rebuttal.)
The thing it gets right: the "Illusion of illusion" rebuttal claims that in the original "Illusion of Thinking" paper's version of the Towers of Hanoi problem, "The authors’ evaluation format requires outputting the full sequence of moves at each step, leading to quadratic token growth"; this doesn't seem to be true at all, and this "Beyond Token Limits" rebuttal^2 is correct to point it out.
(This implies, in particular, that there's something fishy in the IoI rebuttal's little table showing where 5(2^n-1)^2 exceeds the token budget, which they claim explains the alleged "collapse" at roughly those points.)
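As a quick sanity check on the numbers in dispute, here is the move count next to the rebuttal's claimed quadratic token formula (illustrative only; the per-move token cost is a constant of your choosing, the point is the shape of the growth):

    for n in range(5, 13):
        moves = 2**n - 1                  # moves in an optimal n-disk solution
        quadratic = 5 * (2**n - 1)**2     # the IoI rebuttal's claimed token requirement
        print(f"n={n:2d}  moves={moves:5d}  5*(2^n-1)^2={quadratic:12,d}")

Output that grows linearly in the number of moves stays modest at the n where the original paper reports collapse; it's the quadratic formula that blows through the token budgets, which is why it matters whether that formula is right.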
Things it gets wrong:
"The rebuttal conflates solution length with computational difficulty". This is just flatly false. The IoI rebuttal explicitly makes pretty much the same points as the BTL rebuttal^2 does here.
"The rebuttal paper’s own data contradicts its thesis. Its own data shows that models can generate long sequences when they choose to, but in the findings of the original Apple paper, it finds that models systematically choose NOT to generate longer reasoning traces on harder problems, effectively just giving up." I don't see anything in the rebuttal that "shows that models can generate long sequences when they choose to". What the rebuttal finds is that (specifically for the ToH problem) if you allow the models to answer by describing the procedure rather than enumerating all its steps, they can do it. The original paper didn't allow them to do this. There's no contradiction here.
"It instead completely ignores this finding [that once solutions reach a certain level of difficulty the models give up trying to give complete answers] and offers no explanation as to why models would systematically reduce computational effort when faced with harder problems."
The rebuttal doesn't completely ignore this finding. That little table of alleged ToH token counts is precisely targeted at this finding. (It seems like it's wrong, which is important, but the problem here isn't that the paper ignores this issue, it's that it has a mistake that invalidates how it addresses the issue.)
Things that a good rebuttal^2 should point out but this rebuttal completely ignores:
The most glaring one, to me, is that the rebuttal focuses almost entirely on the Tower of Hanoi, where there's a plausible "the only problem is that there aren't enough tokens" issue, and largely ignores the other problems that the original paper also claims to find "collapse" problems with. Maybe token-limit issues are also sufficient explanation for the problems with other models (e.g., if something is effectively only solvable by exhaustive search, then maybe there aren't enough tokens for the model to do that search in) but the rebuttal never actually makes that argument (e.g., by estimating how many tokens are needed to do the relevant exhaustive search).
The rebuttal does point out what, if correct, is a serious problem with the original paper's treatment of the "River Crossing" problem (apparently the problem they asked the AI to solve is literally unsolvable for many of the cases they put to it), but the unsolvability starts at N=6 and the original paper finds that the models were unable to solve the problem starting at N=3.
(Anecdata: I had a go at solving the River Crossing problem for N=3 myself. I made a stupid mistake that stopped me finding a solution and didn't have sufficient patience to track it down. My guess is that if you could spawn many independent copies of me and ask them all to solve it, probably about 2/3 would solve it and 1/3 would screw up in something like the way actual-me did. If I actually needed to solve it for larger N I'd write some code, which I suspect the AI models could do about as well as I could. For what it's worth, I think the amount of text-editor scribbling I did while not solving the puzzle was quite a bit less than the thinking-token limits these models had.)
The rebuttal^2 does complain about the "narrow focus" of the rebuttal, but it means something else by that.
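(If anyone wants to take the "write some code" route mentioned above: a brute-force breadth-first search is short. This is a sketch assuming the jealous-couples / actors-and-agents variant with a boat of capacity 2; the paper's exact constraints may differ.)

    from collections import deque
    from itertools import combinations

    def safe(bank):
        # bank: iterable of (pair_index, role), role is "actor" or "agent".
        actors = {i for i, r in bank if r == "actor"}
        agents = {i for i, r in bank if r == "agent"}
        # No actor may be with another pair's agent unless their own agent is present.
        return all(i in agents or not (agents - {i}) for i in actors)

    def river_crossing(n, boat=2):
        people = frozenset((i, r) for i in range(n) for r in ("actor", "agent"))
        start, goal = (people, "L"), (frozenset(), "R")
        prev = {start: None}
        queue = deque([start])
        while queue:
            state = queue.popleft()
            if state == goal:
                path = []
                while state is not None:
                    path.append(state)
                    state = prev[state]
                return path[::-1]  # states from start to goal
            left, side = state
            here = left if side == "L" else people - left
            for k in range(1, boat + 1):
                for group in combinations(here, k):
                    new_left = left - set(group) if side == "L" else left | set(group)
                    nxt = (new_left, "R" if side == "L" else "L")
                    if nxt in prev:
                        continue
                    if safe(new_left) and safe(people - new_left) and safe(group):
                        prev[nxt] = state
                        queue.append(nxt)
        return None  # no solution for this n and boat size

    solution = river_crossing(3)
    print(len(solution) - 1 if solution else "unsolvable")  # 11 crossings for the classic 3-pair case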
amelius · 6h ago
Is there anything falsifiable in Apple's paper?
low_tech_love · 8h ago
A fundamental problem that we’re still far away from solving is not necessarily that LLMs/LRMs cannot reason the same way that we do (which I guess should be clear by now), but that they might not have to. They generate slop so fast that, if one can benefit a little bit from each output, i.e. if one can find a little bit of use hidden beneath the mountain of meaningless text they’ll create, then this might still be more valuable than preemptively taking the time to create something more meaningful to begin with. I can’t say for sure what the reward system behind LLM use in general is, but given how much money people are willing to spend on models even in their current deeply flawed state, I’d say it’s clear that the time savings are outweighing the mistakes and shallowness.
Take the comment paper, for example. Since Claude Opus is the first author, I’m assuming that the human author took a backseat and let the AI build the reasoning and most of the writing. Unsurprisingly, it is full of errors and contradictions, to a point where it looks like the human author didn’t bother too much to check what was being published. One might say that the human author, in trying to build some reputation by showing that their model could answer a scientific criticism, actually did the opposite: it provided more evidence that its model cannot reason deeply, and maybe hurt their reputation even more.
But the real question is, did they really? How much backlash will they possibly get from submitting this to arxiv without checking? Would that backlash keep them from submitting 10 more papers next week with Claude as the first author? If one weighs the amount of slop one can put out (with a slight benefit) against the bad reputation one gets from it, I cannot say that “human thinking” is actually worth it anymore.
practice9 · 6h ago
The human is a bad co-author here really.
I deployed lots of high-performance, clean, well-documented code generated by Claude or o3. I reviewed it with respect to the requirements, added tests, and so on. Even with that in mind, it allowed me to work 3x faster.
But it required conscious effort on my part to point out issues and inefficiencies on the LLM's part.
It is a collaborative type of work where LLMs shine (even in so-called agentic flows).
iLoveOncall · 7h ago
Mediocre people produce mediocre work. Using AI might make those mediocre people produce even worse work, but I don't think it'll affect competent people who have standards regardless of the available tooling.
If anything the outcome will be good: mediocre people will produce even worse work and will weed themselves out.
Case in point: the author of the rebuttal made basic and obvious mistakes that make his work even easier to dismiss, and no further paper of his will be taken seriously.
Arainach · 7h ago
>mediocre people will produce even worse work and will weed themselves out.
[[Citation needed]]
I don't believe anyone who has experienced working with other people - in the workspace, in school, whatever - believes that people get weeded out for mediocre output.
Muromec · 6h ago
Weeded out to where anyway? Doing some silly thing, like being a cashier or taxi driver?
delusional · 6h ago
You can also be mediocre in a lot of different ways. Some people are mediocre thinkers, but fantastic hype men. Some people are fantastic at thinking, but suck at playing the political games you have to play in an office. Personally I find that I need some of all of those aspects to have success in a project, the amount varies by the work and external collaborators.
Intelligence isn't just one measure you can have less or more of. I thought we figured this out 20 years ago.
drsim · 7h ago
I think the pull will be hard to resist even for competent people.
Like the obesity crisis driven by sugar highs, the overall population will be affected, and overall quality will suffer, at least for a while.
bananapub · 7h ago
> Mediocre people produce mediocre work. Using AI might make those mediocre people produce even worse work, but I don't think it'll affect competent people who have standards regardless of the available tooling.
this is clearly not the case, given:
- mass layoffs in the tech industry to force more use of such things
- extremely strong pressure from management to use it, rarely framed as "please use this tooling as you see fit"
- extremely low quality bars in all sorts of things, e.g. getting your dumb "We wrote a 200 word prompt then stuck that and some web-scraped data into an LLM run by Google/OpenAI/Anthropic" site to the top of hacker news, or most of VC funding in the tech world
- extremely large swathes of (at least) the western power structures not giving a shit about doing anything well, e.g. the entire US Federal government leadership now, or the UK government's endless idiocy about "AI Policy development", lawyers getting caught in court having just not even read the documents they put their name on, etc
- actual strong desire from many people to outsource their toxic plans to "AI", e.g. the US's machine learning probation or sentencing stuff
I don't think any of us are ready for the tsunami of garbage that's going to be thrown in to every facet of our lives, from government policy to sending people to jail to murdering people with robots to spamming open source projects with useless code and bug reports etc etc etc