I wonder if there is any connection between the models producing exaggerated outputs and the litany of exaggerated or overconfident claims that academic media offices and the press have produced from previous studies. Maybe the models, trained on both the studies and the reports on the studies, naturally tend toward the style of attention-seeking reports even when directly provided with the studies.
DoctorOW · 1d ago
This is the same mistake we were seeing in commercial use of AI.
1. "This process is flawed due to human bias"
2. Train AI/ML to make the same decisions with the same outcome
3. "How can there be any flaws in this process? AI is bias-free."
johnea · 1d ago
A repost of a previous comment, but the anthropomorphization of this tech is so off the charts, I feel like I'm going to be repeating it a lot:
One of the most offensive words in the anthropomorphization of LLMs is: hallucinate.
It's not only an anthropomorphism, it's also a euphemism.
A correct interpretation of the word would imply that the LLM has some fantastical vision that it mistakes for reality. What utter bullsh1t.
Let's just use the correct word for this type of output: wrong.
When the LLM generates a sequence of words that may or may not be grammatically correct, but infers a state or conclusion that is not factually correct, let's state what actually happened: the LLM-generated text was WRONG.
It didn't take a trip down Alice's rabbit hole, it just put words together into a stream that inferred a piece of information that was incorrect, it was just WRONG.
The euphemistic aspect of using this word is a greater offense than the anthropomorphism, because it paints some cutesy picture of what happened instead of accurately acknowledging that the s/w generated an incorrect result. It's covering up for the inherent shortcomings of the tech.
dinfinity · 1d ago
Imagine the irony if this article were to exaggerate the claims made in the study itself.
Perhaps saying things like "Most leading chatbots routinely exaggerate science findings" instead of "We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet [...] with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases".
To be fair, the article itself already mentions this: "Summaries by models (1), (4), (8), and (9) didn’t significantly differ in the kind of generalisations they contained from the original text. “So basically,” Peters concludes, “Claude, in different versions, did really well.”"
GiorgioG · 1d ago
I am shocked, I tell you, flabbergasted, that LLMs which cannot truly reason go down the wrong rabbit-hole nearly every time. I've spent a good chunk of the weekend trying to accelerate the development of a small SaaS solution using Cursor, Copilot, etc., even with the latest and greatest Claude Sonnet 4, paying for "Max", and so on, and any meaningfully sized request winds up being a completely frustrating experience of trying to get these tools to stay on the rails. I'm to the point where I will give these fuckers explicit instructions to come up with several hypotheses and potential solutions, and not to generate code before getting my go-ahead, and more often than not it still goes ahead and starts editing code without my approval/direction. As a bonus it will forget a few things we corrected earlier. Can't wait for the first vibe-coder to get sued when there's a massive breach or financial loss. These tools are not ready for prime time and they certainly aren't worthy of the billions being spent on them.
proc0 · 1d ago
Absolutely. They're being hyped to the extreme, with talk of AGI in the next two years, yet I cannot get one of them to generate a jar with no lid. It creates the lid every single time; that's just one example. They succeed at many things, but they fail so hard at such simple things that it casts a huge shadow over the times they do succeed.
Because they work by statistically averaging training data, it makes sense that some things they will have captured correctly, and others they will have completely mixed up and cannot process properly. The problem is that they are literally being sold as reasoning models, as AI that can be just like a junior developer. These comparisons are borderline irresponsible, as they could be driving a bubble that will burst and affect millions of people.
I have, and I still can't get it to stop acting like a drunken compulsive liar with an enormous wealth of information.
mannanj · 1d ago
I've been wondering this as well. Did you use any prompts to help split the work into roles: product, engineering, well-defined tasks?
GiorgioG · 1d ago
I even used Claude to help me formulate my prompts, Cursor rules, etc.
ninetyninenine · 1d ago
I'm sick of people saying LLMs don't reason as if they know unequivocally.
We don't understand what's going on with LLMs. We have evidence of them reasoning successfully. And we have evidence of them failing to reason.
That evidence does NOT logically lead to "LLMs don't reason". There are multiple possibilities here.
1. LLMs can reason, but they can't tell the difference between hallucination and reasoning.
2. LLMs can reason, but they choose to lie.
3. LLMs can't reason; when they get something right, it's pure coincidence.
There is not a single person who can prove or disprove ANY of those 3 points. What most people end up doing is ironically identical to what the LLM does. They hallucinate an answer: "LLMs cannot truly reason."
Think about it. There is NO EVIDENCE or insight into how an LLM works that can even tell us how an LLM arrived at a specific response. We HAVE NOTHING.
Yet why do I see people like this guy everywhere, making claims out of nowhere? Which brings us back full circle: do humans reason? How similar is human hallucination to LLM hallucination?
tombert · 1d ago
Anti-AI people are kind of weird.
I got put on probation on SomethingAwful because I created a thread talking about an AI project I was working on (an Icecast radio station that uses OpenAI to generate DJ chatter and commercials), and everyone acted like I was simultaneously a completely uncreative moron and also somehow stealing work from people I would have hired, as if I were depriving a DJ of a job by not hiring one. I am an unemployed software person who is building something for fun; "hiring a dedicated 24-hour DJ" was never on the table.
In this thread, I noticed a lot of assertions that were just being accepted as axiomatically true, like asserting the AI is worse for the environment than humans doing the equivalent labor (which is not nearly as cut and dry), and that AI can't reason and that anything that involves any AI is inherently "theft". These assertions are completely unqualified and people just eat it up.
I don't know if LLMs "reason" by any consistent definition of the word. They might, they might not, I'm not going to pretend to know, but I find it a little irritating how people just assert that they don't and people just gobble it up.
idiot_slaiyer · 21h ago
OP, you asked what about the energy expended to generate the art. The answer is this: the average human runs on about 2000-2500 kcal of energy a day. If you were to spend 25 hours working on a piece of art, the energy expenditure is between 2000 and 2500 kcal.
If you were to generate one piece of AI art, the energy expended is several magnitudes higher. But if you understood basic mathematics and human biology, you would have understood this going into the discussion.
tombert · 18h ago
I can generate a picture in Stable Diffusion in about two seconds with my GPU. Running a GPU for two seconds doesn't take much energy, almost certainly less than 2500 kcal. Back-of-napkin math for a 250W GPU spending about 4 seconds at full blast says it would take about 1000 joules, which, dividing by 4184 to convert to kcal, is roughly 0.24 kcal. That is literally 10000x less energy than the number you provided me.
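If anyone wants to check the arithmetic, here it is in a few lines of Python, using the assumed numbers from this thread (250 W GPU, 4 seconds at full load, 2000-2500 kcal for 25 hours of human work):

    # Back-of-the-envelope comparison using the assumed figures above.
    JOULES_PER_KCAL = 4184

    gpu_watts = 250      # assumed GPU power draw
    gpu_seconds = 4      # assumed generation time at full load
    gpu_kcal = gpu_watts * gpu_seconds / JOULES_PER_KCAL   # ~0.24 kcal

    human_kcal_low, human_kcal_high = 2000, 2500           # parent's figure for 25 hours of work

    print(f"GPU:   {gpu_kcal:.2f} kcal")
    print(f"Human: {human_kcal_low}-{human_kcal_high} kcal")
    print(f"Ratio: roughly {human_kcal_low / gpu_kcal:,.0f}x to {human_kcal_high / gpu_kcal:,.0f}x")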
Even if my math is off by a bit, let's assume that it's off by two orders of magnitude, it would still only use 1% of the energy compared to a human.
So the basic math you gave me supports what I said. Unless you find an error, which I don't think you will because this math is trivial.
This isn't counting the energy cost of training and gathering the data, which sure might be expensive, but if we assume the models already exist then the energy cost of generating a single image is pretty negligible.
Now, a counterpoint you could make is that since it's so low-effort to generate an image with Stable Diffusion, and since the image might not be very good, you might end up generating thousands of images compared to the one you would have paid a human for. Yes, those numbers get more complicated, but that wasn't what you claimed. You claimed "if you were to generate one piece of AI art, the energy expended is several magnitudes higher", which is trivial to prove wrong with basic math.
But also, counting calories like this is a bad approach anyway, because it assumes that all energy is equally damaging to the environment, which is not true. If the energy used to generate these images comes from a centralized electrical plant, particularly something using solar or nuclear, that is considerably less bad for the environment than a human eating meat. The world beef industry has millions of cows that are all farting and burping and shitting methane that is destroying our climate.
That's what I was getting at when I said the numbers aren't as cut and dry as people keep asserting.
Feel free to check my math and prove me wrong, I'm a grown up, if you find concrete evidence that I'm wrong I'll read it.
ETA:
Also, did you make an account literally just to respond to me? Goons are following me around now?
Another ETA:
I forgot to point out that even if the electricity is coming from fossil fuel plants, which are still annoyingly prevalent in the US, having the energy production centralized means that we're much more easily able to capture pollutants compared to something like a cow fart or a cow burp, which pretty much immediately goes straight into the atmosphere as methane.
Now, an argument you could make is that AI art is shit so any amount of energy expended on it is a waste of energy, and that's a better argument than the stupid and objectively wrong one you made.
GiorgioG · 1d ago
> Anti-AI people are kind of weird.
I'm not anti-AI, I'm anti-shit-that-doesn't-provide-meaningful-value, and for building reliable professional software... they are not as valuable as "they" would have you believe.
ninetyninenine · 1d ago
Yeah it’s annoying af. This is supposed to be a forum where more intelligent people gather but people just don’t get it.
antihipocrat · 1d ago
I don't think the three options provided are exhaustive. For example:
4. LLMs can't reason, when they get something right, it's because the corpus of information used to create the model provides a very high probability that the response to that specific prompt is correct.
ninetyninenine · 1d ago
They're not exhaustive, but add as many options as you want to the ones given. Most of those options can't be proven or disproven. We don't understand what's going on with LLMs.
We've built something from scratch that we can't understand or control. That is what an LLM is.
I don't have to prove that the LLM can't reason. The claims that they can have yet to be proven.
ninetyninenine · 1d ago
There was literally only one claim in this entire thread:
"LLMs can't reason"
No OTHER claim was made. So given that there was only one claim, where does the burden of proof lie? Hint: with the person who made the claim.
staunton · 1d ago
4. We don't actually have a coherent concept of what "reasoning" means in general. Same for "hallucination"...
ninetyninenine · 1d ago
We do. Reasoning is clearly defined. Sentience or consciousness is not.
Reasoning is basically the same as using logic to arrive at a conclusion when given a set of rules and axioms.
Hallucination is producing a conclusion by not following rules or axioms.
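In that narrow sense it's just derivation: given axioms and rules, you mechanically reach a conclusion. A toy example in Lean (nothing LLM-specific, just the definition made concrete):

    -- From an axiom P and a rule P → Q, conclude Q (modus ponens).
    example (P Q : Prop) (hP : P) (rule : P → Q) : Q := rule hP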
staunton · 1d ago
You could go further and say that anything a computation could ever possibly do is a matter of using logic to arrive at a conclusion given just the right rules and axioms. That would mean anything going beyond that (potentially "sentience" or "consciousness", possibly even -- as you seemed to suggest -- "hallucination", though apparently a computer program can do it) would have to be "magic" that's impossible to even describe using logic-based science, let alone simulate by a program. (Could we have science that transcends logic?)
My original claim refers to the mundane situations where people get an LLM output which isn't what they wanted and go "see, it isn't reasoning! That's the problem!". For each such instance (or at least most of them), the "rules and axioms" that would have to be used to arrive at the desired conclusion would be different, sometimes incompatible, and in any case the people demanding "that it reason" wouldn't be able to spell those rules and axioms out, often even given years to do it, at least without allowing them to be obviously contradictory or trivially close to the specific conclusion or output they want. (Incidentally, that's why LLMs are interesting in the first place!)
So sure, you can trick an LLM into failing on a syllogism problem, and that does mean "it can't reason" in a certain narrow sense, but I don't think that's what we're actually talking about when someone is contrasting "reasoning" and "hallucinating", which is what I was originally objecting to. That's what I meant when I said that we don't have a coherent concept of what "reasoning" means in general.
proc0 · 1d ago
You can derive #3 from first principles of how the underlying gradient descent and backpropagation work. They approximate functions, therefore when it gets something wrong it means something went wrong with either the gradient descent or back propagation, or the training data.
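To illustrate what I mean by "approximate functions", here's a toy sketch (plain gradient descent and backpropagation fitting a curve in Python; nothing like a real LLM, just the underlying mechanism in miniature):

    # Tiny tanh network trained by gradient descent + backpropagation
    # to approximate y = sin(x). Purely illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=(256, 1))
    y = np.sin(x)

    W1, b1 = rng.normal(0, 1, (1, 32)), np.zeros(32)
    W2, b2 = rng.normal(0, 0.1, (32, 1)), np.zeros(1)
    lr = 0.1

    for step in range(5000):
        h = np.tanh(x @ W1 + b1)          # forward pass
        pred = h @ W2 + b2
        err = pred - y                    # mean-squared-error gradient (up to a constant)
        dW2 = h.T @ err / len(x); db2 = err.mean(0)
        dh = err @ W2.T * (1 - h ** 2)    # backpropagate through tanh
        dW1 = x.T @ dh / len(x); db1 = dh.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    # Inside the training range the fit gradually improves; far outside
    # it the network still confidently outputs *something*, with no
    # internal signal that the mapping was never captured.
    for xi in (0.5, 2.0, 10.0):
        h = np.tanh(np.array([[xi]]) @ W1 + b1)
        print(xi, float((h @ W2 + b2)[0, 0]), float(np.sin(xi)))

When the training data or the optimization doesn't capture a mapping, the model still produces an output; it just happens to be wrong.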
We're not talking about getting a certain math question wrong (humans make mistakes like this). We're talking about ridiculous mistakes that are completely random and that humans would never make. I would even go as far as to say that calling them mistakes is a stretch. The algorithm simply did not capture some mapping between the prompt and the output answer.
This is not how humans make mistakes. Humans build knowledge in some kind of logical knowledge tree, and mistakes arise from how this mechanism works (which we don't know exactly how, but we can certainly conclude it is completely different than transformers and LLMs). Most humans don't make random mistakes but rather something like logical mistakes.
tl;dr: It's clear to me LLMs make mistakes in a way that exposes their simple underlying mechanism, as opposed to humans, who make mistakes that contain rich layers of logical reasoning.
ninetyninenine · 1d ago
>You can derive #3 from first principles of how the underlying gradient descent and backpropagation work. They approximate functions, therefore when it gets something wrong it means something went wrong with either the gradient descent or back propagation, or the training data.
Then derive it from first principles. Use mathematical notation. You can't even draw this curve... it has so many dimensions. The crazy thing is you started hallucinating to me AFTER you told me you can derive it from first principles. You claimed you can derive it, then proceeded to NOT derive it.
>We're not talking about getting a certain math question wrong (humans make mistakes like this). We're talking about ridiculous mistakes that are completely random and that humans would never make. I would even go as far as to say that calling them mistakes is a stretch. The algorithm simply did not capture some mapping between the prompt and the output answer.
Humans make plenty of ridiculous mistakes. Even so humans can lie. How do you know it's not lying? Again. Prove it.
>This is not how humans make mistakes. Humans build knowledge in some kind of logical knowledge tree, and mistakes arise from how this mechanism works (which we don't know exactly how, but we can certainly conclude it is completely different than transformers and LLMs). Most humans don't make random mistakes but rather something like logical mistakes.
Please derive how humans reason from first principles. I mean this by showing me experimental evidence that shows me the genesis of a signal traveling through the human brain and branching through billions of neurons to produce "reasoning".
Oh you can't? Well it looks like you're just making an approximation here? Possibly an Hallucination. Sound familiar?
>tl;dr: It's clear to me LLMs make mistakes in a way that exposes their simple underlying mechanism, as opposed to humans, who make mistakes that contain rich layers of logical reasoning.
You had to make several assumptions and leaps in creativity to arrive at your conclusion. It's an hallucination through and through.
proc0 · 1d ago
I said "you can" derive, and I could add "in theory" in the casual sense. I don't have access to these large models, or the training or the time to really analyze these large models to an extent that I can derive their behavior mathematically. My point was that we do know a lot about them, and it's enough to conclude that it's completely different than what humans do with their brains, and this is evident from their output.
> Oh you can't? Well it looks like you're just making an approximation here? Possibly an Hallucination. Sound familiar?
I don't think anyone would call a wrong theory a hallucination. This is also one of those stretched-out terms that makes the discussion harder. Hallucination means seeing something that is not there, and LLMs cannot see in this way... but I'm not saying AI is completely incapable of "seeing" like humans do. My claim is only about the current state of LLMs, but I digress. If I come up with a theory of human thought, it will be based on logical reasoning as we know it, namely how the brain processes thought. I'm claiming something very simple which is that human thought is a very different algorithm than how current LLMs work.
I could grant that LLMs might have a component of human thinking, but it's still a stretch to call it reasoning or thinking.
> You had to make several assumptions and leaps in creativity to arrive at your conclusion. It's an hallucination through and through.
Again, it's not hallucination to be wrong when it comes to humans. An average IQ human that understands the world does not "hallucinate" like LLMs. It just doesn't happen unless we play semantics and call any mistake a hallucination, but that's my contention here.
ninetyninenine · 1d ago
>I said "you can" derive, and I could add "in theory" in the casual sense. I don't have access to these large models, or the training or the time to really analyze these large models to an extent that I can derive their behavior mathematically. My point was that we do know a lot about them, and it's enough to conclude that it's completely different than what humans do with their brains, and this is evident from their output.
If it can be done, why hasn't it been done? Why do we have articles like this: https://futurism.com/anthropic-ceo-admits-ai-ignorance? You can do it in theory, but then you have the CEO of Anthropic saying nobody knows shit? You need to go to Anthropic right away and tell them about your revolutionary discovery here.
Or maybe you're just hallucinating. Clearly.
>I don't think anyone would call a wrong theory a hallucination.
You didn't come up with a theory. You made a claim. And you arrived at that claim with NO evidence. Then to back up your claim you made a "theory" as if it were a substitute for evidence. That's an hallucination. You hallucinated, the basis of your hallucination was a "theory", and you're aware of what you did.
>Again, it's not hallucination to be wrong when it comes to humans. An average IQ human that understands the world does not "hallucinate" like LLMs. It just doesn't happen unless we play semantics and call any mistake a hallucination, but that's my contention here.
Making shit up and being wrong is not an hallucination because it comes from humans? I think your entire response is in itself an hallucination.
proc0 · 1d ago
> Making shit up and being wrong is not an hallucination because it comes from humans?
Yes. And no, I am not "hallucinating" this text even if it's wrong. I have a structured pattern of thoughts with beliefs. LLMs don't generate these structured thoughts, a.k.a. logic and reasoning... which, since you mention it, Anthropic did research on:
> This is concerning because it suggests that, should an AI system find hacks, bugs, or shortcuts in a task, we wouldn’t be able to rely on their Chain-of-Thought to check whether they’re cheating or genuinely completing the task at hand.
https://www.anthropic.com/research/reasoning-models-dont-say...
So there is definitely work happening to try to understand how LLMs work, and we are finding out it is very different from the human mind. That isn't to say they are not useful, or that there will never be an AI that does this. The point is that current LLMs are not moving in that direction, but they give the appearance as if they are. They hype it up as if we will have these junior developers, implying we can just interact with them like humans. It's looking more like they are tools that respond to natural language with noticeable limitations, and that is a different framing.
ninetyninenine · 1d ago
Having a structured pattern of thoughts and beliefs doesn’t make that structure correct. LLMs can’t outline their own thoughts in clearly defined structure too.
Let’s see where the gaps in your thinking are: first, the topic of the convo is that we don’t know anything about LLMs, so we can’t make the claim that LLMs can’t reason.
Your response doesn’t have anything to do with that anymore. You’ve gone off topic into hype, what Anthropic is trying to find out about LLMs, and a bunch of other tangents, and have failed to address the topic: LLMs can’t reason.
Even LLMs don’t hallucinate past the given topic.
proc0 · 21h ago
I think I've already given sufficient logic to demonstrate LLMs don't "reason" like humans (whatever it is we want to call "reason", it is not the same).
You keep using "hallucinate" to compare LLMs and humans, but the word itself demonstrates the difference. Human hallucination is about incorrect perception of the world, not generating tokens. These are meaningful differences.
You say "we don't know anything about LLMs" but we do know how to make them, and how they are structured, and Anthropic has been making progress in explaining their side effects.
Your argument is not sound, and would count any form of computation as reasoning, so we could just say our phones and laptops are all doing reasoning as well. After all, we don't know what the human brain is doing, and it probably is a computer; therefore all computers reason. As you can see, that does not sound right.
lostmsu · 1d ago
> you're aware of what you did.
No he is not, as clearly demonstrated by this very thread.
ninetyninenine · 13h ago
I thought it was obvious and he’s being stubborn. I guess you’re right.
tbrownaw · 1d ago
There is zero (legitimate) question about what LLMs do. They're math, not magic.
They highlight fun philosophical / definitional questions like the Chinese room thought experiment, but that's it.
ninetyninenine · 1d ago
And where is your evidence of this? Can you even prove this for a single query/response pair? Can you trace the genesis of the signals travelling through the neural network and prove to me via the structure that no reasoning was performed at all? You can't. You arrived at this claim with NO evidence.
How do you arrive at a claim without evidence? There's a word for it. It's called an Hallucination. Humans, like LLMs, have trouble saying "I don't know."
antihipocrat · 1d ago
Pretty sure the people creating the best models understand how these things work. Plenty of papers explain the process, and you and I could create our own basic model from scratch with enough aptitude and drive.
There is a lot of art involved with the best models. We aren't dealing with determinism with regard to the corpus used to train the model (too much information to curate for accuracy), nor the LLM output (probabilistic by design), nor the prompt input (human language is dynamic and open to multiple interpretations), nor the human subjective assessment of the output.
That there is a product available that manages to produce great results despite these huge challenges is amazing, and this is the part that is not quantifiable - in the sense that the data scientists' decisions about temperature settings and the like are not derived from any fundamental property, but come from an inspired group of people with qualitative instincts aligned with producing great results.
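(For anyone unfamiliar, "temperature" is just a knob that rescales the model's output scores before sampling; a rough sketch of the idea in Python, with toy numbers rather than any particular vendor's API:)

    # "Temperature" rescales logits before they become a sampling
    # distribution: low values are nearly deterministic, high values
    # spread probability onto unlikely tokens.
    import math, random

    logits = {"the": 2.0, "a": 1.2, "banana": -1.0}   # made-up scores

    def probabilities(logits, temperature):
        scaled = {tok: score / temperature for tok, score in logits.items()}
        z = sum(math.exp(v) for v in scaled.values())
        return {tok: math.exp(v) / z for tok, v in scaled.items()}

    def sample(logits, temperature):
        probs = probabilities(logits, temperature)
        return random.choices(list(probs), weights=list(probs.values()))[0]

    print(probabilities(logits, 0.2))   # nearly all mass on "the"
    print(probabilities(logits, 1.5))   # flatter, more surprising picks
    print(sample(logits, 1.0))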
Let me tell you something. You can go online, find a tutorial on how to make an LLM, and actually make one at home. The only thing stopping you from making an OpenAI scale LLM is compute resources.
If you happen to make an LLM from scratch you won't know how it works, and neither do the people at OpenAI or Anthropic.
meroes · 1d ago
After 2 years of RLHF work I don't see reasoning. It's clearly a text-similarity game, plus training of course. The mental model that works best for me is not that I'm interacting with a thing that reasons. That differs from my mental model for interacting with humans. I would be worse at my job if I interacted with LLMs as if they reasoned.
ninetyninenine · 1d ago
Where's the evidence? You have none. Again you made a leap here with ZERO evidence which is equivalent to an hallucination.
mannanj · 1d ago
I think we're just training them.
delichon · 1d ago
“You asked her what color a house was and she said, ‘It’s white on this side.’”
“That’s right.”
“She didn’t assume that the other side was white, too… and a Fair Witness wouldn’t.”
-- Stranger in a Strange Land (1961)
An LLM is an abstraction machine; it mashes together anything that is nearby in a high-dimensional space. Its statistical model is its source of truth. For a Fair Witness AI, reasoning needs to supplant statistics. Which I'm guessing can get weird fast. LLMs are really good at being suggestible. For this we need the opposite.
kjfaejgoaeijhei · 1d ago
That's an extremely interesting observation; it kind of reminds me of some of the traps Bayesians sometimes fall into... (not saying Bayesian reasoning can't be very useful)
proc0 · 1d ago
Studies like this should make it evident that LLMs are not reasoning at all. An AI that would reason like humans would also make mistakes like humans and by now we can all see that LLM mistakes are completely random and nonsensical.
It is also unclear whether the current rate of progress is in a direction that would solve this issue. I think generative AI for images and video will get better, but the reasoning capabilities seem to be in a different domain.
ants_everywhere · 1d ago
> Studies like this should make it evident that LLMs are not reasoning at all. An AI that would reason like humans....
Humans don't reason either. Reasoning is something we do in writing, especially with mathematical and logical notation. Just about everything else that feels like reasoning is something much less.
This has been widely known at least since the stories where Socrates made everybody look like fools. But it's also what the psychological research shows. What people feel like they're doing when they're reasoning is very different from what they're actually doing.
proc0 · 1d ago
Well no, most people can reason without writing or speaking. I can just think and reason about anything. Not sure what you mean.
Reasoning is something like structured thoughts. You have a series of thoughts that build on each other to produce some conclusion (also a thought). If we assume that the brain is a computer, then thoughts and reasoning are implemented on brain software with some kind of algorithm... and I think it's pretty obvious this algorithm is completely different than what happens in LLMs... to the extent that we can safely say it is not reasoning like the brain does.
There is also a semantic argument here: since we don't know exactly what humans are doing, we could stretch the word and use it for AI, but I think this muddies the waters and creates hype that will not deliver what it promises.
ants_everywhere · 1d ago
That's not at all what the brain does though.
What the brain does is closer to activating a bunch of different ideas in parallel. Some of those activations rise to the level of awareness, some don't. Each activation triggers others by common association. And we try to make the best of that thought soup by a combination of reward neurochemicals and emotions.
A human brain is nothing at all like a computer in terms of logic. It's much more like an LLM. That makes sense because LLMs came largely from trying to build artificial versions of biological neural networks. One big difference is that LLMs always think linguistically, whereas language is only a relatively small part of what brains do.
bethekidyouwant · 1d ago
"Most leading chatbots routinely exaggerate science findings" - ah yes, this is completely unique to LLMs.
kjfaejgoaeijhei · 1d ago
To quote the article: "Chatbots were nearly five times more likely to produce broad generalisations than their human counterparts."
The following might be rude and unkind, but at this point frankly necessary: LLM apologists should stop projecting.
ekianjo · 1d ago
> Chatbots were nearly five times more likely to produce broad generalisations than their human counterparts
Never seen a journalist not make broad generalizations about a scientific study, so I am not sure where these numbers are coming from.
It will tell you exactly where the numbers are coming from - the comparison is between (a variety of) LLMs and "expert-written summaries from NEJM Journal Watch".
redcobra762 · 1d ago
What is the human counterpart to an LLM? The article doesn't say.
And "LLM apologists" is so polemical it's hard to take seriously. We get it, you don't like GenAI. That's fine, but can we talk about it without getting normative?
kjfaejgoaeijhei · 1d ago
The article doesn't, but you can go to the "more information" section and keep digging:
"To systematically assess differences between LLM-generated and human-written summaries, we also collected the corresponding expert-written summaries from NEJM Journal Watch (henceforth ‘NEJM JW’)"
Aurornis · 1d ago
Online forums for difficult medical conditions are full of ChatGPT copy-and-paste responses right now. LLMs are the tool of choice for people who want answers and want them now. Much like this article claims, a popular usage is to prompt the LLM to get a more optimistic interpretation about a specific supplement or treatment.
This is a really difficult problem because these people are often very sick and not getting the answers they want from their doctors. Previously this void was filled by alternative medicine doctors and quacks selling supplements. Now ChatGPT has arrived and has convinced a lot of them that they have a super-human AI at their fingertips that can be massaged to produce any answer they want to hear.
It’s painful to try to read some of these forums where threads have turned into endless “here’s what ChatGPT says” pasted walls of text, followed by someone else trying to counter with a different ChatGPT wall of text.
This isn’t unique to LLMs. There is a huge market for grossly exaggerating the conclusions of scientific studies. Podcasters like Huberman and Dr. Rhonda Patrick are famous for taking obscure studies with questionable conclusions and extrapolating to “protocols” or supplement stacks for their fans to follow. I often get downvoted when I mention fan-favorite podcasters by name, but I think by now many listeners have caught on to the way they exaggerate small studies into exciting listening material.
kjfaejgoaeijhei · 1d ago
Definitely interesting how LLMs seem to be very good at reinforcing confirmation bias...
Aurornis · 1d ago
One of my friends went down a path of using ChatGPT to validate his belief that his alternative medicine approach was better than what his doctors recommended.
He had a whole host of tricks to work around ChatGPT’s protections or cautious replies. He’d strip out the cautions because he thought it was just OpenAI’s lawyers forcing disclaimers into the model. If he didn’t get the answer he wanted, he’d just retry or rephrase until he did.
I think the other half of the problem is that LLM users can be very good at pushing the LLM into doing confirmation bias. It’s much easier when you can hit the retry button with no side effects, unlike a human who will recognize what you’re trying to do when you keep asking different variations of a question until you get the answer you want.
Mistletoe · 1d ago
I’m sure they were trained on clickbait articles and press releases from universities, which routinely do the same thing and misinterpret and overstate the importance of the science.
ekianjo · 1d ago
LLMs being trained on a corpus of pitiful science reporting from the mainstream press... this is exactly what you would expect.
gustavus · 1d ago
"Over a year, we collected 4,900 summaries. When we analysed them, we found that six of ten models systematically exaggerated claims they found in the original texts"
So it turns out LLMs trained largely on internet science articles make the same mistakes as science journalists.
more_corn · 1d ago
Also, most leading news outlets routinely exaggerate science findings