I’m not sure we can accept the premise that LLMs haven’t made any breakthroughs. What if people aren’t giving the LLM credit when they get a breakthrough from it?
First time I got good code out of a model, I told my friends and coworkers about it. Not anymore. The way I see it, the model is a service I (or my employer) pays for. Everyone knows it’s a tool that I can use, and nobody expects me to apportion credit for whether specific ideas came from the model or me. I tell people I code with LLMs, but I don’t commit a comment saying “wow, this clever bit came from the model!”
If people are getting actual bombshell breakthroughs from LLMs, maybe they are rationally deciding to use those ideas without mentioning the LLM came up with it first.
Anyway, I still think Gwern’s suggestion of a generic idea-lab trying to churn out insights is neat. Given the resources needed to fund such an effort, I could imagine that a trading shop would be a possible place to develop such a system. Instead of looking for insights generally, you’d be looking for profitable trades. Also, I think you’d do a lot better if you have relevant experts to evaluate the promising ideas, which means that more focused efforts would be more manageable. Not comparing everything to everything, but comparing everything to stuff in the expert’s domain.
If a system like that already exists at Jane Street or something, I doubt they are going to tell us about it.
therealpygon · 5h ago
It is hard to accept the premise because the premise is questionable from the beginning.
Google has already reported several breakthroughs as a direct result of AI, using processes that almost certainly include LLMs, including a new result in math, improved chip designs, etc. DeepMind has AI that predicted millions of protein structures which are already being used in drug development, among many other things they do, though yes, not an LLM per se. There is certainly a possibility that companies won’t announce things, given that direct LLM output isn’t copyrightable/patentable, so a human-in-the-loop solves the issue by claiming the human made said breakthrough with AI/LLM assistance. There isn’t much benefit to announcing how much AI helped with a breakthrough unless you’re basically in the business of selling AI.
As for “why aren’t LLMs creating breakthroughs by themselves regularly”, that answer is pretty obvious… they just don’t really have that capacity in a meaningful way based on how they work. The closest example, Google’s algorithmic breakthrough, absolutely was created by a coding LLM; it was effectively achieved through brute force in a well-established domain, but that doesn’t mean it wasn’t a breakthrough. That alone casts doubt on the underlying premise of the post.
js8 · 3h ago
I would say that the real breakthrough was training NNs as a way to create practical approximators for very complex functions over some kind of many-valued logic. Why they work so well in practice we still don't fully understand theoretically (in the sense that we don't know what kind of underlying logic best models what we want from these systems). LLMs (and their application to natural language) are just a consequence of that.
starlust2 · 2h ago
> through brute force
The same is true of humanity in aggregate. We attribute discoveries to an individual or group of researchers but to claim humans are efficient at novel research is a form of survivorship bias. We ignore the numerous researchers who failed to achieve the same discoveries.
suddenlybananas · 2h ago
The fact some people don't succeed doesn't show that humans operate by brute force. To claim humans reason and invent by brute force is patently absurd.
preciousoo · 2h ago
It’s an absurd statement because you are human and are aware of how research works on an individual level.
Take yourself outside of that, and imagine you invented Earth, added an ecosystem, and some humans. Wheels were invented ~6k years ago, and “humans” have existed for ~40-300k years. We can do the same for other technologies. As a group, we are incredibly inefficient, and an outside observer would see our efforts at building societies, and failing, as “brute force”.
suddenlybananas · 29m ago
If I had less information, I might think something stupid, wow
tmaly · 1h ago
What about Dyson and Alexander Graham Bell?
Yizahi · 3h ago
You are contradicting yourself. Either LLM programs can make breakthroughs on their own, or they don't have that capacity in a meaningful way based on how they work.
PaulHoule · 56m ago
Almost certainly an LLM has, in response to a prompt and through sheer luck, spat out the kernel of an idea that a super-human centaur of the year 2125 would see as groundbreaking, but that hasn't been recognized as such.
We have a thin conception of genius that can be challenged by Edison's "1% inspiration, 99% perspiration" or the process of getting a PhD, where you might spend 7 years getting to the point where you can start adding new knowledge and then take another 7 years to really hit your stride.
I have a friend who is 50-something and disabled with some mental illness; he thinks he has ADHD. We had a conversation recently where he repeatedly expressed his fantasy that he could show up somewhere with his unique perspective, sprinkle some pixie dust on their problems, and be rewarded for it. When I hear his ideas, or any idea, I immediately think "how would we turn this into a product and sell it?" or "write a paper about it?" or "convince people of it?" -- and he would have no part of it, thinking it uninteresting and that somebody else would do all that work. My answer is: they might, if you're willing and able to do a whole lot of advocacy.
And it comes down to that.
If an LLM were to come up with a groundbreaking idea and be recognized as having a groundbreaking idea, it would have to do a sustained amount of work, say at least the equivalent of 2 person-years, to win people over. And they aren't anywhere near equipped to do that, nobody is going to pay the power bill to do that, and if you were paying the power bill you'd probably have to pay it for a million of them going off in the wrong direction.
nico · 5h ago
> but I don’t commit a comment saying “wow, this clever bit came from the model!”
The other day, Claude Code started adding a small signature to the commit messages it was preparing for me. It said something like “This commit was co-written with Claude Code” and a little robot emoji
I wonder if that just happened by accident or if Anthropic is trying to do something like Apple with the “sent from my iPhone”
Thank you. And I guess they are trying to do the Apple thing by making that option true by default
kajumix · 3h ago
Most interesting novel ideas originate at the intersection of multiple disciplines. Profitable trades could be found in the biomedicine sector when the knowledge of biomedicine and finance are combined. That's where I see LLMs shining because they span disciplines way more than any human can. Once we figure out a way to have them combine ideas (similar to how Gwern is suggesting), there will be, I suspect, a flood of novel and interesting ideas, inconceivable with humans.
Yizahi · 6h ago
This is bordering on conspiracy theory. Thousands of people are getting novel breakthroughs generated purely by LLMs and not a single person discloses such a result? Not even one of the countless LLM-corporation engineers who depend on the billion-dollar IV injections from deluded bankers just to keep surviving, and not one has bragged about an LLM pulling off that revolution? Hard to believe.
esafak · 4h ago
Countless people are increasing their productivity and talking about it here ad nauseam. Even researchers are leaning on language models; e.g., https://mathstodon.xyz/@tao/114139125505827565
We haven't resolved famous unsolved research problems through language models yet, but one can imagine that they will solve increasingly challenging problems over time. And if it happens in the hands of a researcher rather than in the model's lab, one can also imagine that the researcher will take credit, so you will still have the same question.
AIPedant · 4h ago
The actual posts totally undermine your point:
> My general sense is that for research-level mathematical tasks at least, current models fluctuate between "genuinely useful with only broad guidance from user" and "only useful after substantial detailed user guidance", with the most powerful models having a greater proportion of answers in the former category. They seem to work particularly well for questions that are so standard that their answers can basically be found in existing sources such as Wikipedia or StackOverflow; but as one moves into increasingly obscure types of questions, the success rate tapers off (though in a somewhat gradual fashion), and the more user guidance (or higher compute resources) one needs to get the LLM output to a usable form. (2/2)
Yizahi · 3h ago
Increasing productivity is nice and commendable, but it is NOT an LLM making a breakthrough on its own, which is the topic of Gwern's article.
dingnuts · 2h ago
There is a LOT of money on this message board trying to convince us of the utility of these machines, and yes, people talk about it ad nauseam, in vague terms that are unlike anything I see in the real world, with few examples.
Show me the code. Show me your finished product.
BizarroLand · 2h ago
I wonder if it's not the LLM making the breakthrough, but rather that the person using the system just needed the information presented in a clear and orderly fashion to make the breakthrough themselves.
After all, the LLM currently has no cognizance; it is unable to understand what it is saying in a meaningful way. At its best it is a p-zombie (philosophical zombie) machine, right?
In my opinion anything amazing that comes from an LLM only becomes amazing when someone who was capable of recognizing the amazingness perceives it, like a rewrite of a zen koan, "If an LLM generates a new work of William Shakespeare, and nobody ever reads it, was anything of value lost?"
HarHarVeryFunny · 4h ago
> Despite impressive capabilities, large language models have yet to produce a genuine breakthrough. The puzzle is why.
I don't see why this is remotely surprising. Despite all the hoopla, LLMs are not AGI or artificial brains - they are predict-next-word language models. By design they are not built for creativity, but rather quite the opposite: they are designed to continue the input in the way best suggested by the training data - they are essentially built for recall, not creativity.
For an AI to be creative it needs to have innate human/brain-like features such as novelty (prediction failure) driven curiosity, boredom, as well as ability to learn continuously. IOW if you want the AI to be creative it needs to be able to learn for itself, not just regurgitate the output of others, and have these innate mechanisms that will cause it to pursue discovery.
tmaly · 1h ago
I think we will see more breakthroughs with an AI/Human hybrid approach.
Yes, LLMs choose probable sequences because they recognize similarity. Because of that, they can diverge from similarity to be creative: increase the temperature. What LLMs don't have is (good) taste—we need to build an artificial tongue and feed it as a prerequisite.
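To make the temperature point concrete, here's a minimal self-contained sketch (no real model, just toy logits standing in for a model's scores) of how temperature reshapes a next-token distribution before sampling: low temperature concentrates mass on the most similar continuation, high temperature flattens it so less probable, more "divergent" tokens get picked.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Softmax over temperature-scaled logits, then sample one index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i, probs
    return len(probs) - 1, probs

# Toy next-token logits: index 0 is the "obvious" continuation.
logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    _, probs = sample_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# At t=0.2 nearly all mass sits on token 0; at t=2.0 the tail tokens become
# plausible, which is the "divergence from similarity" the comment refers to.
```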
HarHarVeryFunny · 4h ago
It depends on what you mean by "creative" - they can recombine fragments of training data (i.e. apply generative rules) in any order - generate the deductive closure of the training set, but that is it.
Without moving beyond LLMs to a more brain-like cognitive architecture, all you can do is squeeze the juice out of the training data, by using RL/etc to bias the generative process (according to reasoning data, good taste or whatever), but you can't move beyond the training data to be truly creative.
awongh · 3h ago
By volume how much of human speech / writing is pattern matching and how much of it is truly original cognition that would pass your bar of creativity? It is probably 90% rote pattern matching.
I don't think LLMs are AGI, but in most senses I don't think people give enough credit to their capabilities.
It's just ironic how human-like the flaws of the system are. (Hallucinations that are asserting untrue facts, just because they are plausible from a pattern matching POV)
HarHarVeryFunny · 2h ago
> It's just ironic how human-like the flaws of the system are. (Hallucinations that are asserting untrue facts, just because they are plausible from a pattern matching POV)
I think most human mistakes are different - not applying a lot of complex logic to come to an incorrect deduction/guess (= LLM hallucination), but rather just shallow recall/guess. e.g. An LLM would guess/hallucinate a capital city by using rules it had learnt about other capital cities - must be famous, large, perhaps have an airport, etc, etc; a human might just use "famous" to guess, or maybe just throw out the name of the only city they can associate to some country/state.
The human would often be aware that they are just guessing, maybe based on not remembering where/how they had learnt this "fact", but to the LLM it's all just statistics and it has no episodic memory (or even coherent training data - it's all sliced and diced into shortish context-length samples) to ground what it knows or does not know.
andoando · 53m ago
What is the distinction between "pattern matching" and "original cognition" exactly?
All human ideas are a combination of previously seen ideas. If you disagree, come up with a truly new conception which is not. -- Badly quoted David Hume
dingnuts · 2h ago
My intuition is opposite yours; due to the insane complexity of the real world nearly 90% of situations are novel and require creativity
OK now we're at an impasse until someone can measure this
HarHarVeryFunny · 2h ago
I think it comes down to how we define creativity for the purpose of this conversation. I would say that 100% of situations and problems are novel to some degree - the real world does not exactly repeat, and your brain at T+10 is not exactly the same as it is as T+20.
That said, I think most everyday situations are similar enough to things we've experienced before that shallow pattern matching is all it takes. The curve in the road we're driving on may not be 100% the same as any curve we've experienced before, but turning the car wheel to the left the way we've learnt to do it will let us successfully navigate it all the same.
Most everyday situations/problems we're faced with are familiar enough that shallow "reactive" behavior is good enough - we rarely have to stop to develop a plan, figure things out, or reason in any complex kind of a way, and very rarely face situations so challenging that any real creativity is needed.
leptons · 2h ago
>It is probably 90% rote pattern matching.
So what. 90% (or more) of humans aren't making any sort of breakthrough in any discipline, either. 99.9999999999% of human speech/writing isn't producing "breakthroughs" either, it's just a way to communicate.
>It's just ironic how human-like the flaws of the system are. (Hallucinations that are asserting untrue facts, just because they are plausible from a pattern matching POV)
The LLM is not "hallucinating". It's just operating as it was designed to do, which often produces results that do not make any sense. I have actually hallucinated, and some of those experiences were profoundly insightful, quite the opposite of what an LLM does when it "hallucinates".
You can call anything a "breakthrough" if you aren't aware of prior art. And LLMs are "trained" on nothing but prior art. If an LLM does make a "breakthrough", then it's because the "breakthrough" was already in the training data. I have no doubt many of these "breakthroughs" will be followed years later by someone finding the actual human-based research that the LLM consumed in its training data, rendering the "breakthrough" not quite as exciting.
vonneumannstan · 4h ago
>It depends on what you mean by "creative" - they can recombine fragments of training data (i.e. apply generative rules) in any order - generate the deductive closure of the training set, but that is it.
> Without moving beyond LLMs to a more brain-like cognitive architecture, all you can do is squeeze the juice out of the training data, by using RL/etc to bias the generative process (according to reasoning data, good taste or whatever), but you can't move beyond the training data to be truly creative.
It's clear these models can actually reason on unseen problems and if you don't believe that you aren't actually following the field.
HarHarVeryFunny · 4h ago
Sure - but only if the unseen problem can be solved via the deductive/generative closure of the training data. And of course this type of "reasoning" is only as good as the RL pre-training it is based on - working well for closed domains like math where verification is easy, and not so well in the more general case.
js8 · 3h ago
Both can be true (and that's why I downvoted you in the other comment, for presenting this as a dichotomy): LLMs can reason and yet "stochastically parrot" the training data.
For example, an LLM might learn a rule that sentences similar to "A is given. From A follows B." are followed by the statement "Therefore, B". This is modus ponens. The LLM can apply this rule to a wide variety of A and B, producing novel statements. Yet these statements are still the statistically probable ones.
I think the problem is, when people say "AI should produce something novel" (or "are producing", depending whether they advocate or dismiss), they are not very clear what the "novel" actually means. Mathematically, it's very easy to produce a never-before-seen theorem; but is it interesting? Probably not.
grey-area · 4h ago
Well, they also don't have understanding, a model of the world, or the ability to reason (no, the chain-of-thought produced by AI companies is not reasoning), as well as having no taste.
So there is quite a lot missing.
fragmede · 3h ago
Define creativity. Three things LLMs can do is write song lyrics, poems, and jokes, all of which require some level of what we think of as human creativity. Of course detractors will say LLM versions of those three aren't very good, and they may even be right, but a twelve year old child coming up with the same would be seen as creative, even if they didn't get significant recognition for it.
HarHarVeryFunny · 2h ago
Sure, but the author of TFA is well versed in LLMs and so is addressing something different. Novelty isn't the same as creativity, especially when limited to generating based on a fixed repertoire of moves.
The term "deductive closure" has been used to describe what LLMs are capable of, and therefore what they are not capable of. They can generate novelty (e.g. new poem) by applying the rules they have learnt in novel ways, but are ultimately restricted by their fixed weights and what was present in the training data, as well as being biased to predict rather than learn (which they anyways can't!) and explore.
An LLM may do a superhuman job of applying what it "knows" to create solutions to novel goals (be that a math olympiad problem, or some type of "creative" output that has been requested, such as a poem), but is unlikely to create a whole new field of math that wasn't hinted at in the training data because it is biased to predict, and anyways doesn't have the ability to learn that would allow it to build a new theory from the ground up one step at a time. Note (for anyone who might claim otherwise) that "in-context learning" is really a misnomer - it's not about learning but rather about using data that is only present in-context rather than having been in the training set.
vonneumannstan · 4h ago
> Despite all the hoopla, LLMs are not AGI or artifical brains - they are predict-next-word language models. By design they are not built for creativity, but rather quite the opposite, they are designed to continue the input in the way best suggested by the training data - they are essentially built for recall, not creativity.
This is just a completely base level of understanding of LLMs. How do you predict the next token with superhuman accuracy? Really think about how that is possible. If you think it's just stochastic parroting you are ngmi.
>large language models have yet to produce a genuine breakthrough. The puzzle is why.
I think you should really update on the fact that world class researchers are surprised by this. They understand something you don't and that is that it's clear these models build robust world models and that text prompts act as probes into those world models. The surprising part is that despite these sophisticated world models we can't seem to get unique insights out which almost surely already exist in those models. Even if all the model is capable of is memorizing text then just the sheer volume it has memorized should yield unique insights, no human can ever hope to hold this much text in their memory and then make connections between it.
It's possible we just lack the prompt creativity to get these insights out but nevertheless there is something strange happening here.
HarHarVeryFunny · 3h ago
> This is just a completely base level of understanding of LLMs. How do you predict the next token with superhuman accuracy? Really think about how that is possible. If you think it's just stochastic parroting you are ngmi.
Yes, thank-you, I do understand how LLMs work. They learn a lot of generative rules from the training data, and will apply them in flexible fashion according to the context patterns they have learnt. You said stochastic parroting, not me.
However, we're not discussing whether LLMs can be superhuman at tasks where they had the necessary training - we're discussing whether they are capable of creativity (and presumably not just the trivially obvious case of being able to apply their generative rules in any order - deductive closure, not stochastic parroting in the dumbest sense of that expression).
zhangjunphy · 10h ago
I also hope we have something like this. But sadly, this is not going to work. The reason is this line from the article, which is so much harder than it looks:
> and a critic model filters the results for genuinely valuable ideas.
In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the performance of the model actually degrades in this process, as the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.
So the reason that we don't hear too much about this idea is not that nobody tried it. But that they tried, and it didn't work, and people are reluctant to publish about something which does not work.
imiric · 9h ago
Exactly.
This not only affects a potential critic model, but the entire concept of a "reasoning" model is based on the same flawed idea—that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops, without actually arriving at anything.
The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.
yorwba · 9h ago
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process. Because empirically it does produce answers that pass the verifier more often if it generates the intermediate steps first.
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
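A rough sketch of the loop described above, with `generate` and `verify` passed in as stand-ins (both hypothetical here, not any particular lab's pipeline): sample several chains of thought per prompt, keep only the ones whose final answer passes the verifier, and use the survivors as training data for the reasoning model.

```python
from typing import Callable, List, Tuple

def collect_reasoning_traces(
    prompts: List[str],
    generate: Callable[[str], Tuple[str, str]],  # returns (chain_of_thought, answer)
    verify: Callable[[str, str], bool],          # checks the answer against the prompt
    samples_per_prompt: int = 8,
) -> List[dict]:
    """Rejection-sample chains of thought and keep only the verified ones."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            cot, answer = generate(prompt)       # base model sampled at some temperature
            if verify(prompt, answer):           # e.g. exact match on a math answer
                kept.append({"prompt": prompt, "chain_of_thought": cot, "answer": answer})
    return kept                                  # becomes fine-tuning data for the "reasoning" model
```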
imiric · 5h ago
Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
danenania · 7h ago
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
imiric · 5h ago
I'm curious: can you link to any tests that prove this?
I don't trust most benchmarks, but if this can be easily confirmed by an apples-to-apples comparison, then I would be inclined to believe it.
danenania · 5h ago
Check out the DeepSeek paper.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
imiric · 4h ago
Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
jacobr1 · 4h ago
It depends on the problem domain and the way you prompt things. Basically, reasoning is better in the cases where using the same model to critique itself over multiple turns would be better.
With code, for example, a single shot without reasoning might hallucinate a package or not conform to the rest of the project's style. Then you ask the LLM to check, then ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically allows it to self-check for those self-correctable features.
When generating content, you can ask it to consider or produce intermediate deliverables like summaries of input documents that it then synthesizes into the whole. With reasoning on, it can do the intermediate steps and then use that.
The main advantage is that the system is autonomously figuring out a bunch of intermediate steps and working through it. Again no better than it probably could do with some guidance on multiple interactions - but that itself is a big productivity benefit. The second gen (or really 1.5 gen) reasoning models also seem to have been trained on enough reasoning traces that they are starting to know about additional factors to consider so the reasoning loop is tighter.
amelius · 8h ago
But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.
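For the code case, a minimal sketch of "hard reality as the critic" using only the standard library: each candidate (here hard-coded strings standing in for LLM output) is written to a temp directory and run against a fixed test script, and only candidates that exit cleanly survive.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

TEST_SCRIPT = """
from candidate import add
assert add(2, 3) == 5
assert add(-1, 1) == 0
print("ok")
"""

def passes_reality_check(candidate_source: str) -> bool:
    """Run the candidate against the tests in an isolated directory."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "candidate.py").write_text(candidate_source)
        Path(tmp, "test_candidate.py").write_text(TEST_SCRIPT)
        result = subprocess.run(
            [sys.executable, "test_candidate.py"],
            cwd=tmp, capture_output=True, timeout=10,
        )
        return result.returncode == 0

# Stand-ins for LLM-generated candidates; only the first survives the critic.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
]
survivors = [c for c in candidates if passes_reality_check(c)]
print(len(survivors), "candidate(s) passed")
```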
Yizahi · 6h ago
Generated code only works because the "test" part (compile/validate/analyze etc.) is completely external and was written before any mass-market LLMs. There is no such external validator for new theorems, books, pictures, text guides etc. You can't just run hard_reality.exe on a generated poem or a scientific paper to deem it "correct". It is only possible with programming languages, and even then not always.
amelius · 6h ago
Science is falsifiable by definition, and writing poems/books is not the kind of problem of interest here.
> There is no such external validator for new theorems
There are formal logic languages that will allow you to do this.
Yizahi · 6h ago
Your proposed approach to science would cover only an extremely tiny subset of math, probably theorems provable by automation. And it is questionable whether those theorems would even be useful. A good mathematician with CS experience can probably write a generator of new useless theorems, something along the lines of "is every sequential cube plus the square of a number divisible by the root of the seventh smallest prime multiplied by log n of that number plus blabla...". One can generate such theorems and formally prove or disprove them, yes.
On the other hand, any novel science usually requires deep and wide exploratory research, often involving hard or flawed experimentation or observation. One can train an LLM on a PhD curriculum in astrophysics, then give that LLM an API to some new observatory and instruct it to "go prove the cosmological constant". And it will do so, but the result will be generated garbage because there is no formal way to prove such results. There is no formal way to prove why the pharaohs decided to stop building pyramids, despite there being some decent theories. This is science too, you know. You can't formally prove that some gene sequence is responsible for trait X, etc.
I would say a majority of science is not formally provable.
And lastly, you dismiss books/texts, but that is a huge chunk of the intellectual and creative work of humans. Say you are an engineer and you have a CAD model with a list of parts and parameters for a rocket, for example. Now you need to write a guide for it. An LLM can do that; it can generate guide-looking output. The issue is that there is no way to automatically verify it or find issues in it. And there are lots of items like that.
jacobr1 · 3h ago
> You can't formally prove that some gene sequence is responsible for trait X etc.
Maybe not formally in some kind of mathematical sense. But you certainly could have simulation models of protein synthesis, and maybe even higher order simulation of tissues and organs. You could also let the ai scientist verify the experimental hypothesis by giving access to robotic lab processes. In fact it seems we are going down both fronts right now.
Yizahi · 3h ago
Nobody argues that LLMs aren't useful for some bulk processing of billions of datapoints or looking for obscure correlations in unedited data. But the premise of Gwern's article is that to be considered thinking, an LLM must initiate such a search on its own and arrive at a novel conclusion on its own.
Basically if:
A) Scientist has an idea > triggers an LLM program to sift through a ton of data > the LLM prints out correlation results > the scientist reads them and proves/disproves the idea. In this case, while the LLM did the bulk of the work, it did not arrive at a breakthrough on its own.
B) LLM is idling > the LLM triggers some API to get a specific set of data > the LLM correlates the results > the LLM prints out a complete hypothesis with proof (or disproves it). In this case we can say that the LLM made a breakthrough.
amelius · 6h ago
I think the problem here is that you assume the LLM has to operate isolated from the world, i.e. without interaction. If you put a human scientist in isolation, then you cannot have high expectations either.
Yizahi · 3h ago
I assume not that LLM would be isolated, I assume that LLM would be incapable of interacting in any meaningful way on its own (i.e. not triggered by direct input from a programmer).
yunohn · 5h ago
IME, on a daily basis, Claude Code (supposed SoTA agent) constantly disables and bypasses tests and checks on my codebase - despite following clear prompting guidelines and all the /woo/ like ultrathink etc.
I think if we can have a good enough simulation of reality, and a fast one. Something like an accelerable minecraft with real world physics. Then this idea might actually work.
But the hard reality we can currently generate efficiently and feed into LLMs usually has a narrow scope. It feels like teaching only textbook math to a kid for several years but nothing else. The LLM mostly overoptimizes in these very specific fields, but the overall performance might even be worse.
dpoloncsak · 5h ago
Its gotta be G-Mod
leetbulb · 5h ago
There will never be a computer powerful enough to simulate that many paperclips and explosive barrels.
imtringued · 6h ago
That didn't stop actor-critic from becoming one of the most popular deep RL methods.
zhangjunphy · 6h ago
True, and the successful ones usually require an external source of information.
For AlphaGo, it is the simple algorithm that decides who is the winner of a game of Go. For GANs, it is the images labeled by humans.
In these scenarios, the critic is the medium which transforms external information into gradient which optimized the actor, but not the direct source of that information.
blueflow · 10h ago
I have not yet seen AI doing a critical evaluation of data sources. AI will contradict primary sources if the contradiction is more prevalent in the training data.
Something about the whole approach is bugged.
My pet peeve: "Unix System Resources" as an explanation for the /usr directory is a term that did not exist until the turn of the millennium (rumor is that a c't journalist made it up in 1999), but AI will retcon it into the FHS (5 years earlier) or into Ritchie/Thompson/Kernighan (27 years earlier).
_heimdall · 7h ago
> Something about the whole approach is bugged.
The bug is that LLMs are fundamentally designed for natural language processing and prediction, not logic or reasoning.
We may get to actual AI eventually, but an LLM architecture either won't be involved at all or it will act as a part of the system mimicking the language center of a brain.
throwaway328 · 1h ago
The fact that LLMs haven't come up with anything "novel" would be a serious puzzle - as the article claims - only if they were thinking, reasoning, being creative, etc. If they aren't doing anything of the sort, it'd be the only thing you'd expect.
So it's a bit of an anti-climactic solution to the puzzle but: maybe the naysayers were right and they're not thinking at all, or doing any of the other anthropomorphic words being marketed to users, and we've simply all been dragged along by a narrative that's very seductive to tech types (the computer gods will rise!).
It'd be a boring outcome, after the countless gallons of digital ink spilled on the topic the last years, but maybe they'll come to be accepted as "normal software", and not god-like, in the end. A medium to large improvement in some areas, and anywhere from minimal to pointless to harmful in others. And all for the very high cost of all the funding and training and data-hoovering that goes in to them, not to mention the opportunity cost of all the things we humans could have been putting money into and didn't.
zby · 1h ago
The novelty part is a hard one - but maybe in many cases we could substitute something else for it? If an idea promises to beat state of the art in some field - and it is not yet actively researched - then it is novel.
But most promising would be to use the Dessalles theories.
By the way - this could be a classic example of this day dreaming - you take two texts: one by Gwern and some article by Dessalles (I read "Why we talk" - a great book! - but maybe there is some more concise article?) and ask LLM to generate ideas connecting these two.
In this particular case it was my intuition that connected them - but I imagine that there could be an algorithm that could find this connection in a reasonable time - some kind of semantic search maybe.
jumploops · 9h ago
How do you critique novelty?
The models are currently trained on a static set of human “knowledge” — even if they “know” what novelty is, they aren’t necessarily incentivized to identify it.
In my experience, LLMs currently struggle with new ideas, doubly true for the reasoning models with search.
What makes novelty difficult, is that the ideas should be nonobvious (see: the patent system). For example, hallucinating a simpler API spec may be “novel” for a single convoluted codebase, but it isn’t novel in the scope of humanity’s information bubble.
I’m curious if we’ll have to train future models on novelty deltas from our own history, essentially creating synthetic time capsules, or if we’ll just have enough human novelty between training runs over the next few years for the model to develop an internal fitness function for future novelty identification.
My best guess? This may just come for free in a yet-to-be-discovered continually evolving model architecture.
In either case, a single discovery by a single model still needs consensus.
Peer review?
n4r9 · 9h ago
It's a good question. A related question is: "what's an example of something undeniably novel?". Like if you ask an agent out of the blue to prove the Collatz conjecture, and it writes out a proof or counterexample. If that happens with LLMs then I'll be a lot more optimistic about the importance to AGI. Unfortunately, I suspect it will be a lot murkier than that - many of these big open questions will get chipped away at by a combination of computational and human efforts, and it will be impossible to pinpoint where the "novelty" lies.
jacobr1 · 3h ago
Good point. Look at patents. Few are truly novel in some exotic sense of "the whole idea is something never seen before." Most likely it is a combination of known factors applied in a new way, or incremental development improving on known techniques. In a banal sense, most LLM content generated is novel, in that the specific paragraphs might be unique combinations of words, even if the ideas are just slightly rearranged regurgitations.
So I strongly agree that, especially when we are talking about the bulk of human discovery and invention, the incrementalism will be increasingly within striking distance of human/AI collaboration. Attribution of the novelty in these cases is going to be unclear, when the task is, simplified, something like "search for combinations of things, in this problem domain, that do the task better than some benchmark", be that drug discovery, maths, AI itself or whatever.
zbyforgotp · 52m ago
I think our minds don't use novelty but salience, and salience also might be easier to implement.
A_D_E_P_T · 5h ago
> You are a creative synthesizer. Your task is to find deep, non-obvious,
> and potentially groundbreaking connections between the two following concepts.
> Do not state the obvious. Generate a hypothesis, a novel analogy,
> a potential research question, or a creative synthesis.
> Be speculative but ground your reasoning.
> Concept 1: {Chunk A}
> Concept 2: {Chunk B}
In addition to the other criticisms mentioned by posters ITT, a problem I see is: What concepts do you feed it?
Obviously there's a problem with GIGO. If you don't pick the right concepts to begin with, you're not going to get a meaningful result. But, beyond that, human discovery (in mechanical engineering, at least,) tends to be massively interdisciplinary and serendipitous, so that many concepts are often involved, and many of those are necessarily non-obvious.
I guess you could come up with a biomimetics bot, but, besides that, I'm not so sure how well this concept would work as laid out above.
There's another issue in that LLMs tend to be extremely gullible, and swallow the scientific literature and University press releases verbatim and uncritically.
zyklonix · 1h ago
This idea of a “daydreaming loop” hits on a key LLM gap, the lack of background, self-driven insight. A pragmatic step in this direction is https://github.com/DivergentAI/dreamGPT , which explores divergent thinking by generating and scoring hallucinations. It shows how we might start pushing LLMs beyond prompt-response into continuous, creative cognition.
khalic · 9h ago
Ugh, again with the anthropomorphizing. LLMs didn't come up with anything new because _they don't have agency_ and _do not reason_...
We're looking at our reflection and asking ourselves why it isn't moving when we don't
yorwba · 9h ago
If you look at your reflection in water, it may very well move even though you don't. Similarly, you don't need agency or reasoning to create something new, random selection from a large number of combinations is enough, correct horse battery staple.
Of course random new things are typically bad. The article is essentially proposing to generate lots of them anyway and try to filter for only the best ones.
RALaBarge · 5h ago
I agree that brute forcing is a method and how nature does it. The problem would still be the same, how would it or other LLMs know if the idea is novel and interesting?
Given access to unlimited data, LLMs likely could spot novel trends that we can't, but they still can't judge the value of creating something unique that they have never encountered before.
RALaBarge · 5h ago
Yet.
amelius · 8h ago
> anthropomorphizing
Gwern isn't doing that here. They say: "[LLMs] lack some fundamental aspects of human thought", and then investigates that.
johnfn · 11h ago
It's an interesting premise, but how many people
- are capable of evaluating the LLM's output to the degree that they can identify truly unique insights
- are prompting the LLM in such a way that it could produce truly unique insights
I've prompted an LLM upwards of 1,000 times in the last month, but I doubt more than 10 of my prompts were sophisticated enough to even allow for a unique insight. (I spend a lot of time prompting it to improve React code.) And of those 10 prompts, even if all of the outputs were unique, I don't think I could have identified a single one.
I very much do like the idea of the day-dreaming loop, though! I actually feel like I've had the exact same idea at some point (ironic) - that a lot of great insight is really just combining two ideas that no one has ever thought to combine before.
zyklonix · 1h ago
Totally agree, most prompts (especially for code) aren’t designed to surface novel insights, and even when they are, it’s hard to recognize them. That’s why the daydreaming loop is so compelling: it offloads both the prompting and the novelty detection to the system itself. Projects like https://github.com/DivergentAI/dreamGPT are early steps in that direction, generating weird idea combos autonomously and scoring them for divergence, without user prompting at all.
cantor_S_drug · 10h ago
> are capable of evaluating the LLM's output to the degree that they can identify truly unique insights
I noticed one behaviour in myself. I heard about a particular topic, because it was a dominant opinion in the infosphere. Then LLMs confirmed that dominant opinion (because it was heavily represented in the training) and I stopped my search for alternative viewpoints. So in a sense, LLMs are turning out to be another reflective mirror which reinforces existing opinion.
MrScruff · 7h ago
Yes, it seems like LLMs are system-one thinking taken to the extreme. Reasoning was supposed to introduce some actual logic, but you only have to play with these models for a short while to see that the reasoning tokens are a very soft constraint on the model's eventual output.
In fact, they're trained to please us and so in general aren't very good at pushing back. It's incredibly easy to 'beat' an LLM in an argument since they often just follow your line of reasoning (it's in the model's context, after all).
pilooch · 8h ago
AlphaEvolve and similar systems based on MAP-Elites + DL/LLM + RL appear to be one of the promising paths.
Setting up the map-elites dimensions may still be problem-specific but this could be learnt unsupervisedly, at least partially.
The way I see LLMs is as a search space within tokens that manipulates broad concepts within a complex and not-so-smooth manifold. These concepts can be refined within other spaces (pixel space, physical spaces, ...)
velcrovan · 6h ago
I’m once again begging people to read David Gelernter’s 1994 book “The Muse in the Machine”. I’m surprised to see no mention of it in Gwern’s post, it’s the exact book he should be reaching for on this topic.
In examining the possibility of genuinely creative computing, Gelernter discovers and defends a model of cognition that explains so much about the human experience of creativity, including daydreaming, dreaming, everyday “aha” moments, and the evolution of human approaches to spirituality.
This mirrors something I have thought of too. I have read multiple theories of emerging consciousness, which touch on things from proprioception to the inner monologue (which not everyone has.)
My own theory is that -- avoiding the need for an awareness of a monologue -- a LLM loop that constantly takes input and lets it run, saving key summarised parts to memory that are then pulled back in when relevant, would be a very interesting system to speak to.
It would need two loops: the constant ongoing one, and then for interaction, one accessing memories from the first. The ongoing one would be aware of the conversation. I think it would be interesting to see what, via the memory system, would happen in terms of the conversation emitting elements from the loop.
My theory is that if we're likely to see emergent consciousness, it will come through ongoing awareness and memory.
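A rough sketch of that two-loop setup, with `llm` as a hypothetical completion function and a deliberately naive keyword-overlap memory; the background loop keeps digesting the input stream into stored summaries, and the interaction loop pulls relevant ones back in before answering.

```python
from typing import Callable, List

class MemoryStore:
    """Naive store: keep short summaries, retrieve by keyword overlap."""
    def __init__(self):
        self.summaries: List[str] = []

    def add(self, summary: str) -> None:
        self.summaries.append(summary)

    def recall(self, query: str, k: int = 3) -> List[str]:
        words = set(query.lower().split())
        scored = sorted(self.summaries,
                        key=lambda s: len(words & set(s.lower().split())),
                        reverse=True)
        return scored[:k]

def background_loop(inputs: List[str], llm: Callable[[str], str], memory: MemoryStore) -> None:
    """Ongoing loop: digest each chunk of 'experience' into a stored summary."""
    for chunk in inputs:
        memory.add(llm(f"Summarize the key point of: {chunk}"))

def answer(question: str, llm: Callable[[str], str], memory: MemoryStore) -> str:
    """Interaction loop: pull relevant memories back in before answering."""
    context = "\n".join(memory.recall(question))
    return llm(f"Memories:\n{context}\n\nQuestion: {question}")
```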
sartak · 5h ago
From The Metamorphosis of Prime Intellect (1994):
> Among Prime Intellect's four thousand six hundred and twelve interlocking programs was one Lawrence called the RANDOM_IMAGINATION_ENGINE. Its sole purpose was to prowl for new associations that might fit somewhere in an empty area of the GAT. Most of these were rejected because they were useless, unworkable, had a low priority, or just didn't make sense. But now the RANDOM_IMAGINATION_ENGINE made a critical connection, one which Lawrence had been expecting it to make [...]
> Deep within one of the billions of copies of Prime Intellect, one copy of the Random_Imagination_Engine connected two thoughts and found the result good. That thought found its way to conscious awareness, and because the thought was so good it was passed through a network of Prime Intellects, copy after copy, until it reached the copy which had arbitrarily been assigned the duty of making major decisions -- the copy which reported directly to Lawrence. [...]
> "I've had an idea for rearranging my software, and I'd like to know what you think."
> At that Lawrence felt his blood run cold. He hardly understood how things were working as it was; the last thing he needed was more changes. "Yes?"
nsedlet · 2h ago
I believe an important reason for why there are no LLM breakthroughs is that humans make progress in their thinking through experimentation, i.e. collecting targeted data, which requires exerting agency on the real world. This isn't just observation, it's the creation of data not already in the training set.
haolez · 2h ago
Maybe also the fact that they can't learn small pieces of new information without "formatting" its whole brain again, from scratch. And fine tuning is like having a stroke, where you get specialization by losing cognitive capabilities.
_acco · 4h ago
This is a good way of framing that we don't understand human creativity. And that we can't hope to build it until we do.
i.e. AGI is a philosophical problem, not a scaling problem.
Though we understand them little, we know the default mode network and sleep play key roles. That is likely because they aid some universal property of AGI. Concepts we don't understand like motivation, curiosity, and qualia are likely part of the picture too. Evolution is far too efficient for these to be mere side effects.
(And of course LLMs have none of these properties.)
When a human solves a problem, their search space is not random - just like a chess grandmaster's search space of moves is not random.
How our brains are so efficient when problem solving while also able to generate novelty is a mystery.
cranium · 9h ago
I'd be happy to spend my Claude Max tokens during the night so it can "ultrathink" some Pareto improvements to my projects. So far, I've mostly seen lateral moves that rewrite code rather than rearchitecting/redesigning the project.
ramoz · 2h ago
I've walked 10k steps every day for the past week and produced more code in that period than most would over months, using Claude Code (and vibetunnel over tailscale to my phone, which I speak instructions into).
There is a breakthrough happening, in real time.
CaptainFever · 1h ago
Can we see an example please?
OtherShrezzing · 9h ago
Google's effort with AlphaEvolve shows that the Daydream Factory approach might not be the big unlock we're expecting. They spent an obscene amount of compute to discover a marginal improvement over the state of the art in a very narrow field. Hours after Google published the paper, mathematicians pointed out that their SOTA algorithms underperformed compared to techniques published 50 years ago.
Intuitively, it doesn't feel like scaling up to "all things in all fields" is going to produce substantial breakthroughs, if the current best-in-class implementation of the technique by the worlds leading experts returned modest results.
js8 · 5h ago
I am not sure why we tie this to any concrete AI technology such as LLMs. IMHO the biggest issue we have with AI right now is that we don't know how to philosophically formalize what we want. What is reasoning?
I am trying to answer that for myself. Since every logic is expressible in untyped lambda calculus (as any computation is), you could have a system that just somehow generates terms and beta-reduces them. Even in such a much simpler setting, what are the "interesting" terms?
I have several answers, but my point is, you should simplify the problem and this question has not been answered even under such simple scenario.
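To make that concrete, here is a small sketch of the simplified setting: untyped lambda terms with de Bruijn indices, one step of beta reduction, and a crude stand-in notion of "interesting" (the term got smaller when reduced), which is purely an illustrative assumption, not a serious proposal.

```python
# Untyped lambda terms with de Bruijn indices:
#   ('var', n) | ('lam', body) | ('app', f, a)

def shift(t, d, cutoff=0):
    """Shift free variables of t by d (indices below cutoff are bound)."""
    tag = t[0]
    if tag == 'var':
        return ('var', t[1] + d) if t[1] >= cutoff else t
    if tag == 'lam':
        return ('lam', shift(t[1], d, cutoff + 1))
    return ('app', shift(t[1], d, cutoff), shift(t[2], d, cutoff))

def subst(t, j, s):
    """Substitute term s for variable index j inside t."""
    tag = t[0]
    if tag == 'var':
        return s if t[1] == j else t
    if tag == 'lam':
        return ('lam', subst(t[1], j + 1, shift(s, 1)))
    return ('app', subst(t[1], j, s), subst(t[2], j, s))

def beta_step(t):
    """One leftmost-outermost beta-reduction step; None if t is in normal form."""
    if t[0] == 'app' and t[1][0] == 'lam':
        return shift(subst(t[1][1], 0, shift(t[2], 1)), -1)
    if t[0] == 'app':
        f = beta_step(t[1])
        if f is not None:
            return ('app', f, t[2])
        a = beta_step(t[2])
        return ('app', t[1], a) if a is not None else None
    if t[0] == 'lam':
        b = beta_step(t[1])
        return ('lam', b) if b is not None else None
    return None

def size(t):
    return 1 if t[0] == 'var' else size(t[1]) + (size(t[2]) if t[0] == 'app' else 0) + 1

# (\x. x) (\y. y) reduces to the identity, strictly smaller than the redex.
identity = ('lam', ('var', 0))
term = ('app', identity, identity)
reduced = beta_step(term)
print(size(term), '->', size(reduced))  # crude "interestingness": it got smaller
```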
HarHarVeryFunny · 4h ago
Reasoning is chained what-if prediction, together with exploration of alternatives (cf backtracking), and leans upon general curiosity/learning for impasse resolution (i.e. if you can't predict what-if, then have the curiosity to explore and find out).
What the LLM companies are currently selling as "reasoning" is mostly RL-based pre-training whereby the model is encouraged to predict tokens (generate reasoning steps) according to similar "goals" seen in the RL training data. This isn't general case reasoning, but rather just "long horizon" prediction based on the training data. It helps exploit the training data, but isn't going to generate novelty outside of the deductive closure of the training data.
js8 · 3h ago
I am talking about reasoning in philosophical not logical sense. In your definition, you're assuming a logic in which reasoning happens, but when I am asking the question, I am not presuming any specific logic.
So how do you pick the logic in which to do reasoning? There are "good reasons" to use one logic over another.
LLMs probably learn some combination of logic rules (deduction rules in commonly used logics), but cannot guarantee they will be used consistently (i.e. choose a logic for the problem and stick to it). How do you accomplish that?
And even then reasoning is more than search. If you can reason, you should also be able to reason about more effective reasoning (for example better heuristics to cutting the search tree).
HarHarVeryFunny · 3h ago
OK, so maybe we're talking somewhat at cross purposes.
I was talking about the process/mechanism of reasoning - how do our brains appear to implement the capability that we refer to as "reasoning", and by extension how could an AI do the same by implementing the same mechanisms.
If we accept prediction (i.e use of past experience) as the mechanistic basis of reasoning, then choice of logic doesn't really come into it - it's more just a matter of your past experience and what you have learnt. What predictive rules/patterns have you learnt, both in terms of a corpus of "knowledge" you can bring to bear, but also in terms of experience with the particular problem domain - what have you learnt (i.e. what solution steps can you predict) about trying to reason about any given domain/goal ?
In terms of consistent use of logic, and sticking to it, one of the areas where LLMs are lacking is in not having any working memory other than their own re-consumed output, as well as an inability to learn beyond pre-training. With both of these capabilities an AI could maintain a focus (working memory) on the problem at hand (vs suffer from "context rot") and learn consistent, or phased/whatever, logic that has been successful in the past at solving similar problems (i.e predicting actions that will lead to solution).
js8 · 2h ago
But prediction as the basis for reasoning (in epistemological sense) requires the goal to be given from the outside, in the form of the system that is to be predicted. And I would even say that this problem (giving predictions) has been solved by RL.
Yet, the consensus seems to be we don't quite have AGI; so what gives? Clearly just making good predictions is not enough. (I would say current models are empiricist to the extreme; but there is also rationalist position, which emphasizes logical consistency over prediction accuracy.)
So, in my original comment, I lament that we don't really know what we want (what is the objective). The post doesn't clarify much either. And I claim this issue occurs with much simpler systems, such as lambda calculus, than reality-connected LLMs.
HarHarVeryFunny · 1h ago
> But prediction as the basis for reasoning (in epistemological sense) requires the goal to be given from the outside, in the form of the system that is to be predicted.
Prediction doesn't have goals - it just has inputs (past and present) and outputs (expected inputs). Something that is on your mind (perhaps a "goal") is just a predictive input that will cause you to predict what happens next.
> And I would even say that this problem (giving predictions) has been solved by RL.
Making predictions is of limited use if you don't have the feedback loop of when your predictions are right or wrong (so update prediction for next time), and having the feedback (as our brain does) of when your prediction is wrong is the basis of curiosity - causing us to explore new things and learn about them.
> Yet, the consensus seems to be we don't quite have AGI; so what gives? Clearly just making good predictions is not enough.
Prediction is important, but there are lots of things missing from LLMs such as ability to learn, working memory, innate drives (curiosity, boredom), etc.
LourensT · 8h ago
Regardless of accusations of anthropomorphizing, continual thinking seems to be a precursor to any sense of agency, simply because agency requires something to be running.
Eventually LLM output degrades when most of the context is its own output. So should there also be an input stream of experience? The proverbial "staring out the window", fed into the model to keep it grounded and give hooks to go off?
NitpickLawyer · 11h ago
Something I haven't seen explored, but I think could perhaps help is to somehow introduce feedback regarding the generation into the context, based on things that are easily computed w/ other tools (like perplexity). In "thinking" models we see a lot of emerging behaviour like "perhaps I should, but wait, this seems wrong", etc. Perhaps adding some signals at regular? intervals could help in surfacing the correct patterns when they are needed.
There's a podcast I listened to ~1.5 years ago, where a team used GPT2, further trained on a bunch of related papers, and used snippets + perplexity to highlight potential errors. I remember them having some good accuracy when analysed by humans. Perhaps this could work at a larger scale? (a sort of "surprise" factor)
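A rough sketch of that "surprise" signal, assuming the Hugging Face transformers library with stock GPT-2 standing in for the paper-tuned model from the podcast: score each snippet by its perplexity under the model and flag the high scorers for human review.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of text under the model (higher = more surprising)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token cross-entropy
    return math.exp(loss.item())

snippets = [
    "Water boils at 100 degrees Celsius at sea level.",
    "Water boils at 40 degrees Celsius at sea level.",
]
for s in snippets:
    print(round(perplexity(s), 1), s)
# The second sentence should score noticeably higher, i.e. get flagged as "surprising".
```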
kookamamie · 6h ago
> The puzzle is why
The breakthrough isn't in their datasets.
amelius · 7h ago
Humans daydream about problems when they think a problem is interesting. Can an LLM know when a problem is interesting and thereby prune the daydream graph?
zild3d · 7h ago
> The puzzle is why
The feedback loop on novel/genuine breakthroughs is too long and the training data is too small.
Another reason is that there's plenty of incentive to go after the majority of the economy which relies on routine knowledge and maybe judgement, a narrow slice actually requires novel/genuine breakthroughs.
dr_dshiv · 6h ago
Yes! I’ve been prototyping dreaming LLMs based on my downloaded history—and motivated by biomimetic design approaches. Just to surface ideas to myself again.
apples_oranges · 11h ago
If the breakthrough comes, most if not all links on HN will be to machine generated content. But so far it seems that the I in current AI is https://www.youtube.com/watch?v=uY4cVhXxW64 ..
zwaps · 12h ago
Wasn't this already implemented in some agents?
I seem to remember hearing about it in several podcasts
aredox · 11h ago
Oh, in the middle of "AI is PhD-level" propaganda (just check Google News to see this is not a strawman argument), some people finally admit in passing "no LLM has ever made a breakthrough".
I agree there's an equivocation going on for "PhD level" between "so smart, it could get a PhD" (as in come up with and publish new research and defend its own thesis) and "it can solve quizzes at the level that PhDs can".
washadjeffmad · 5h ago
Services that make this claim are paying people with PhDs to ask their models questions and then provide feedback on the responses with detailed reasoning.
yahoozoo · 3h ago
I once asked ChatGPT to come up with a novel word that would return 0 Google search results. It came up with “vexlithic” which does indeed return 0 results, at least for me. I thought that was neat.
guelo · 8h ago
In a recent talk [0] Francois Chollet made it sound like all the frontier models are doing Test-Time Adaptation, which I think is a similar concept to Dynamic evaluation that Gwern says is not being done. Apparently Test-Time Adaptation encompasses several techniques some of which modify model weights and some that don't, but they are all about on-the-fly learning.
Seems like an easy hypothesis to quickly smoke test with a couple hundred lines of script, a wikipedia index, and a few grand thrown at an API.
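A minimal smoke test of the weight-updating flavour of this might look like the sketch below: take a small LM, do a few gradient steps on text it has just "seen", and check whether prediction of a related continuation improves. GPT-2, the example strings, and every hyperparameter here are arbitrary stand-ins, not anything Chollet or Gwern actually proposes:

    # Dynamic-evaluation-style test-time adaptation, in miniature.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    context = "The newly discovered element was named oganesson after Yuri Oganessian."
    continuation = "Oganesson is the heaviest element in the periodic table."

    def nll(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()

    before = nll(continuation)

    # the "adaptation": a handful of SGD steps on the context the model just saw
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)
    ids = tok(context, return_tensors="pt").input_ids
    for _ in range(5):
        loss = model(ids, labels=ids).loss
        opt.zero_grad(); loss.backward(); opt.step()

    after = nll(continuation)
    print(f"NLL on continuation before: {before:.3f}, after: {after:.3f}")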
precompute · 9h ago
Variations on increasing compute and filtering results aside, the only way out of this rut is another breakthrough as big, or bigger than transformers. A lot of money is being spent on rebranding practical use-cases as innovation because there's severe lack of innovation in this sphere.
cs702 · 7h ago
The question is: How do we get LLMs to have "Eureka!" moments, on their own, when their minds are "at rest," so to speak?
The OP's proposed solution is a constant "daydreaming loop" in which an LLM does the following on its own, "unconsciously," as a background task, without human intervention:
1) The LLM retrieves random facts.
2) The LLM "thinks" (runs a chain-of-thought) on those retrieved facts to see if there are any interesting connections between them.
3) If the LLM finds interesting connections, it promotes them to "consciousness" (a permanent store) and possibly adds them to a dataset used for ongoing incremental training.
It could work.
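A minimal sketch of steps 1-3, assuming an OpenAI-style chat API; the model name, prompts, facts file, and scoring threshold are all placeholders rather than the OP's actual design:

    import json, random
    from openai import OpenAI

    client = OpenAI()
    FACTS = json.load(open("facts.json"))   # assumed: a list of short fact strings
    promoted = []                           # the "permanent store"

    def daydream_once():
        a, b = random.sample(FACTS, 2)      # 1) retrieve random facts
        cot = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Fact A: {a}\nFact B: {b}\n"
                                  "Think step by step: is there a non-obvious, "
                                  "useful connection between these facts?"}],
        ).choices[0].message.content        # 2) chain-of-thought over the pair
        score = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": "Rate the novelty and usefulness of this idea "
                                  f"from 0 to 10. Reply with a number only.\n\n{cot}"}],
        ).choices[0].message.content
        try:
            if float(score) >= 8:           # 3) promote the rare hits
                promoted.append({"facts": (a, b), "idea": cot})
        except ValueError:
            pass

    for _ in range(100):
        daydream_once()
    print(f"{len(promoted)} candidate insights promoted")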
epcoa · 7h ago
Step 3 has been shown not to work over and over again; the "find interesting connections" part is the hand-wavy magic at this time. LLMs alone don't seem to be particularly adept at it either.
cs702 · 4h ago
Has this been tried with reinforcement learning (RL)? As the OP notes, it is plausible from an RL perspective that such a bootstrap can work, because it would be (quoting the OP) "exploiting the generator-verifier gap, where it is easier to discriminate than to generate (eg laughing at a pun is easier than making it)." The hit ratio may be tiny, so doing this well would be very expensive.
The same is true of humanity in aggregate. We attribute discoveries to an individual or group of researchers but to claim humans are efficient at novel research is a form of survivorship bias. We ignore the numerous researchers who failed to achieve the same discoveries.
Take yourself outside of that, and imagine you invented Earth, added an ecosystem, and some humans. Wheels were invented ~6k years ago, and "humans" have existed for ~40-300k years. We can do the same for other technologies. As a group, we are incredibly inefficient, and an outside observer would see our efforts at building societies, and our failures, as "brute force".
We have a thin conception of genius that can be challenged by Edison's "1% inspiration, 99% perspiration" or the process of getting a PhD, where you might spend 7 years getting to the point where you can start adding new knowledge and then take another 7 years to really hit your stride.
I have a friend who is 50-something and disabled with some mental illness; he thinks he has ADHD. We had a conversation recently where he repeatedly expressed his fantasy that he could show up somewhere with his unique perspective, sprinkle some pixie dust on their problems, and be rewarded for it. When I hear his ideas, or any idea, I immediately think "how would we turn this into a product and sell it?" or "write a paper about it?" or "convince people of it?" He would have no part of that, thinks it is uninteresting, and assumes somebody else would do all that work; and my answer is: they might, if you're willing and able to do a whole lot of advocacy.
And it comes down to that.
If an LLM were to come up with a groundbreaking idea and be recognized as having a groundbreaking idea it would have to do a sustained amount of work, say at least 2 person × years equivalent to win people over. And they aren't anywhere near equipped to do that, nobody is going to pay the power bill to do that, and if you were paying the power bill you'd probably have to pay the power bill for a million of them to go off in the wrong direction.
The other day, Claude Code started adding a small signature to the commit messages it was preparing for me. It said something like “This commit was co-written with Claude Code” and a little robot emoji
I wonder if that just happened by accident or if Anthropic is trying to do something like Apple with the “sent from my iPhone”
We haven't successfully resolved famous unsolved research problems through language models yet, but one can imagine that they will solve increasingly challenging problems over time. And if it happens in the hands of a researcher rather than the model's lab, one can also imagine that the researcher will take credit, so you will still have the same question.
Show me the code. Show me your finished product.
After all, the LLM currently has no cognizance; it is unable to understand what it is saying in a meaningful way. At its best it is a p-zombie (philosophical zombie) machine, right?
In my opinion anything amazing that comes from an LLM only becomes amazing when someone who was capable of recognizing the amazingness perceives it, like a rewrite of a zen koan, "If an LLM generates a new work of William Shakespeare, and nobody ever reads it, was anything of value lost?"
I don't see why this is remotely surprising. Despite all the hoopla, LLMs are not AGI or artifical brains - they are predict-next-word language models. By design they are not built for creativity, but rather quite the opposite, they are designed to continue the input in the way best suggested by the training data - they are essentially built for recall, not creativity.
For an AI to be creative it needs to have innate human/brain-like features such as novelty (prediction failure) driven curiosity, boredom, as well as ability to learn continuously. IOW if you want the AI to be creative it needs to be able to learn for itself, not just regurgitate the output of others, and have these innate mechanisms that will cause it to pursue discovery.
Tobias Rees had some interesting thoughts https://www.noemamag.com/why-ai-is-a-philosophical-rupture/ where he poses this idea that AI and humans together can think new types of thoughts that humans alone cannot think.
Without moving beyond LLMs to a more brain-like cognitive architecture, all you can do is squeeze the juice out of the training data, by using RL/etc to bias the generative process (according to reasoning data, good taste or whatever), but you can't move beyond the training data to be truly creative.
I don't think LLMs are AGI, but in most senses I don't think people give enough credit to their capabilities.
It's just ironic how human-like the flaws of the system are. (Hallucinations that are asserting untrue facts, just because they are plausible from a pattern matching POV)
I think most human mistakes are different - not applying a lot of complex logic to come to an incorrect deduction/guess (= LLM hallucination), but rather just shallow recall/guess. e.g. An LLM would guess/hallucinate a capital city by using rules it had learnt about other capital cities - must be famous, large, perhaps have an airport, etc, etc; a human might just use "famous" to guess, or maybe just throw out the name of the only city they can associate to some country/state.
The human would often be aware that they are just guessing, maybe based on not remembering where/how they had learnt this "fact", but to the LLM it's all just statistics and it has no episodic memory (or even coherent training data - it's all sliced and diced into shortish context-length samples) to ground what it knows or does not know.
All human ideas are a combination of previously seen ideas. If you disagree, come up with a truly new conception which is not. -- Badly quoted David Hume
OK now we're at an impasse until someone can measure this
That said, I think most everyday situations are similar enough to things we've experienced before that shallow pattern matching is all it takes. The curve in the road we're driving on may not be 100% the same as any curve we've experienced before, but turning the car wheel to the left the way we've learnt to do it will let us successfully navigate it all the same.
Most everyday situations/problems we're faced with are familiar enough that shallow "reactive" behavior is good enough - we rarely have to stop to develop a plan, figure things out, or reason in any complex kind of a way, and very rarely face situations so challenging that any real creativity is needed.
So what. 90% (or more) of humans aren't making any sort of breakthrough in any discipline, either. 99.9999999999% of human speech/writing isn't producing "breakthroughs" either, it's just a way to communicate.
>It's just ironic how human-like the flaws of the system are. (Hallucinations that are asserting untrue facts, just because they are plausible from a pattern matching POV)
The LLM is not "hallucinating". It's just operating as it was designed to do, which often produces results that do not make any sense. I have actually hallucinated, and some of those experiences were profoundly insightful, quite the opposite of what an LLM does when it "hallucinates".
You can call anything a "breakthrough" if you aren't aware of prior art. And LLMs are "trained" on nothing but prior art. If an LLM does make a "breakthrough", then it's because the "breakthrough" was already in the training data. I have no doubt many of these "breakthroughs" will be followed years later by someone finding the actual human-based research that the LLM consumed in its training data, rendering the "breakthrough" not quite as exciting.
It's clear these models can actually reason on unseen problems and if you don't believe that you aren't actually following the field.
For example, an LLM might learn a rule that sentences similar to "A is given. From A follows B." are followed by the statement "Therefore, B". This is modus ponens. The LLM can apply this rule to a wide variety of A and B, producing novel statements. Yet, these statements are still the statistically probable ones.
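For concreteness, here is that rule written as a formal inference step (a trivial Lean 4 example):

    -- Modus ponens: from a proof of A and a proof of A → B, conclude B.
    example (A B : Prop) (hA : A) (hAB : A → B) : B := hAB hA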
I think the problem is, when people say "AI should produce something novel" (or "are producing", depending on whether they advocate or dismiss), they are not very clear what "novel" actually means. Mathematically, it's very easy to produce a never-before-seen theorem; but is it interesting? Probably not.
So there is quite a lot missing.
The term "deductive closure" has been used to describe what LLMs are capable of, and therefore what they are not capable of. They can generate novelty (e.g. new poem) by applying the rules they have learnt in novel ways, but are ultimately restricted by their fixed weights and what was present in the training data, as well as being biased to predict rather than learn (which they anyways can't!) and explore.
An LLM may do a superhuman job of applying what it "knows" to create solutions to novel goals (be that a math olympiad problem, or some type of "creative" output that has been requested, such as a poem), but is unlikely to create a whole new field of math that wasn't hinted at in the training data because it is biased to predict, and anyways doesn't have the ability to learn that would allow it to build a new theory from the ground up one step at a time. Note (for anyone who might claim otherwise) that "in-context learning" is really a misnomer - it's not about learning but rather about using data that is only present in-context rather than having been in the training set.
This is just a completely base level of understanding of LLMs. How do you predict the next token with superhuman accuracy? Really think about how that is possible. If you think it's just stochastic parroting you are ngmi.
> large language models have yet to produce a genuine breakthrough. The puzzle is why.
I think you should really update on the fact that world class researchers are surprised by this. They understand something you don't, which is that these models clearly build robust world models, and that text prompts act as probes into those world models. The surprising part is that despite these sophisticated world models we can't seem to get unique insights out, which almost surely already exist in those models. Even if all the model is capable of is memorizing text, then just the sheer volume it has memorized should yield unique insights; no human can ever hope to hold this much text in their memory and then make connections between it.
It's possible we just lack the prompt creativity to get these insights out but nevertheless there is something strange happening here.
Yes, thank you, I do understand how LLMs work. They learn a lot of generative rules from the training data, and will apply them in flexible fashion according to the context patterns they have learnt. You said stochastic parroting, not me.
However, we're not discussing whether LLMs can be superhuman at tasks where they had the necessary training - we're discussing whether they are capable of creativity (and presumably not just the trivially obvious case of being able to apply their generative rules in any order - deductive closure, not stochastic parroting in the dumbest sense of that expression).
> and a critic model filters the results for genuinely valuable ideas.
In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the performance of the model actually degrades in this process, as the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.
So the reason that we don't hear too much about this idea is not that nobody tried it. But that they tried, and it didn't work, and people are reluctant to publish about something which does not work.
This not only affects a potential critic model, but the entire concept of a "reasoning" model is based on the same flawed idea—that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops, without actually arriving at anything.
The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
I don't trust most benchmarks, but if this can be easily confirmed by an apples-to-apples comparison, then I would be inclined to believe it.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
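As I understand Anthropic's current API, the toggle is the extended-thinking parameter, so the same model id can be queried with and without a thinking budget. A rough sketch; the model id and token budgets are assumptions, so check the docs before relying on it:

    import anthropic

    client = anthropic.Anthropic()          # expects ANTHROPIC_API_KEY in the environment
    MODEL = "claude-opus-4-20250514"        # assumed model id
    PROMPT = "Find the bug in this function: ..."  # paste the real task here

    def run(prompt: str, thinking: bool) -> str:
        kwargs = dict(model=MODEL, max_tokens=4096,
                      messages=[{"role": "user", "content": prompt}])
        if thinking:
            # extended thinking: the budget must stay below max_tokens
            kwargs["thinking"] = {"type": "enabled", "budget_tokens": 2048}
        msg = client.messages.create(**kwargs)
        # keep only the visible text blocks, skipping any "thinking" blocks
        return "".join(b.text for b in msg.content if b.type == "text")

    print("--- reasoning off ---\n" + run(PROMPT, thinking=False))
    print("--- reasoning on  ---\n" + run(PROMPT, thinking=True))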
With code, for example, a single shot without reasoning might hallucinate a package or not conform to the rest of the project's style. Then you ask the LLM to check. Then you ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically allows it to self-check for those self-correctable features.
When generating content, you can ask it to consider or produce intermediate deliverables like summaries of input documents that it then synthesizes into the whole. With reasoning on, it can do the intermediate steps and then use that.
The main advantage is that the system is autonomously figuring out a bunch of intermediate steps and working through it. Again no better than it probably could do with some guidance on multiple interactions - but that itself is a big productivity benefit. The second gen (or really 1.5 gen) reasoning models also seem to have been trained on enough reasoning traces that they are starting to know about additional factors to consider so the reasoning loop is tighter.
> There is no such external validator for new theorems
There are formal logic languages that will allow you to do this.
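A toy example of such a validator, in Lean 4: the kernel either accepts the proof or rejects it, with no human judgement involved (the theorem itself is arbitrary):

    theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b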
On the other hand, any novel science usually requires deep and wide exploratory research, often involving hard or flawed experimentation or observation. One can train an LLM on a PhD curriculum in astrophysics, then provide that LLM with an API to some new observatory and instruct it to "go prove the cosmological constant". And it will do so, but the result will be generated garbage, because there is no formal way to prove such results. There is no formal way to prove why pharaohs decided to stop building pyramids, despite there being some decent theories. This is science too, you know. You can't formally prove that some gene sequence is responsible for trait X, etc.
I would say a majority of science is not formally provable.
And lastly, you dismiss books/texts, but that is a huge chunk of the intellectual and creative work of humans. Say you are an engineer and you have a CAD model with a list of parts and parameters for a rocket, for example. Now you need to write a guide for it. An LLM can do that; it can generate guide-looking output. The issue is that there is no way to automatically verify it or find issues in it. And there are lots of items like that.
Maybe not formally, in some kind of mathematical sense. But you certainly could have simulation models of protein synthesis, and maybe even higher-order simulations of tissues and organs. You could also let the AI scientist verify experimental hypotheses by giving it access to robotic lab processes. In fact it seems we are going down both fronts right now.
Basically if:
A) Scientist has an idea > triggers LLM program to sift through a ton of data > LLM prints out correlation results > scientist reads them and proves/disproves the idea. In this case, while the LLM did the bulk of the work, it did not arrive at a breakthrough on its own.
B) LLM is idling > then LLM triggers some API to get some specific set of data > LLM correlates results > LLM prints out a complete hypothesis with proof (or disproves it). In this case we can say that the LLM made a breakthrough.
Something about the whole approach is bugged.
My pet peeve: "Unix System Resources" as an explanation for the /usr directory is a term that did not exist until the turn of the millennium (rumor is that a c't journalist made it up in 1999), but AI will retcon it into the FHS (5 years earlier) or into Ritchie/Thompson/Kernighan (27 years earlier).
The bug is that LLMs are fundamentally designed for natural language processing and prediction, not logic or reasoning.
We may get to actual AI eventually, but an LLM architecture either won't be involved at all or it will act as a part of the system mimicking the language center of a brain.
So it's a bit of an anti-climactic solution to the puzzle but: maybe the naysayers were right and they're not thinking at all, or doing any of the other anthropomorphic words being marketed to users, and we've simply all been dragged along by a narrative that's very seductive to tech types (the computer gods will rise!).
It'd be a boring outcome, after the countless gallons of digital ink spilled on the topic the last years, but maybe they'll come to be accepted as "normal software", and not god-like, in the end. A medium to large improvement in some areas, and anywhere from minimal to pointless to harmful in others. And all for the very high cost of all the funding and training and data-hoovering that goes in to them, not to mention the opportunity cost of all the things we humans could have been putting money into and didn't.
But most promising would be to use Dessalles' theories.
Here is 4.1o expanding this: https://chatgpt.com/s/t_6877de9faa40819194f95184979b5b44
By the way - this could be a classic example of this day dreaming - you take two texts: one by Gwern and some article by Dessalles (I read "Why we talk" - a great book! - but maybe there is some more concise article?) and ask LLM to generate ideas connecting these two. In this particular case it was my intuition that connected them - but I imagine that there could be an algorithm that could find this connection in a reasonable time - some kind of semantic search maybe.
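A crude version of that semantic search is easy to sketch: embed passages from the two texts and surface the most similar cross-text pairs as candidate connections to hand to the LLM. The embedding model below is just a common default, and the passage lists are placeholders for real paragraphs from the two sources:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    gwern_passages = ["Passage from the Gwern essay goes here."]
    dessalles_passages = ["Passage from the Dessalles text goes here."]

    emb_a = model.encode(gwern_passages, convert_to_tensor=True)
    emb_b = model.encode(dessalles_passages, convert_to_tensor=True)
    sims = util.cos_sim(emb_a, emb_b)       # cosine similarity matrix

    # top 5 cross-text pairs by similarity, as candidate "connections"
    pairs = [(float(sims[i][j]), i, j)
             for i in range(len(gwern_passages))
             for j in range(len(dessalles_passages))]
    for score, i, j in sorted(pairs, reverse=True)[:5]:
        print(f"{score:.2f}  {gwern_passages[i][:60]!r} <-> {dessalles_passages[j][:60]!r}")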
The models are currently trained on a static set of human “knowledge” — even if they “know” what novelty is, they aren’t necessarily incentivized to identify it.
In my experience, LLMs currently struggle with new ideas, doubly true for the reasoning models with search.
What makes novelty difficult, is that the ideas should be nonobvious (see: the patent system). For example, hallucinating a simpler API spec may be “novel” for a single convoluted codebase, but it isn’t novel in the scope of humanity’s information bubble.
I’m curious if we’ll have to train future models on novelty deltas from our own history, essentially creating synthetic time capsules, or if we’ll just have enough human novelty between training runs over the next few years for the model to develop an internal fitness function for future novelty identification.
My best guess? This may just come for free in a yet-to-be-discovered continually evolving model architecture.
In either case, a single discovery by a single model still needs consensus.
Peer review?
So I strongly agree that, especially when we are talking about the bulk of human discovery and invention, the incrementalism will be increasingly within striking distance of human/AI collaboration. Attribution of the novelty in these cases is going to be unclear when the task is, simplified, something like "search for combinations of things, in this problem domain, that do the task better than some benchmark", be that drug discovery, maths, AI itself or whatever.
> Concept 1: {Chunk A}
> Concept 2: {Chunk B}
In addition to the other criticisms mentioned by posters ITT, a problem I see is: What concepts do you feed it?
Obviously there's a problem with GIGO. If you don't pick the right concepts to begin with, you're not going to get a meaningful result. But, beyond that, human discovery (in mechanical engineering, at least) tends to be massively interdisciplinary and serendipitous, so that many concepts are often involved, and many of those are necessarily non-obvious.
I guess you could come up with a biomimetics bot, but, besides that, I'm not so sure how well this concept would work as laid out above.
There's another issue in that LLMs tend to be extremely gullible, and swallow the scientific literature and University press releases verbatim and uncritically.
We're looking at our reflection and asking ourselves why it isn't moving when we don't
Of course random new things are typically bad. The article is essentially proposing to generate lots of them anyway and try to filter for only the best ones.
Given access to unlimited data, LLMs likely could spot novel trends that we can't, but they still can't judge the value of creating something unique that they have never encountered before.
Gwern isn't doing that here. They say: "[LLMs] lack some fundamental aspects of human thought", and then investigates that.
This assumes that the people prompting LLMs:
- are capable of evaluating the LLM's output to the degree that they can identify truly unique insights
- are prompting the LLM in such a way that it could produce truly unique insights
I've prompted an LLM upwards of 1,000 times in the last month, but I doubt more than 10 of my prompts were sophisticated enough to even allow for a unique insight. (I spend a lot of time prompting it to improve React code.) And of those 10 prompts, even if all of the outputs were unique, I don't think I could have identified a single one.
I very much do like the idea of the day-dreaming loop, though! I actually feel like I've had the exact same idea at some point (ironic) - that a lot of great insight is really just combining two ideas that no one has ever thought to combine before.
I noticed one behaviour in myself. I heard about a particular topic, because it was a dominant opinion in the infosphere. Then LLMs confirmed that dominant opinion (because it was heavily represented in the training) and I stopped my search for alternative viewpoints. So in a sense, LLMs are turning out to be another reflective mirror which reinforces existing opinion.
In fact, they're trained to please us and so in general aren't very good at pushing back. It's incredibly easy to 'beat' an LLM in an argument since they often just follow your line of reasoning (it's in the model's context, after all).
Setting up the MAP-Elites dimensions may still be problem-specific, but this could be learned in an unsupervised way, at least partially.
The way I see LLMs is as a search space within tokens that manipulates broad concepts within a complex and not-so-smooth manifold. These concepts can be refined within other spaces (pixel space, physical spaces, ...)
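For anyone unfamiliar with MAP-Elites, a toy loop makes the "dimensions" point concrete: the archive is keyed by a low-dimensional behaviour descriptor and each cell keeps only its best solution. Everything below (fitness, descriptor, bin count) is a stand-in:

    import random

    BINS = 10                      # resolution of each behaviour dimension

    def fitness(x):                # stand-in objective
        return -sum(v * v for v in x)

    def descriptor(x):             # stand-in 2-D behaviour descriptor in [0, 1)^2
        return (abs(x[0]) % 1.0, abs(x[1]) % 1.0)

    def cell(desc):                # map a descriptor to a discrete archive cell
        return tuple(min(int(d * BINS), BINS - 1) for d in desc)

    archive = {}                   # cell -> (fitness, solution)

    def try_insert(x):
        key, f = cell(descriptor(x)), fitness(x)
        if key not in archive or f > archive[key][0]:
            archive[key] = (f, x)

    # seed with random solutions, then keep mutating elites
    for _ in range(100):
        try_insert([random.uniform(-2, 2) for _ in range(4)])
    for _ in range(10_000):
        _, parent = random.choice(list(archive.values()))
        try_insert([v + random.gauss(0, 0.1) for v in parent])

    print(f"{len(archive)} cells filled out of {BINS * BINS}")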
In examining the possibility of genuinely creative computing, Gelernter discovers and defends a model of cognition that explains so much about the human experience of creativity, including daydreaming, dreaming, everyday “aha” moments, and the evolution of human approaches to spirituality.
https://uranos.ch/research/references/Gelernter_1994/Muse%20...
This mirrors something I have thought of too. I have read multiple theories of emerging consciousness, which touch on things from proprioception to the inner monologue (which not everyone has.)
My own theory is that -- avoiding the need for an awareness of a monologue -- an LLM loop that constantly takes input and lets it run, saving key summarised parts to memory that are then pulled back in when relevant, would be a very interesting system to speak to.
It would need two loops: the constant ongoing one, and then for interaction, one accessing memories from the first. The ongoing one would be aware of the conversation. I think it would be interesting to see what, via the memory system, would happen in terms of the conversation emitting elements from the loop.
My theory is that if we're likely to see emergent consciousness, it will come through ongoing awareness and memory.
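A minimal sketch of that two-loop idea, with the LLM call stubbed out and a naive recency-based memory; none of this is a claim about how such a system should actually be built:

    import threading, time, queue

    memories: list[str] = []              # shared long-term store
    experience = queue.Queue()            # the "staring out the window" input stream

    def llm(prompt: str) -> str:
        # stub: replace with a call to any chat-completion API
        return f"(model output for: {prompt[:40]}...)"

    def inner_loop():
        """Ongoing loop: digest raw experience, save one-line summaries as memories."""
        while True:
            observation = experience.get()
            memories.append(llm(f"Summarise what matters here in one line:\n{observation}"))

    def converse(user_message: str) -> str:
        """Interaction loop: answer the user with recent memories pulled back in."""
        recalled = "\n".join(memories[-10:])      # naive recency-based recall
        return llm(f"Relevant memories:\n{recalled}\n\nUser: {user_message}\nReply:")

    threading.Thread(target=inner_loop, daemon=True).start()
    experience.put("A long pause; rain on the window; yesterday's bug is still open.")
    time.sleep(0.1)
    print(converse("What have you been thinking about?"))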
> Among Prime Intellect's four thousand six hundred and twelve interlocking programs was one Lawrence called the RANDOM_IMAGINATION_ENGINE. Its sole purpose was to prowl for new associations that might fit somewhere in an empty area of the GAT. Most of these were rejected because they were useless, unworkable, had a low priority, or just didn't make sense. But now the RANDOM_IMAGINATION_ENGINE made a critical connection, one which Lawrence had been expecting it to make [...]
> Deep within one of the billions of copies of Prime Intellect, one copy of the Random_Imagination_Engine connected two thoughts and found the result good. That thought found its way to conscious awareness, and because the thought was so good it was passed through a network of Prime Intellects, copy after copy, until it reached the copy which had arbitrarily been assigned the duty of making major decisions -- the copy which reported directly to Lawrence. [...]
> "I've had an idea for rearranging my software, and I'd like to know what you think."
> At that Lawrence felt his blood run cold. He hardly understood how things were working as it was; the last thing he needed was more changes. "Yes?"
i.e. AGI is a philosophical problem, not a scaling problem.
Though we understand them little, we know the default mode network and sleep play key roles. That is likely because they aid some universal property of AGI. Concepts we don't understand like motivation, curiosity, and qualia are likely part of the picture too. Evolution is far too efficient for these to be mere side effects.
(And of course LLMs have none of these properties.)
When a human solves a problem, their search space is not random - just like a chess grandmaster's search space of moves is not random.
How our brains are so efficient when problem solving while also able to generate novelty is a mystery.
There is a breakthrough happening. In real time.
Intuitively, it doesn't feel like scaling up to "all things in all fields" is going to produce substantial breakthroughs, if the current best-in-class implementation of the technique by the worlds leading experts returned modest results.
I am trying to answer that for myself. Since every logic is expressible in untyped lambda calculus (as any computation is), you could have a system that just somehow generates terms and beta-reduces them. Even in such a much simpler logic, what are the "interesting" terms?
I have several answers, but my point is, you should simplify the problem and this question has not been answered even under such simple scenario.
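A tiny version of that experiment is easy to write down: generate random closed untyped lambda terms (de Bruijn indices) and normal-order beta-reduce them with a step cap, then stare at which normal forms, if any, look "interesting":

    # Terms are ('var', i) with de Bruijn indices, ('lam', body), ('app', f, x).
    import random

    def shift(t, d, cutoff=0):
        if t[0] == 'var':
            return ('var', t[1] + d) if t[1] >= cutoff else t
        if t[0] == 'lam':
            return ('lam', shift(t[1], d, cutoff + 1))
        return ('app', shift(t[1], d, cutoff), shift(t[2], d, cutoff))

    def subst(t, j, s):
        if t[0] == 'var':
            return s if t[1] == j else t
        if t[0] == 'lam':
            return ('lam', subst(t[1], j + 1, shift(s, 1)))
        return ('app', subst(t[1], j, s), subst(t[2], j, s))

    def step(t):
        """One leftmost-outermost beta step, or None if t is in normal form."""
        if t[0] == 'app':
            f, x = t[1], t[2]
            if f[0] == 'lam':                  # beta-redex: (\.b) x
                return shift(subst(f[1], 0, shift(x, 1)), -1)
            r = step(f)
            if r is not None:
                return ('app', r, x)
            r = step(x)
            return ('app', f, r) if r is not None else None
        if t[0] == 'lam':
            r = step(t[1])
            return ('lam', r) if r is not None else None
        return None

    def normalize(t, limit=100):
        for _ in range(limit):
            n = step(t)
            if n is None:
                return t
            t = n
        return None                            # gave up: possibly divergent

    def random_term(depth, free):
        if depth == 0 or (free > 0 and random.random() < 0.3):
            return ('var', random.randrange(free))
        if random.random() < 0.5:
            return ('lam', random_term(depth - 1, free + 1))
        return ('app', random_term(depth - 1, free), random_term(depth - 1, free))

    # generate closed terms and see which ones even have a normal form
    for _ in range(5):
        t = ('lam', random_term(4, 1))
        print(t, '=>', normalize(t))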
What the LLM companies are currently selling as "reasoning" is mostly RL-based post-training, whereby the model is encouraged to predict tokens (generate reasoning steps) according to similar "goals" seen in the RL training data. This isn't general-case reasoning, but rather just "long horizon" prediction based on the training data. It helps exploit the training data, but isn't going to generate novelty outside of the deductive closure of the training data.
So how do you pick the logic in which to do reasoning? There are "good reasons" to use one logic over another.
LLMs probably learn some combination of logic rules (deduction rules in commonly used logics), but cannot guarantee they will be used consistently (i.e. choose a logic for the problem and stick to it). How do you accomplish that?
And even then reasoning is more than search. If you can reason, you should also be able to reason about more effective reasoning (for example better heuristics to cutting the search tree).
I was talking about the process/mechanism of reasoning - how do our brains appear to implement the capability that we refer to as "reasoning", and by extension how could an AI do the same by implementing the same mechanisms.
If we accept prediction (i.e. use of past experience) as the mechanistic basis of reasoning, then choice of logic doesn't really come into it - it's more just a matter of your past experience and what you have learnt. What predictive rules/patterns have you learnt, both in terms of a corpus of "knowledge" you can bring to bear, but also in terms of experience with the particular problem domain - what have you learnt (i.e. what solution steps can you predict) about trying to reason about any given domain/goal?
In terms of consistent use of logic, and sticking to it, one of the areas where LLMs are lacking is in not having any working memory other than their own re-consumed output, as well as an inability to learn beyond pre-training. With both of these capabilities an AI could maintain a focus (working memory) on the problem at hand (vs suffer from "context rot") and learn consistent, or phased/whatever, logic that has been successful in the past at solving similar problems (i.e predicting actions that will lead to solution).
(See original argument: https://nitter.net/dwarkesh_sp/status/1727004083113128327 )
[0] https://www.youtube.com/watch?v=5QcCeSsNRks&t=1542s