> Despite impressive capabilities, large language models have yet to produce a genuine breakthrough. The puzzle is why.
I don't see why this is remotely surprising. Despite all the hoopla, LLMs are not AGI or artificial brains - they are predict-next-word language models. By design they are not built for creativity, but rather quite the opposite: they are designed to continue the input in the way best suggested by the training data - they are essentially built for recall, not creativity.
For an AI to be creative it needs to have innate human/brain-like features such as curiosity driven by novelty (prediction failure), boredom, as well as the ability to learn continuously. IOW, if you want the AI to be creative it needs to be able to learn for itself, not just regurgitate the output of others, and have these innate mechanisms that will cause it to pursue discovery.
ashdksnndck · 6h ago
I’m not sure we can accept the premise that LLMs haven’t made any breakthroughs. What if people aren’t giving the LLM credit when they get a breakthrough from it?
First time I got good code out of a model, I told my friends and coworkers about it. Not anymore. The way I see it, the model is a service I (or my employer) pays for. Everyone knows it’s a tool that I can use, and nobody expects me to apportion credit for whether specific ideas came from the model or me. I tell people I code with LLMs, but I don’t commit a comment saying “wow, this clever bit came from the model!”
If people are getting actual bombshell breakthroughs from LLMs, maybe they are rationally deciding to use those ideas without mentioning the LLM came up with it first.
Anyway, I still think Gwern’s suggestion of a generic idea-lab trying to churn out insights is neat. Given the resources needed to fund such an effort, I could imagine that a trading shop would be a possible place to develop such a system. Instead of looking for insights generally, you’d be looking for profitable trades. Also, I think you’d do a lot better if you have relevant experts to evaluate the promising ideas, which means that more focused efforts would be more manageable. Not comparing everything to everything, but comparing everything to stuff in the expert’s domain.
If a system like that already exists at Jane Street or something, I doubt they are going to tell us about it.
therealpygon · 31m ago
It is hard to accept as a premise because the premise is questionable from the beginning.
Google already reported several breakthroughs as a direct result of AI, using processes that almost certainly include LLMs, including a new solution in math, improved chip designs, etc. DeepMind has AI that predicted millions of protein folds which are already being used in drugs among many other things they do, though yes, not an LLM per se. There is certainly the probability that companies won’t announce things given that the direct LLM output isn’t copyrightable/patentable, so a human-in-the-loop solves the issue by claiming the human made said breakthrough with AI/LLM assistance. There isn’t much benefit to announcing how much AI helped with a breakthrough unless you’re engaged in basically selling AI.
As for “why aren’t LLMs creating breakthroughs by themselves regularly”, that answer is pretty obvious… they just don’t really have that capacity in a meaningful way, based on how they work. The closest example, Google’s algorithmic breakthrough, absolutely was created by a coding LLM; it was effectively achieved through brute force in a well-established domain, but that doesn’t mean it wasn’t a breakthrough. That alone casts doubt on the underlying premise of the post.
nico · 45m ago
> but I don’t commit a comment saying “wow, this clever bit came from the model!”
The other day, Claude Code started adding a small signature to the commit messages it was preparing for me. It said something like “This commit was co-written with Claude Code” and a little robot emoji
I wonder if that just happened by accident or if Anthropic is trying to do something like Apple with the “sent from my iPhone”
This is bordering on conspiracy theory. Thousands of people are getting novel breakthroughs generated purely by LLMs and not a single person discloses such a result? Not even one of the countless LLM-corporation engineers who depend on the billion-dollar IV injections from deluded bankers just to keep surviving has bragged about an LLM pulling off that revolution? Hard to believe.
oidar · 11m ago
Last week, I built something like this you can try in Claude Artifacts (so the token use is cheaper). The problem is that Claude doesn't know if something is important - so everything is important. Play with it here: https://claude.ai/public/artifacts/7e9ad3de-c53c-47d8-8b60-d... - use the hamburger menu to turn on the internal loop (stream). If you improve on this please share.
zhangjunphy · 5h ago
I also hope we have something like this. But sadly, this is not going to work. The reason is this line from the article, which is so much harder than it looks:
> and a critic model filters the results for genuinely valuable ideas.
In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the performance of the model actually degrades in this process: the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.
So the reason we don't hear much about this idea is not that nobody tried it, but that they tried, it didn't work, and people are reluctant to publish something that doesn't work.
imiric · 5h ago
Exactly.
This not only affects a potential critic model; the entire concept of a "reasoning" model is based on the same flawed idea: that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops, without actually arriving at anything.
The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.
yorwba · 4h ago
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process: empirically, the model produces answers that pass the verifier more often if it generates the intermediate steps first.
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
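Roughly, the pipeline looks something like this (a minimal sketch; generate() and the verifier are stand-ins, not anyone's actual training code):

```python
# Minimal sketch of "train on your own outputs, filtered through a verifier".
# The model call and verify() are stubs; real pipelines (rejection-sampling
# fine-tuning, RL with verifiable rewards, etc.) are far more involved.

def generate(model, problem, n_samples=16):
    """Sample n candidate chains-of-thought + answers from the model (stubbed)."""
    return [model(problem) for _ in range(n_samples)]

def verify(problem, candidate) -> bool:
    """Check the final answer against ground truth (math checker, unit tests, ...)."""
    return candidate["answer"] == problem["reference_answer"]

def build_reasoning_dataset(model, problems):
    kept = []
    for problem in problems:
        for cand in generate(model, problem):
            if verify(problem, cand):
                # Keep the full trace: the intermediate steps are what the next
                # round of training reinforces.
                kept.append({"prompt": problem["prompt"],
                             "completion": cand["reasoning"] + cand["answer"]})
    return kept  # fine-tune the same base model on `kept`, then repeat
```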
imiric · 1h ago
Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
danenania · 2h ago
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
imiric · 1h ago
I'm curious: can you link to any tests that prove this?
I don't trust most benchmarks, but if this can be easily confirmed by an apples-to-apples comparison, then I would be inclined to believe it.
danenania · 44m ago
Check out the DeepSeek paper.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
imiric · 1m ago
Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
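For anyone who wants to try the same thing, here is a rough sketch of that A/B test using the Anthropic SDK's extended-thinking toggle (the model name, prompt, and token budget are placeholders; other providers expose similar switches):

```python
# Rough A/B sketch: same model, same prompt, thinking off vs. on.
# Assumes the Anthropic Python SDK and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
prompt = "Find the bug in this function: ..."  # placeholder task

def ask(thinking: bool) -> str:
    kwargs = dict(
        model="claude-sonnet-4-20250514",   # placeholder model name
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    if thinking:
        # Extended thinking: the model emits reasoning blocks before the answer.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 2048}
    resp = client.messages.create(**kwargs)
    # Return only the final text blocks, skipping any thinking blocks.
    return "".join(b.text for b in resp.content if b.type == "text")

print("without thinking:\n", ask(False))
print("with thinking:\n", ask(True))
```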
amelius · 4h ago
But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.
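A minimal sketch of what "reality as the critic" could look like for code - generate, run the tests, feed the failures back (llm() is a stand-in for any model call, and a real setup would sandbox the execution):

```python
# Sketch of using execution as the critic: generate code, run the tests,
# feed failures back into the prompt.
import pathlib
import subprocess
import tempfile
import textwrap

TESTS = textwrap.dedent("""
    from solution import add
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    print("ok")
""")

def run_tests(code: str) -> tuple[bool, str]:
    with tempfile.TemporaryDirectory() as d:
        pathlib.Path(d, "solution.py").write_text(code)
        pathlib.Path(d, "test_solution.py").write_text(TESTS)
        proc = subprocess.run(["python", "test_solution.py"], cwd=d,
                              capture_output=True, text=True, timeout=30)
        return proc.returncode == 0, proc.stdout + proc.stderr

def solve(llm, task: str, max_iters: int = 5):
    feedback = ""
    for _ in range(max_iters):
        code = llm(task + feedback)      # ask the model for a solution
        ok, output = run_tests(code)     # hard reality: does it pass?
        if ok:
            return code
        feedback = f"\n\nYour last attempt failed:\n{output}\nFix it."
    return None
```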
zhangjunphy · 1h ago
I think if we had a good enough simulation of reality, and a fast one - something like an accelerable Minecraft with real-world physics - then this idea might actually work.
But the hard reality we can currently generate efficiently and feed into LLMs usually has a narrow scope. It feels like teaching only textbook math to a kid for several years but nothing else. The LLM mostly overoptimizes in these very specific fields, but the overall performance might even be worse.
dpoloncsak · 55m ago
It's gotta be G-Mod
leetbulb · 34m ago
There will never be a computer powerful enough to simulate that many paperclips and explosive barrels.
Yizahi · 2h ago
Generated code only works because the "test" part (compile/validate/analyze etc.) is completely external and was written before any mass-market LLMs. There is no such external validator for new theorems, books, pictures, text guides etc. You can't just run hard_reality.exe on a generated poem or a scientific paper to deem it "correct". It is only possible with programming languages, and even then not always.
amelius · 1h ago
Science is falsifiable by definition, and writing poems/books is not the kind of problem of interest here.
> There is no such external validator for new theorems
There are formal logic languages that will allow you to do this.
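For example, a proof written in something like Lean is checked by the compiler rather than by a human critic - it either type-checks or it doesn't (a toy example, not tied to anything specific in the thread):

```lean
-- A trivial machine-checked theorem: the proof checker, not an LLM critic,
-- decides whether this proof is valid.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```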
Yizahi · 1h ago
Your proposed approach to science would result in an extremely tiny subset of math, probably theorems proven by automation. And it is questionable whether those theorems would even be useful. A good mathematician with CS experience could probably write a generator of new useless theorems, something along the lines of "is every sequential cube plus the square of a number divisible by a root of the seventh smallest prime multiplied by log n of that number plus blabla...". One can generate such theorems and formally prove or disprove them, yes.
On the other hand, any novel science usually requires deep and wide exploratory research, often involving hard or flawed experimentation or observation. One can train an LLM on a PhD curriculum in astrophysics, then provide that LLM with an API to some new observatory and instruct it to "go prove the cosmological constant". And it will do so, but the result will be generated garbage, because there is no formal way to prove such results. There is no formal way to prove why pharaohs decided to stop building pyramids, despite there being some decent theories. This is science too, you know. You can't formally prove that some gene sequence is responsible for trait X, etc.
I would say the majority of science is not formally provable.
And lastly, you dismiss books/texts, but that is a huge chunk of the intellectual and creative work of humans. Say you are an engineer and you have a CAD model with a list of parts and parameters for a rocket, for example. Now you need to write a guide for it. An LLM can do that; it can generate guide-looking output. The issue is that there is no way to automatically verify it or find issues in it. And there are lots of items like that.
amelius · 1h ago
I think the problem here is that you assume the LLM has to operate isolated from the world, i.e. without interaction. If you put a human scientist in isolation, then you cannot have high expectations either.
yunohn · 1h ago
IME, on a daily basis, Claude Code (supposed SoTA agent) constantly disables and bypasses tests and checks on my codebase - despite following clear prompting guidelines and all the /woo/ like ultrathink etc.
imtringued · 2h ago
That didn't stop actor-critic from becoming one of the most popular deep RL methods.
zhangjunphy · 1h ago
True, and the successful ones usually require an external source of information.
For AlphaGo, it is the simple algorithm that decides who won a game of Go. For GANs, it is the images labeled by humans.
In these scenarios, the critic is the medium that transforms external information into the gradients that optimize the actor, but it is not the direct source of that information.
sartak · 36m ago
From The Metamorphosis of Prime Intellect (1994):
> Among Prime Intellect's four thousand six hundred and twelve interlocking programs was one Lawrence called the RANDOM_IMAGINATION_ENGINE. Its sole purpose was to prowl for new associations that might fit somewhere in an empty area of the GAT. Most of these were rejected because they were useless, unworkable, had a low priority, or just didn't make sense. But now the RANDOM_IMAGINATION_ENGINE made a critical connection, one which Lawrence had been expecting it to make [...]
> Deep within one of the billions of copies of Prime Intellect, one copy of the Random_Imagination_Engine connected two thoughts and found the result good. That thought found its way to conscious awareness, and because the thought was so good it was passed through a network of Prime Intellects, copy after copy, until it reached the copy which had arbitrarily been assigned the duty of making major decisions -- the copy which reported directly to Lawrence. [...]
> "I've had an idea for rearranging my software, and I'd like to know what you think."
> At that Lawrence felt his blood run cold. He hardly understood how things were working as it was; the last thing he needed was more changes. "Yes?"
js8 · 29m ago
I am not sure why we tie this to any concrete AI technology such as LLMs. IMHO the biggest issue we have with AI right now is that we don't know how to philosophically formalize what we want. What is reasoning?
I am trying to answer that for myself. Since every logic is expressible in untyped lambda calculus (as any computation is), you could have a system that just somehow generates terms and beta-reduces them. Even in such a much simpler setting, what are the "interesting" terms?
I have several answers, but my point is that you should simplify the problem, and this question has not been answered even in such a simple scenario.
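For concreteness, the "generate terms and beta-reduce them" machinery can be tiny - here is a sketch with de Bruijn indices (my own illustration; untyped, normal-order, nothing clever):

```python
# Untyped lambda calculus with de Bruijn indices, normal-order beta reduction.
# Terms are tuples: ("var", n) | ("lam", body) | ("app", f, x)

def shift(t, d, cutoff=0):
    """Shift free variable indices in t by d."""
    if t[0] == "var":
        return ("var", t[1] + d) if t[1] >= cutoff else t
    if t[0] == "lam":
        return ("lam", shift(t[1], d, cutoff + 1))
    return ("app", shift(t[1], d, cutoff), shift(t[2], d, cutoff))

def subst(t, j, s):
    """Substitute s for variable j in t (capture-avoiding via shifts)."""
    if t[0] == "var":
        return s if t[1] == j else t
    if t[0] == "lam":
        return ("lam", subst(t[1], j + 1, shift(s, 1)))
    return ("app", subst(t[1], j, s), subst(t[2], j, s))

def step(t):
    """One normal-order beta-reduction step, or None if t is in normal form."""
    if t[0] == "app":
        f, x = t[1], t[2]
        if f[0] == "lam":                      # beta redex: (\ . body) x
            return shift(subst(f[1], 0, shift(x, 1)), -1)
        for i, sub in ((1, f), (2, x)):
            r = step(sub)
            if r is not None:
                return ("app", r, x) if i == 1 else ("app", f, r)
    elif t[0] == "lam":
        r = step(t[1])
        if r is not None:
            return ("lam", r)
    return None

def normalize(t, limit=1000):
    for _ in range(limit):
        r = step(t)
        if r is None:
            return t
        t = r
    return t  # may not terminate in general

ident = ("lam", ("var", 0))                    # \x. x
print(normalize(("app", ident, ident)))        # -> ("lam", ("var", 0))
```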
blueflow · 6h ago
I have not yet seen AI do a critical evaluation of data sources. AI will contradict primary sources if the contradiction is more prevalent in the training data.
Something about the whole approach is bugged.
My pet peeve: "Unix System Resources" as an explanation for the /usr directory is a term that did not exist until the turn of the millennium (rumor is that a c't journalist made it up in 1999), but AI will retcon it into the FHS (5 years earlier) or into Ritchie/Thompson/Kernighan (27 years earlier).
_heimdall · 2h ago
> Something about the whole approach is bugged.
The bug is that LLMs are fundamentally designed for natural language processing and prediction, not logic or reasoning.
We may get to actual AI eventually, but an LLM architecture either won't be involved at all or it will act as a part of the system mimicking the language center of a brain.
A_D_E_P_T · 1h ago
> You are a creative synthesizer. Your task is to find deep, non-obvious,
> and potentially groundbreaking connections between the two following concepts.
> Do not state the obvious. Generate a hypothesis, a novel analogy,
> a potential research question, or a creative synthesis.
> Be speculative but ground your reasoning.
> Concept 1: {Chunk A}
> Concept 2: {Chunk B}
In addition to the other criticisms mentioned by posters ITT, a problem I see is: What concepts do you feed it?
Obviously there's a problem with GIGO. If you don't pick the right concepts to begin with, you're not going to get a meaningful result. But, beyond that, human discovery (in mechanical engineering, at least) tends to be massively interdisciplinary and serendipitous, so that many concepts are often involved, and many of those are necessarily non-obvious.
I guess you could come up with a biomimetics bot, but, besides that, I'm not so sure how well this concept would work as laid out above.
There's another issue in that LLMs tend to be extremely gullible, and swallow the scientific literature and University press releases verbatim and uncritically.
velcrovan · 2h ago
I’m once again begging people to read David Gelernter’s 1994 book “The Muse in the Machine”. I’m surprised to see no mention of it in Gwern’s post, it’s the exact book he should be reaching for on this topic.
In examining the possibility of genuinely creative computing, Gelernter discovers and defends a model of cognition that explains so much about the human experience of creativity, including daydreaming, dreaming, everyday “aha” moments, and the evolution of human approaches to spirituality.
https://uranos.ch/research/references/Gelernter_1994/Muse%20...
The models are currently trained on a static set of human “knowledge” — even if they “know” what novelty is, they aren’t necessarily incentivized to identify it.
In my experience, LLMs currently struggle with new ideas, doubly true for the reasoning models with search.
What makes novelty difficult is that the ideas should be nonobvious (see: the patent system). For example, hallucinating a simpler API spec may be “novel” for a single convoluted codebase, but it isn’t novel in the scope of humanity’s information bubble.
I’m curious if we’ll have to train future models on novelty deltas from our own history, essentially creating synthetic time capsules, or if we’ll just have enough human novelty between training runs over the next few years for the model to develop an internal fitness function for future novelty identification.
My best guess? This may just come for free in a yet-to-be-discovered continually evolving model architecture.
In either case, a single discovery by a single model still needs consensus.
Peer review?
n4r9 · 4h ago
It's a good question. A related question is: "what's an example of something undeniably novel?". Like if you ask an agent out of the blue to prove the Collatz conjecture, and it writes out a proof or counterexample. If that happens with LLMs then I'll be a lot more optimistic about the importance to AGI. Unfortunately, I suspect it will be a lot murkier than that - many of these big open questions will get chipped away at by a combination of computational and human efforts, and it will be impossible to pinpoint where the "novelty" lies.
pilooch · 4h ago
AlphaEvolve and similar systems based on MAP-Elites + DL/LLM + RL appear to be one of the promising paths.
Setting up the MAP-Elites dimensions may still be problem-specific, but this could be learnt unsupervisedly, at least partially.
The way I see LLMs is as a search space within tokens that manipulates broad concepts within a complex and not-so-smooth manifold. These concepts can be refined within other spaces (pixel space, physical spaces, ...).
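For readers unfamiliar with it, MAP-Elites keeps the best solution found so far in each cell of a behaviour grid rather than a single global best - roughly like this generic skeleton (not AlphaEvolve's actual implementation; all the callables are problem-specific stand-ins):

```python
# Generic MAP-Elites skeleton: an archive of "elites", one per behaviour cell.
# fitness(), behaviour(), mutate(), and random_solution() are stand-ins; in an
# AlphaEvolve-style setup, mutate() would be an LLM proposing a code edit.
import random

def map_elites(random_solution, mutate, fitness, behaviour, iters=10_000):
    archive = {}  # cell (tuple of discretized descriptors) -> (fitness, solution)
    for _ in range(iters):
        if archive and random.random() < 0.9:
            _, parent = random.choice(list(archive.values()))
            candidate = mutate(parent)        # vary an existing elite
        else:
            candidate = random_solution()     # occasional fresh start
        cell = behaviour(candidate)           # where it lands in the grid
        f = fitness(candidate)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, candidate)    # keep the best per cell
    return archive
```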
johnfn · 7h ago
It's an interesting premise, but how many people
- are capable of evaluating the LLM's output to the degree that they can identify truly unique insights
- are prompting the LLM in such a way that it could produce truly unique insights
I've prompted an LLM upwards of 1,000 times in the last month, but I doubt more than 10 of my prompts were sophisticated enough to even allow for a unique insight. (I spend a lot of time prompting it to improve React code.) And of those 10 prompts, even if all of the outputs were unique, I don't think I could have identified a single one.
I very much do like the idea of the day-dreaming loop, though! I actually feel like I've had the exact same idea at some point (ironic) - that a lot of great insight is really just combining two ideas that no one has ever thought to combine before.
cantor_S_drug · 5h ago
> are capable of evaluating the LLM's output to the degree that they can identify truly unique insights
I noticed one behaviour in myself. I heard about a particular topic, because it was a dominant opinion in the infosphere. Then LLMs confirmed that dominant opinion (because it was heavily represented in the training) and I stopped my search for alternative viewpoints. So in a sense, LLMs are turning out to be another reflective mirror which reinforces existing opinion.
MrScruff · 3h ago
Yes, it seems like LLMs are system-one thinking taken to the extreme. Reasoning was supposed to introduce some actual logic, but you only have to play with these models for a short while to see that the reasoning tokens are a very soft constraint on the model's eventual output.
In fact, they're trained to please us and so in general aren't very good at pushing back. It's incredibly easy to 'beat' an LLM in an argument since they often just follow your line of reasoning (it's in the model's context, after all).
LourensT · 3h ago
Regardless of accusations of anthropomorphizing, continual thinking seems to be a precursor to any sense of agency, simply because agency requires something to be running.
Eventually LLM output degrades when most of the context is its own output. So should there also be an input stream of experience? The proverbial "staring out the window", fed into the model to keep it grounded and give hooks to go off?
cranium · 4h ago
I'd be happy to spend my Claude Max tokens during the night so it can "ultrathink" some Pareto improvements to my projects. So far, I've mostly seen lateral moves that rewrite code rather than rearchitect/redesign the project.
cs702 · 2h ago
The question is: How do we get LLMs to have "Eureka!" moments, on their own, when their minds are "at rest," so to speak?
The OP's proposed solution is a constant "daydreaming loop" in which an LLM does the following on its own, "unconsciously," as a background task, without human intervention:
1) The LLM retrieves random facts.
2) The LLM "thinks" (runs a chain-of-thought) on those retrieved facts to see if there are any interesting connections between them.
3) If the LLM finds interesting connections, it promotes them to "consciousness" (a permanent store) and possibly adds them to a dataset used for ongoing incremental training.
It could work.
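Concretely, I imagine something like this skeleton, where everything - the fact store, the prompts, the scoring threshold - is a stand-in rather than the OP's actual design:

```python
# Background "daydreaming" loop: pair random facts, ask a generator model for a
# connection, ask a critic model to score it, keep only the rare high scorers.
# llm() is a stand-in for any chat-completion call; facts is any list of notes.
import json
import random

GENERATOR_PROMPT = ("Find a non-obvious, potentially valuable connection between "
                    "these two notes. Be speculative but ground your reasoning.\n"
                    "Note 1: {a}\nNote 2: {b}")
CRITIC_PROMPT = ('Rate the novelty and plausibility of this idea from 0 to 10, '
                 'replying as JSON: {{"score": <int>, "reason": "<why>"}}\n\nIdea: {idea}')

def daydream(llm, facts, steps=1000, threshold=8):
    promoted = []                               # the "conscious" store
    for _ in range(steps):
        a, b = random.sample(facts, 2)          # 1) retrieve random facts
        idea = llm(GENERATOR_PROMPT.format(a=a, b=b))        # 2) look for a connection
        review = json.loads(llm(CRITIC_PROMPT.format(idea=idea)))  # assumes valid JSON back
        if review["score"] >= threshold:        # 3) promote the rare keepers
            promoted.append({"facts": (a, b), "idea": idea, "review": review})
    return promoted  # candidates for permanent storage / incremental training
```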
epcoa · 2h ago
Step 3 has been shown not to work over and over again; the "find interesting connections" part is the hand-wavy magic at this time. LLMs alone don't seem to be particularly adept at it either.
cs702 · 18m ago
Has this been tried with reinforcement learning (RL)? As the OP notes, it is plausible from a RL perspective that such a bootstrap can work, because it would be (quoting the OP) "exploiting the generator-verifier gap, where it is easier to discriminate than to generate (eg laughing at a pun is easier than making it)." The hit ratio may be tiny, so doing this well would be very expensive.
dr_dshiv · 1h ago
Yes! I’ve been prototyping dreaming LLMs based on my downloaded history—and motivated by biomimetic design approaches. Just to surface ideas to myself again.
amelius · 3h ago
Humans daydream about problems when they think a problem is interesting. Can an LLM know when a problem is interesting and thereby prune the daydream graph?
zild3d · 2h ago
> The puzzle is why
The feedback loop on novel/genuine breakthroughs is too long and the training data is too small.
Another reason is that there's plenty of incentive to go after the majority of the economy, which relies on routine knowledge and maybe judgement; only a narrow slice actually requires novel/genuine breakthroughs.
OtherShrezzing · 5h ago
Google's effort with AlphaEvolve shows that the Daydream Factory approach might not be the big unlock we're expecting. They spent an obscene amount of compute to discover a marginal improvement over the state of the art in a very narrow field. Hours after Google published the paper, mathematicians pointed out that their SOTA algorithms underperformed techniques published 50 years ago.
Intuitively, it doesn't feel like scaling up to "all things in all fields" is going to produce substantial breakthroughs, if the current best-in-class implementation of the technique by the world's leading experts returned modest results.
khalic · 5h ago
Ugh, again with the anthropomorphizing. LLMs didn't come up with anything new because _they don't have agency_ and _do not reason_...
We're looking at our reflection and asking ourselves why it isn't moving when we don't
yorwba · 4h ago
If you look at your reflection in water, it may very well move even though you don't. Similarly, you don't need agency or reasoning to create something new; random selection from a large number of combinations is enough (correct horse battery staple).
Of course random new things are typically bad. The article is essentially proposing to generate lots of them anyway and try to filter for only the best ones.
RALaBarge · 48m ago
I agree that brute forcing is a method, and it is how nature does it. The problem would still be the same: how would it or other LLMs know if the idea is novel and interesting?
Given access to unlimited data, LLMs likely could spot novel trends that we can't, but they still can't judge the value of creating something unique that they have never encountered before.
RALaBarge · 48m ago
Yet.
amelius · 4h ago
> anthropomorphizing
Gwern isn't doing that here. They say: "[LLMs] lack some fundamental aspects of human thought", and then investigate that.
NitpickLawyer · 6h ago
Something I haven't seen explored, but that I think could perhaps help, is to somehow introduce feedback regarding the generation into the context, based on things that are easily computed with other tools (like perplexity). In "thinking" models we see a lot of emergent behaviour like "perhaps I should, but wait, this seems wrong", etc. Perhaps adding some signals at regular? intervals could help in surfacing the correct patterns when they are needed.
There's a podcast I listened to ~1.5 years ago, where a team used GPT2, further trained on a bunch of related papers, and used snippets + perplexity to highlight potential errors. I remember them having some good accuracy when analysed by humans. Perhaps this could work at a larger scale? (a sort of "surprise" factor)
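The "surprise factor" could be as crude as per-sentence perplexity under a small reference model - a rough sketch (GPT-2 here, with an arbitrary threshold and chunking):

```python
# Flag "surprising" sentences by their perplexity under a small reference model.
# High perplexity means the claim is unusual relative to the model's training
# data: possibly an error, possibly something genuinely novel worth a look.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss        # mean cross-entropy per token
    return torch.exp(loss).item()

draft = [
    "Water boils at 100 degrees Celsius at sea level.",
    "The /usr directory originally stood for Unix System Resources.",
]
for sentence in draft:
    ppl = perplexity(sentence)
    flag = "  <-- check this" if ppl > 80 else ""   # arbitrary threshold
    print(f"{ppl:8.1f}  {sentence}{flag}")
```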
kookamamie · 2h ago
> The puzzle is why
The breakthrough isn't in their datasets.
apples_oranges · 7h ago
If the breakthrough comes, most if not all links on HN will be to machine generated content. But so far it seems that the I in current AI is https://www.youtube.com/watch?v=uY4cVhXxW64 ..
zwaps · 7h ago
Wasn't this already implemented in some agents?
I seem to remember hearing about it in several podcasts.
aredox · 6h ago
Oh, in the middle of "AI is PhD-level" propaganda (just check Google News to see this is not a strawman argument), some people finally admit in passing "no LLM has ever made a breakthrough".
I agree there's an equivocation going on for "PhD level" between "so smart, it could get a PhD" (as in come up with and publish new research and defend its own thesis) and "it can solve quizzes at the level that PhDs can".
(See original argument: https://nitter.net/dwarkesh_sp/status/1727004083113128327 )
washadjeffmad · 48m ago
Services that make this claim are paying people with PhDs to ask their models questions and then provide feedback on the responses with detailed reasoning.
sneak · 1h ago
Seems like an easy hypothesis to quickly smoke test with a couple hundred lines of script, a wikipedia index, and a few grand thrown at an API.
guelo · 4h ago
In a recent talk [0] Francois Chollet made it sound like all the frontier models are doing Test-Time Adaptation, which I think is a similar concept to the Dynamic Evaluation that Gwern says is not being done. Apparently Test-Time Adaptation encompasses several techniques, some of which modify model weights and some that don't, but they are all about on-the-fly learning.
[0] https://www.youtube.com/watch?v=5QcCeSsNRks&t=1542s
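For reference, classic dynamic evaluation is just continuing to take gradient steps on the text the model has already seen - a toy sketch (model choice, learning rate, and chunking are all arbitrary stand-ins):

```python
# Toy sketch of dynamic evaluation: keep nudging the weights toward the text
# the model has just processed, so it adapts to the document/session on the fly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.SGD(model.parameters(), lr=1e-4)

def adapt_on(chunk: str) -> float:
    """One gradient step on a chunk the model has already consumed."""
    ids = tok(chunk, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()

# Stream a long document; later chunks should get easier as the model adapts.
for chunk in ["First section of some long report ...",
              "Second section, reusing the report's jargon ..."]:
    print("loss:", round(adapt_on(chunk), 3))
```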
Variations on increasing compute and filtering results aside, the only way out of this rut is another breakthrough as big as, or bigger than, transformers. A lot of money is being spent on rebranding practical use-cases as innovation because there's a severe lack of innovation in this sphere.