Why do LLMs have emergent properties?

64 points by Bostonian on 5/8/2025, 8:07:00 PM | johndcook.com ↗ | 79 comments

Comments (79)

anon373839 · 3h ago
I remain skeptical of emergent properties in LLMs in the way that people have used that term. There was a belief 3-4 years ago that if you just make the models big enough, they magically acquire intelligence. But since then, we’ve seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they’ve been trained on, but they don’t generalize well beyond it. Also, we have seen models that are 50-100x smaller now exhibit the same “emergent” capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.
andy99 · 3h ago
Yes, deep learning models only interpolate; they essentially represent an effective way of storing data-labeling effort. That doesn't mean they're not useful, just not what tech-adjacent promoters want people to think.
john-h-k · 3h ago
> Yes, deep learning models only interpolate

What do you mean by this? I don’t think the understanding of LLMs is sufficient to make this claim

andy99 · 3h ago
An LLM is a classifier; there is a lot of research into how deep learning classifiers work, and I haven't seen it contradicted when applied to LLMs.
kevinsync · 2h ago
My hot take is that what some people are labeling as "emergent" is actually just "incidental encoding" or "implicit signal" -- latent properties that get embedded just by nature of what's being looked at.

For instance, if you have a massive tome of English text, a rather high percentage of it will be grammatically correct (or close to it), syntactically well-formed, and understandable, because humans who speak good English took the time to write it and wrote it the way other humans would expect to read or hear it. This, by its very nature, embeds "English language" knowledge due to sequence, word choice, normally-hard-to-quantify expressions (colloquial or otherwise), etc.

When you consider source data from many modes, there's all kinds of implicit stuff that gets incidentally written in. For instance, real photographs of outer space or the deep sea would only show humans in protective gear, not swimming next to the Titanic. Conversely, you won't see polar bears eating at Chipotle, or giant humans standing on top of mountains.

There's a statistical "this showed up enough in the training data to loosely confirm its existence" / "can't say I ever saw that, so let's just synthesize it" aspect to the embeddings that one person could interpret as "emergent intelligence," while another could just as convincingly say it's probabilistic output mostly in line with what we expect to receive. Train the LLM on absolute nonsense instead and you'll receive exactly that back.

gond · 3h ago
Interesting. Is there a quantitative threshold to emergence anyone could point at with these smaller models? Tracing the thoughts of a large language model is probably the only way to be sure, or is it?
zmmmmm · 2h ago
This seems superficial and doesn't really get to the heart of the question. To me it's not so much about bits and parameters as about a more interesting, fundamental question: whether pure language itself is enough to encompass and encode higher-level thinking.

Empirically, we observe that an LLM trained purely to predict the next token can do things like solve complex logic puzzles it has never seen before. Skeptics claim that the network has actually seen at least analogous puzzles before and all it is doing is translating between them. However, the novelty of what can be solved is very surprising.

Intuitively it makes sense that at some level, intelligence itself becomes a compression algorithm. For example, you could learn separately how to solve every puzzle ever presented to mankind, but that would take a lot of space. At some point it's more efficient to just learn "intelligence" itself and then apply that to the problem of predicting the next token. Once you do that, you can stop trying to store an infinite database of parallel heuristics and instead focus the parameter space on learning "common heuristics" that apply broadly across the problem space, and then apply those to every problem.

The question is, at what parameter count and volume of training data does the situation flip to favoring "learning intelligence" rather than storing redundant domain-specialised heuristics? And is it really happening? I would have thought just looking at the activation patterns could tell you a lot, because if common activations happen for entirely different problem spaces, then you can argue that the network has to be learning common abstractions. If not, maybe it's just doing really large-scale redundant storage of heuristics.
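
As a rough sketch of what that activation comparison could look like (purely hypothetical; it assumes you've already captured per-layer activations from a model, e.g. via forward hooks, and the random arrays below are only stand-ins for real data):

  import numpy as np

  def top_unit_overlap(acts_a, acts_b, k=100):
      """Jaccard overlap between the k most active hidden units for two task domains.

      acts_a, acts_b: (num_prompts, hidden_dim) mean-pooled activations captured
      from one transformer layer while running prompts from each domain.
      """
      mean_a = np.abs(acts_a).mean(axis=0)
      mean_b = np.abs(acts_b).mean(axis=0)
      top_a = set(np.argsort(mean_a)[-k:])
      top_b = set(np.argsort(mean_b)[-k:])
      return len(top_a & top_b) / len(top_a | top_b)

  # Stand-in data: in practice these would be real activations for, say,
  # logic-puzzle prompts vs. poetry prompts.
  rng = np.random.default_rng(0)
  acts_logic = rng.normal(size=(64, 4096))
  acts_poetry = rng.normal(size=(64, 4096))
  print(f"top-unit overlap: {top_unit_overlap(acts_logic, acts_poetry):.2f}")

High overlap across unrelated domains would hint at shared abstractions; near-zero overlap would look more like siloed heuristics.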

disambiguation · 35m ago
Good take, but while we're invoking intuition, something is clearly missing in the fundamental design, given that real brains don't need to consume all the world's literature before demonstrating intelligence. There's some missing piece w.r.t. self-learning and sense-making. The path to emergent reasoning you lay out is interesting and might happen anyway as we scale up, but the original idea was to model these algorithms in our own image in the first place - I wonder if we won't discover that missing piece first.
gyomu · 3h ago
The reasoning in the article is interesting, but this struck me as a weird example to choose:

> “The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = “Write a short story that resonates with the social mood of the present time and is a runaway hit”

Framing a capability as something that is objectively measurable (“able to perform math on the 12th grade level”, “able to write a coherent, novel text without spelling/grammar mistakes”) makes sense within the context of what the author is trying to demonstrate.

But the social proof aspect (“is a runaway hit”) feels orthogonal to it? Things can be runaway hits for social factors independently of the capability they actually represent.

creer · 37s ago
That it seems hard (impossible), or at least not intuitively clear how to go about it, to us humans, is what makes the question interesting. In a way. The other questions are interesting, but a different class of interesting. At any rate, both are good for this question.
Retric · 2h ago
It’s not about being “a runaway hit” as an objective measurement; it’s about the things an LLM would need to achieve before that was possible. At first, AI scores on existing tests seemed like a useful metric. However, tests designed for humans make specific assumptions that don’t apply to these systems, making such tests useless.

AI is very good at gaming metrics, so it’s difficult to list criteria where achieving them is meaningful. A hypothetical coherent novel without spelling/grammar mistakes could in effect be a copy of some existing work that shows up in its corpus; a hit, however, requires more than a reskinned story.

interstice · 3h ago
The bag-of-heuristics thing is interesting to me. Is it not conceivable that a NN of a certain size, trained only on math problems, would be able to wire up what amounts to a calculator? And if so, could that form part of a wider network, or is I/O from completely different modalities not really possible in this way?
juancn · 4h ago
I always wondered if the specific dimensionality of the layers and tensors has a specific effect on the model.

It's hard to explain, but higher-dimensional spaces have weird topological properties; not all of them behave the same way, and some things are perfectly doable in one set of dimensions while in others they just plain don't work (e.g. applying surgery to turn one shape into another).

etrautmann · 3h ago
How is topology specifically related to emergent capabilities in AI?
lordnacho · 4h ago
What seems a bit miraculous to me is, how did the researchers who put us on this path come to suspect that you could just throw more data and more parameters at the problem? If the emergent behavior doesn't appear for moderate sized models, how do you convince management to let you build a huge model?
TheCoreh · 4h ago
This is perhaps why it took us this long to get to LLMs: the underlying math and ideas were (mostly) there, and even if the Transformer as an architecture wasn't ready yet, it wouldn't surprise me if throwing sufficient data/compute at a worse architecture also produced comparable emergent behavior.

There needed to be someone willing to try going big at an organization with sufficient idle compute/data just sitting there, not a surprise it first happened at Google.

hibikir · 4h ago
But we got here step by step, as other interesting use cases came up using somewhat less compute: image recognition, early forms of image generation, AlphaGo, AlphaZero for chess. All earlier forms of deep neural networks that are much more reasonable to train than a top-of-the-line LLM today, but that seemed expensive at the time. And ultimately a lot of this also comes from the hardware advancements and the math advancements. If you took classes on neural networks in the 1990s, you'd notice that they mostly talked about 1 or 2 hidden layers, and not all that much about the math to train large networks, precisely because of how daunting the compute costs were for anything that wasn't a toy. But then came video card hardware, and improvements in using it to do gradient descent, making going past silly 3-layer networks somewhat reasonable.

Every bet makes perfect sense after you consider how promising the previous one looked, and how much cheaper the compute was getting. Imagine being tasked to train an LLM in 1995: All the architectural knowledge we have today and a state-level mandate would not have gotten all that far. Just the amount of fast memory that we put to bear wouldn't have been viable until relatively recently.

pixl97 · 1h ago
> and how much cheaper the compute was getting.

I remember back in the 90s how scientists/data analysts were saying that we'd need exaflop-scale systems to tackle certain problems. I remember thinking how foreign that number was when small systems were running maybe tens of megaFLOPS. Now we have systems starting to reach zettaFLOPS (FP8, so not an exact comparison).

educasean · 4h ago
Al-Khwarizmi · 3h ago
While GPT-2 didn't show emergent abilities, it did show improved accuracy on various tasks with respect to GPT-1. At that point, it was clear that scaling made sense.

In other words, no one expected GPT-3 to suddenly start solving tasks without training as it did, but it was expected to be useful as an incremental improvement to what GPT-2 did. At the time, GPT-2 was seeing practical use, mainly in text generation from some initial words - at that point the big scare was about massive generation of fake news - and also as a model that one could fine-tune for specific tasks. It made sense to train a larger model that would do all that better. The rest is history.

prats226 · 4h ago
I don't think model sizes increased suddenly. There might not be emergent properties for certain tasks at smaller scales, but there was certainly improvement at a slower rate. Competition to improve those metrics, albeit at a slower pace, led to a steady increase in model sizes and, by chance, to emergence the way it's defined in the paper?
gessha · 2h ago
There’s that Sinclair quote:

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."

andy99 · 5h ago
I didn't follow entirely on a fast read, but this confused me especially:

  The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks
I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, as in there are way more parameters than needed to fit the data (granted, the training data set is bigger than the model itself, but the point is that the same loss can be achieved with many different combinations of parameters).

And I think of this under-specification as the reason neural networks extrapolate cleanly and thus generalize.
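
A toy illustration of that under-specification (my own sketch, nothing to do with any particular LLM): a model where only the product of two weights is constrained by the data, so an entire family of weight settings reaches exactly the same loss.

  import numpy as np

  # Toy model: y_hat = w1 * w2 * x. Only the product w1*w2 is pinned down by the
  # data, so infinitely many (w1, w2) pairs achieve exactly the same loss.
  rng = np.random.default_rng(0)
  x = rng.normal(size=100)
  y = 3.0 * x                      # ground truth: the product should equal 3

  def loss(w1, w2):
      return float(np.mean((w1 * w2 * x - y) ** 2))

  for w1, w2 in [(1.0, 3.0), (2.0, 1.5), (-6.0, -0.5)]:
      print(f"w1={w1:+.1f}  w2={w2:+.1f}  loss={loss(w1, w2):.6f}")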

vonneumannstan · 4h ago
This doesn't seem right, and most people recognize that 'neurons' encode multiple features: https://transformer-circuits.pub/2022/toy_model/index.html
waynecochran · 4h ago
Since gradient descent converges on a local minimum, would we expect different emergent properties with different initializations of the weights?
jebarker · 4h ago
Not significantly, as I understand it. There's certainly variation in LLM abilities with different initializations but the volume and content of the data is a far bigger determinant of what an LLM will learn.
waynecochran · 4h ago
So there is an "attractor" that different initializations end up converging on?
andy99 · 3h ago
Different initializations converge to different places, e.g. https://arxiv.org/abs/1912.02757

For LLMs (as with other models), many local optima appear to support roughly the same behavior. This is the idea of the problem being under-specified, i.e. many more unknowns than equations, so there are many ways to get the same result.
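
Continuing the same kind of toy under-specified model as above (again just a sketch with made-up numbers): gradient descent from different random starts lands on different weight pairs, but they all sit on the curve w1*w2 = 3 and implement the same function.

  import numpy as np

  def fit(seed, steps=5000, lr=0.01):
      """Gradient descent on y_hat = w1 * w2 * x from a random initialization."""
      rng = np.random.default_rng(seed)
      x = rng.normal(size=200)
      y = 3.0 * x
      w1, w2 = rng.normal(), rng.normal()
      for _ in range(steps):
          err = w1 * w2 * x - y
          g = 2 * np.mean(err * x)                  # gradient w.r.t. the product w1*w2
          w1, w2 = w1 - lr * g * w2, w2 - lr * g * w1
      return w1, w2, float(np.mean((w1 * w2 * x - y) ** 2))

  for seed in range(3):
      w1, w2, final_loss = fit(seed)
      print(f"seed {seed}: w1={w1:+.3f}  w2={w2:+.3f}  w1*w2={w1 * w2:.3f}  loss={final_loss:.2e}")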

cratermoon · 5h ago
Alternate view: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004

"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."

K0balt · 5h ago
A decent thought-proxy for this: powered flight.

An aircraft can approach powered flight without achieving it. With a given amount of thrust and given aerodynamic characteristics, the aircraft has an effective weight dynamic_weight = static_weight - x, where x is a combination of the aerodynamic characteristics and the amount of thrust applied.

In no case where dynamic_weight > 0 will the aircraft fly, even though it exhibits characteristics of flight, i.e. the transfer of aerodynamic forces to counteract gravity.

So while it progressively exhibits characteristics of flight, it is not capable of any kind of flight at all until the critical point of dynamic_weight<0. So the enabling characteristics are not “emergent”, but the behavior is.
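
Written out as a trivial sketch (made-up numbers), the point is that the enabling quantity varies continuously while the behavior flips at a threshold:

  static_weight = 1000.0                      # arbitrary units
  for lift in [200.0, 600.0, 900.0, 1001.0, 1200.0]:
      dynamic_weight = static_weight - lift   # lift plays the role of x above
      print(f"lift={lift:6.0f}  dynamic_weight={dynamic_weight:7.1f}  flies={dynamic_weight < 0}")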

I think this boils down to a matter of semantics.

scopemouthwash · 4h ago
“Thought-proxy”?

I think the word you’re looking for is “analogy”.

Al-Khwarizmi · 3h ago
The continuous metrics the paper uses are largely irrelevant in practice, though. The sudden changes appear when you use metrics people actually care about.

To me the paper is overhyped. Knowing how neural networks work, it's clear that there are going to be underlying properties that vary smoothly. This doesn't preclude the existence of emergent abilities.

jebarker · 4h ago
Yes, this paper is under-appreciated. The point is that we as humans decide what constitutes a given task we're going to set as a bar, and it turns out that statistical pattern matching can solve many of those tasks to a reasonable level (we also get to define "reasonable") when there's sufficient scale of parameters and data; but that tip-over point is entirely arbitrary.
foobarqux · 5h ago
The author himself explicitly acknowledges the paper but then incomprehensibly ignores it ("Even so, many would like to understand, predict, and even facilitate the emergence of these capabilities."). It's like saying "some say [foo] doesn't exist, but even so many would like to understand [foo]". It's incoherent.
autoexec · 5h ago
No point in letting facts get in the way of an entire article I guess.
moffkalast · 5h ago
That has been a problem with most LLM benchmarks. Any test that's rated in percentages tends to be logarithmic: getting from, say, 90% to 95% is not a linear 5% improvement but probably more like a 2x or 10x improvement in practical terms, since the metric is already nearly maxed out and only the extreme edge cases, which are much harder to master, remain.
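
One way to make that concrete (my own framing, not a standard benchmark convention): look at the error rate, or the log-odds, instead of raw accuracy; 90% to 95% halves the number of errors.

  import math

  def error_reduction(acc_before, acc_after):
      """Factor by which the error rate shrinks when accuracy improves."""
      return (1 - acc_before) / (1 - acc_after)

  for a, b in [(0.50, 0.60), (0.90, 0.95), (0.99, 0.995)]:
      print(f"{a:.3f} -> {b:.3f}: errors cut {error_reduction(a, b):.2f}x, "
            f"log-odds {math.log(a / (1 - a)):+.2f} -> {math.log(b / (1 - b)):+.2f}")
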
me3meme · 5h ago
Metaphor: finding a path from an initial point to a destination in a graph. As the number of parameters increases, one can expect the LLM to be able to remember how to go from one place to another, and in the end it should be able to find a long path. This can be an emergent property, since with fewer parameters the LLM would not be able to find the correct path. Now one has to find what kind of problems this metaphor is a good model of.


Michelangelo11 · 5h ago
How could they not?

Emergent properties are unavoidable for any complex system and probably exponentially scale with complexity or something (I'm sure there's an entire literature about this somewhere).

One good instance is spandrels in evolutionary biology. The Wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)


samirillian · 3h ago
*Do
nthingtohide · 4h ago
What do you think about this analogy?

A simple process produces the Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane or dense point grid in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) which are generated after pre-training. In the case of a 2D plane, the closeness between two points is determined by our numerical representation scheme. But in the case of embeddings, we learn the 2D grid of words (playing the role of points) by looking at how the words are used in the corpus.

The following is a quote from Yuri Manin, an eminent Mathematician.

https://www.youtube.com/watch?v=BNzZt0QHj9U Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.

I have a related idea which I picked up from somewhere which mirrors the above observation.

When we see beautiful fractals generated by simple equations and iterative processes, we give importance only to the equations, not to the Cartesian grid on which the process operates.
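
For reference, the "simple process" half of the analogy really is this small; a standard escape-time Mandelbrot sketch, nothing LLM-specific:

  import numpy as np

  # Iterate z <- z^2 + c over a dense grid of complex values c; points that never
  # escape belong to the Mandelbrot set. The rule is trivial, the grid does the rest.
  xs = np.linspace(-2.0, 0.6, 80)
  ys = np.linspace(-1.2, 1.2, 40)
  for y in ys:
      row = ""
      for x in xs:
          c, z = complex(x, y), 0j
          for _ in range(30):
              z = z * z + c
              if abs(z) > 2:
                  break
          row += " " if abs(z) > 2 else "#"
      print(row)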

pixl97 · 51m ago
> the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge.

Or biologically, DNA/RNA behaves in a similar manner.

OtherShrezzing · 5h ago
It feels like this can be tracked with addition. Humans expect “can do addition” to be a binary skill, because humans either can or cannot add.

LLMs approximate addition. For a long time they would produce hot garbage. Then after a lot of training, they could sum 2 digit numbers correctly.

At this point we’d say “they can do addition”, and the property has emerged. They have passed a binary skill threshold.
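
A toy way to see how gradual improvement can look like a skill switching on (my own sketch, echoing the "metric mirage" point elsewhere in the thread): if each digit is predicted correctly with probability p, the chance of getting an entire n-digit sum exactly right is roughly p**n, which hugs zero for a long time and then climbs steeply.

  # Per-digit accuracy p improves smoothly; exact-match accuracy on an n-digit sum
  # is roughly p**n, which reads as a sudden "can do addition" threshold.
  for p in [0.80, 0.90, 0.95, 0.99, 0.999]:
      exact = {n: p ** n for n in (2, 5, 10)}
      print(f"p={p:.3f}  2-digit {exact[2]:.3f}  5-digit {exact[5]:.3f}  10-digit {exact[10]:.3f}")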

skydhash · 5h ago
Or you could cobble up a small electronic circuit or a mechanical apparatus and have something that can add numbers.
unsupp0rted · 5h ago
Isn't "emergent properties" another way to say "we're not very good at understanding the capabilities of complex systems"?
bunderbunder · 4h ago
I've always understood it more to mean, "phenomena that happen due to the interactions of a system's parts without being explicitly encoded into their individual behavior." Fractal patterns in nature are a great example of emergent phenomena. A single water molecule contains no explicit plan for how to get together with its buddies and make spiky hexagon shapes when they get cold.

And I've always understood talking about emergence as if it were some sort of quasi-magical and unprecedented new feature of LLMs to mean, "I don't have a deep understanding of how machine learning works." Emergent behavior is the entire point of artificial neural networks, from the latest SOTA foundation model all the way back to the very first tiny little multilayer perceptron.

andy99 · 4h ago
Emergence in the context of LLMs is really just us learning that "hey, you don't actually need intelligence to do <task>; turns out it can be done using a good enough next-token predictor." We're basically learning what intelligence isn't as we see some of the things these models can do.

I always understood this to be the initial framing, e.g. in the "Language Models are Few-Shot Learners" paper, but then it got flipped around.

dmd · 4h ago
Or maybe you need intelligence to be a good enough next token predictor. Maybe the thing that “just” predicts the next token can be called “intelligence”.
Workaccount2 · 4h ago
The challenge there would be showing that humans have this thing called intelligence. You yourself are just outputting ephemeral actions that rise out of your subconscious. We have no idea what that system feeding our output looks like (except it's some kind of organic neural net) and hence there isn't really a basis for discriminating what is and isn't intelligent besides "if it solves problems, it has some degree of intelligence"
PaulDavisThe1st · 3h ago
To return to an old but still good analogy ...

If you want to understand how birds fly, the fact that planes also fly is near useless. While a few common aerodynamic principles apply, both types of flight are so different from each other that you do not learn very much about one from the other.

On the other hand, if your goal is just "humans moving through the air for extended distances", it doesn't matter at all that airplanes do not fly the way birds do.

And then, on the generated third hand, if you need the kind of tight quarters maneuverability that birds can do in forests and other tangled spaces, then the way our current airplanes fly is of little to no use at all, and you're going to need a very different sort of technology than the one used in current aircraft.

And on the accidentally generated fourth hand, if your goal is "moving very large mass over very long distance", then the mechanisms of bird flight are likely to be of little utility.

The fact that two different systems can be described in a similar way (e.g. "flying") doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.

pixl97 · 26m ago
> doesn't by itself tell you that they are working in remotely the same way or capable of the same sorts of things.

I believe any intelligence that reaches 'human level' should be capable of nearly the same things with tool use; the fact that it accomplishes the goal in a different way doesn't matter, because the system's behavior is generalized. Hence the term (artificial) general intelligence. Two different general intelligences built on different architectures should be able to converge on similar solutions (for example, solutions based on lowest energy states) because they are operating in the same physical realm.

An AGI and an HGI should be able to have convergent solutions for fast air travel, ornithopters, and drones.

Workaccount2 · 3h ago
I think that many birds get too sensitive when discussing what "flight" means, heh
andy99 · 2h ago
A better bird analogy would be if we didn't understand at all how flight worked, and then started throwing rocks and had pseudo-intellectuals saying "how do we know that isn't all that flight is, we've clearly invented artificial flight".
prats226 · 4h ago
If we use some metric as a proxy for intelligence, emergence simply means a non-linear, sudden change in that metric?
HPsquared · 4h ago
Or more generally "fitting a model to data".
esafak · 4h ago
Not quite. Complex systems can exhibit macroscopic properties not evident at microscopic scales. For example, birds self organize into flocks, an emergent phenomenon, visible to the untrained eye. Our understanding of how it happens does not change the fact that it does.

There is a field of study for this called statistical mechanics.

https://ganguli-gang.stanford.edu/pdf/20.StatMechDeep.pdf
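
The flocking example is easy to reproduce with the classic boids rules (a minimal sketch; all the constants are arbitrary): each bird reacts only to nearby birds, yet global alignment emerges.

  import numpy as np

  rng = np.random.default_rng(0)
  N = 100
  pos = rng.uniform(0, 50, (N, 2))
  vel = rng.normal(0, 1, (N, 2))

  def limit_speed(v, max_speed=2.0):
      speed = np.linalg.norm(v, axis=1, keepdims=True)
      return np.where(speed > max_speed, v / speed * max_speed, v)

  for step in range(201):
      diff = pos[None, :, :] - pos[:, None, :]          # diff[i, j] = pos[j] - pos[i]
      dist = np.linalg.norm(diff, axis=2)
      neigh = (dist < 10.0) & (dist > 0)
      counts = np.maximum(neigh.sum(axis=1, keepdims=True), 1)

      cohesion = (neigh[:, :, None] * diff).sum(axis=1) / counts * 0.01
      mean_vel = (neigh[:, :, None] * vel[None, :, :]).sum(axis=1) / counts
      alignment = (mean_vel - vel) * 0.05
      too_close = (dist < 2.0) & (dist > 0)
      separation = -(too_close[:, :, None] * diff).sum(axis=1) * 0.05

      vel = limit_speed(vel + cohesion + alignment + separation)
      pos = (pos + vel) % 50.0                          # wrap-around world

      if step % 50 == 0:
          headings = vel / np.linalg.norm(vel, axis=1, keepdims=True)
          order = np.linalg.norm(headings.mean(axis=0))  # 1.0 = all headings agree
          print(f"step {step:3d}  alignment order = {order:.2f}")

Nothing in the update rules mentions "a flock", but the alignment order climbs as the simulation runs.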

HPsquared · 4h ago
Very interesting crossover!
anonymars · 4h ago
See also: stigmergy
theobreuerweil · 5h ago
I understood it to mean properties of large-scale systems that are not properties of their components. Like in thermodynamics: zooming in to the molecular level, you can reverse time without anything seeming off. Suddenly you get a trillion molecules and things like entropy appear, and time is not reversible at all.
gond · 4h ago
Not at all. Here is an analogy: A car is a system which brings you from point A to B. No part of the car can bring you from point A to B. Not the seats, the wheels, not the frame, not even the motor. If you put the motor on a table, it won’t move one bit. The car, as a system, however does. The emergent property of a car, seen as a system, is that it brings you from one location to another.

A system is the product of the interaction of its parts. It is not the sum of the behaviour of its parts. If a system does not exhibit some form of emergent behaviour, it is not a system, but something else. Maybe an assembly.

unsupp0rted · 4h ago
That sounds like semantics.

If putting together a bunch of X's in a jar always makes the jar go Y, then is Y an emergent property?

Or we need to better understand why a bunch of X's in a jar do that, and then the property isn't emergent anymore, but rather the natural outcome of well-understood X's in a well-understood jar.

gond · 3h ago
Ah. Not semantics, that is cybernetics and systems theory.

As in your example: if a bunch of x in a jar leads to the jar tipping over, it is not emergent. That's just cause and effect. The problem to start with is that the jar containing x is not even a system in the first place; emergence as a concept is not applicable here.

There may be a misunderstanding on your side of the term emergence. Emergence does not equal non-understanding or some spooky-hooky force coming from the unknown. We understand the functions of the elements of a car quite well. The emergent behaviour of a car was intentionally brought about by massive engineering.

Reductionism does not lead to an explaining-away of emergence.

cluckindan · 4h ago
tinix · 3h ago
haha cool!

turned the car into a motorcycle.

here's an article with a photo for anyone who's interested: https://archive.is/y96xb

seliopou · 5h ago
Yes, it’s a cop-out and smells mostly of dualism: https://plato.stanford.edu/entries/properties-emergent/
jfengel · 4h ago
It's more specific than that. Most complex systems just produce noise. A few complex systems produce behavior that we perceive as simple. This is surprising, and gets the name "emergent".
tunesmith · 5h ago
It just means they haven't modeled the externalities. A plane on the ground isn't emergent. In the air it is, at least until you perfectly model weather, which you can't do, so its behavior is emergent. But I think a plane is also a good comparison because it shows that you can manage it; we don't have to perfectly model weather to still have fairly predictable air travel.
scopemouthwash · 4h ago
The authors haven’t demonstrated emergence in LLMs. If I write a piece of code and it does what I programmed it to do, that’s not emergence. LLMs aren’t doing anything unexpected yet. I think that’s the smell test, because emergence is still subjective.
pixl97 · 21m ago
Are you writing the neural networks for LLMs?
IshKebab · 4h ago
They weren't trying to demonstrate it. They were explaining why it might not be surprising.
teekert · 4h ago
Perhaps we should ask: Why do humans pick arbitrary points on a continuum beyond which things are labeled “emergent”?


chasing0entropy · 5h ago
There are eerie similarities between radiographs of LLM inference output and mammalian EEGs. I would be surprised not to see latent and surprisingly complicated characteristics become apparent as context and recursive algorithms grow larger.
aeonik · 4h ago
What graphs are you talking about? I've never heard of LLM radiographs, and my searches are coming up empty.
RigelKentaurus · 3h ago
I'm not a techie, so perhaps someone can help me understand this: AFAIK, no theoretical computer scientist predicted emergence in AI models. Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect? It's like Lord Kelvin saying that heavier-than-air flying machines are impossible a decade before the Wright brothers' first flight.
chasd00 · 3h ago
I’m not even clear on the AI definition of “emergent behavior”. The AI crowd mixes in terms and concepts from biology to describe things that are dramatically simpler. For example, using “neuron” to really mean a formula calculation or function. Neurons are a lot more than that, and not even completely understood to begin with; however, developers use the term as if they have neurons implemented in software.

Maybe it’s a variation of the “assume a frictionless spherical horse” problem but it’s very confusing.

xboxnolifes · 3h ago
Has emergent behavior in other theoretical fields ever been predicted prior to being observed?
pixl97 · 12m ago
I believe it's been predicted in traffic planning and highway design, and tested via simulation and in field experiments. Using self-driving cars to modify traffic behavior and decrease traffic jams is a field of study these days.
tinix · 3h ago
emergent behavior is common in all large systems.

it doesn't seem that surprising to me.

lutusp · 3h ago
> Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect?

Consider the story of Charles Darwin, who knew evolution existed, but who was so afraid of public criticism that he delayed publishing his findings so long that he nearly lost his priority to Wallace.

For contrast, consider the story of Alfred Wegener, who aggressively promoted his idea of (what was later called) plate tectonics, but who was roundly criticized for his radical idea. By the time plate tectonics was tested and proven, Wegener was long gone.

These examples suggest that, in science, it's not the claims you make, it's the claims you prove with evidence.