Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens

83 points by nyrikki | 32 comments | 5/23/2025, 4:13:43 PM | arxiv.org ↗

Comments (32)

ngruhn · 2h ago
Man that "Unreasonable Effectiveness of ..." pattern is getting a bit overused. With the original paper [1] you could still say that there really is some deeply philosophical mystery. But they now slap that on everything.

[1] https://en.m.wikipedia.org/wiki/The_Unreasonable_Effectivene...

gipp · 50m ago
Engineering bloggers' love of parroting the titles of famous papers/articles (unreasonable effectiveness..., all you need is..., ...considered harmful, etc.) has always been lightly annoying to me
jorvi · 11m ago
With software engineering, every single thing in the 2010s had "syntactic sugar" and "sane defaults". I still get a slight blood pressure spike whenever someone uses either of those terms.
airza · 14m ago
It’s just not that common for the same person to have serious engineering chops and writing abilities.
jvanderbot · 1h ago
In this case it's probably more a pun (intentional or not I guess) about "reasonless" or "unreason"
EGreg · 36m ago
Would you go further, and say that Unreasonable Effectiveness… is considered harmful?
MeteorMarc · 1h ago
What is not unreasonable about intermediate tokens without reason? See the abstract.
godelski · 51m ago
It's also worth pointing out that Wigner's (position) paper[0] is really about something that would sound silly today. He's arguing that we should use math to drive physics. Today, many people think these are indistinguishable things and you'll get into arguments about math being invented or discovered. But Wigner is talking about how mathematics provides us with a framework where we can drive physics forward through theory instead of relying purely upon experimentation to poke and prod the universe.

It is rather "unreasonable" to think we can explore the world simply through pen and paper, from the comfort of a chair. You'd think you'd need to go out and touch grass, but incredibly this is not necessary.

  | The first point is that the enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and that there is no rational explanation for it. Second, it is just this uncanny usefulness of mathematical concepts that raises the question of the uniqueness of our physical theories. 
Which is exactly why a lot of these other things are overused. Hamming's seems like an extension or corollary[1] and I even think Norvig's (Halevy's) is highly appropriate[2]. It is "unreasonable" to think these things would be effective.

  -------------------------------------
With this paper?

I think it is fine. It is being used in a similar way to Wigner, with similar context.

I can see two camps. One has always interpreted the CoT as analogous to a model's internal dialogue, while the other has always thought there's a much larger gap between the manipulations within the latent representations and what has been decoded, which need not be strongly aligned.[3] To the former, the results here would be shocking, while to the latter it is "yes, and?" Clearly they're addressing the former camp. There were plenty of people that Wigner did not need to convince.

I'm of the latter camp[4], and I'm happy people are not just asserting but demonstrating. Honestly, I'm even frequently upset when works get dismissed because they "demonstrate something we already knew" but no one had ever actually demonstrated. The proofs and evidence are more important than the answer. Quite often we're highly certain about results, but they are difficult to even evidence (let alone prove). I mean, it would be quite silly to dismiss a proof that P != NP, even though the vast majority of us have long been convinced that this is the relationship we'll end up with. Yet no one's done it.

  -------------------------------------
[0] https://web.archive.org/web/20210212111540/http://www.dartmo...

[1] https://math.dartmouth.edu/~matc/MathDrama/reading/Hamming.h...

[2] https://static.googleusercontent.com/media/research.google.c...

[3] Both camps can be further broken down too. Lots of nuances and opinions here and the lines really get fuzzy as we try to make it more accurate. I don't want to pretend there's a hard defining line, but the distinction helps the discussion and I think is reasonably accurate enough. Let me know if you think it is a gross mischaracterization.

[4] I can expand more why this side seems "obvious" to me. But a warning, you can probably guess I'm not good at being terse.

[Note]: I'd even go so far as to say we should revisit Wigner's argument around AI. I'm certain mathematics can be and will be "unreasonably effective." But not enough time has been dedicated to formulating the right type of math to use. We really do have to invent a new kind here. This may sound weird to non-mathematicians, but even physics uses multiple kinds of mathematics. The operations, fields, and algebras you use in one part may not be appropriate in another part. That's okay. But we don't have a TOE yet either, and bringing all this together is a critical part of finding one.

valine · 4h ago
I think it’s helpful to remember that language models are not producing tokens, they are producing a distribution of possible next tokens. Just because your sampler picks a sequence of tokens that contain incorrect reasoning doesn't mean a useful reasoning trace isn’t also contained within the latent space.

It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens. High dimensional latents attend to other high dimensional latents. The final layer of a decoder only transformer has full access to entire latent space of all previous latents, the same latents you can project into a distribution of next tokens.

woadwarrior01 · 3h ago
> Just because your sampler picks a sequence of tokens that contain incorrect reasoning doesn't mean a useful reasoning trace isn’t also contained within the latent space.

That's essentially the core idea in Coconut[1][2], to keep the reasoning traces in a continuous space.

[1]: https://arxiv.org/abs/2412.06769

[2]: https://github.com/facebookresearch/coconut
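
The loop is easy to sketch (my paraphrase of the idea, not the repo's code; `embed`, `transformer`, `unembed`, `sample` are hypothetical stand-ins for the usual decoder-only pieces): during the "thought" steps the last hidden state is appended directly as the next input, so nothing is collapsed to a token until the answer.

  # Coconut-style "continuous thought" decoding, heavily simplified.
  def generate_with_continuous_thoughts(prompt_ids, n_thoughts, n_answer_tokens,
                                        embed, transformer, unembed, sample):
      inputs = [embed(t) for t in prompt_ids]   # start from ordinary token embeddings

      for _ in range(n_thoughts):
          hidden = transformer(inputs)          # hidden[-1]: latent at the last position
          inputs.append(hidden[-1])             # feed the latent straight back in,
                                                # skipping unembed/sampling entirely

      answer = []
      for _ in range(n_answer_tokens):
          hidden = transformer(inputs)
          tok = sample(unembed(hidden[-1]))     # only now collapse latents to tokens
          answer.append(tok)
          inputs.append(embed(tok))
      return answer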

jacob019 · 3h ago
So you're saying that the reasoning trace represents sequential connections between the full distribution rather than the sampled tokens from that distribution?
valine · 3h ago
The lower dimensional logits are discarded, the original high dimensional latents are not.

But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer not just the final one, with the most complex reasoning concentrated in the middle layers.

jacob019 · 2h ago
I don't think that's accurate. The logits actually have high dimensionality, and they are intermediate outputs used to sample tokens. The latent representations contain contextual information and are also high-dimensional, but they serve a different role--they feed into the logits.
valine · 2h ago
The dimensionality I suppose depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection to go from latents to logits.

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.
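
To put made-up numbers on it (a toy numpy sketch, not any specific model): the unembedding matrix has rank at most d_model, so however many entries the logit vector has, it spans no more directions than the latents did. The projection can't add information, let alone reasoning.

  import numpy as np

  rng = np.random.default_rng(0)
  d_model, vocab = 64, 1000                  # made-up sizes
  W_unembed = rng.normal(size=(d_model, vocab))

  H = rng.normal(size=(200, d_model))        # 200 final-layer latents
  logits = H @ W_unembed                     # shape (200, 1000): "bigger" vectors...

  # ...but rank(W_unembed) <= d_model, so all the logit vectors live in a
  # subspace of dimension at most d_model.
  print(np.linalg.matrix_rank(logits))       # prints 64, not 200 or 1000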


modeless · 4h ago
> we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it

This is the interesting part. We've probably all had the experience where the model is going off the rails during the thinking process but somehow spits out the right answer at the end. Apparently the reasoning doesn't even need to be correct during training?

I guess it suggests to me that the reason CoT helps is that the model gets more compute to think internally, not that the words it produces are meaningful. I'm surprised nobody has come up with a good scheme for adaptive compute per token yet. Maybe we can skip CoT entirely.
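
By adaptive compute I mean something vaguely like this (a made-up early-exit sketch, not a description of any existing scheme): spend layers on a token only until an intermediate read-out is already confident.

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def entropy(p):
      p = p[p > 0]
      return -(p * np.log(p)).sum()

  def forward_adaptive(h, layers, unembed, max_entropy=1.0):
      """Run decoder layers one at a time and stop early once the token
      distribution read off the current latent is low-entropy. `layers` and
      `unembed` are hypothetical callables standing in for a real model."""
      depth = 0
      for layer in layers:
          depth += 1
          h = layer(h)
          probs = softmax(unembed(h))
          if entropy(probs) < max_entropy:   # confident: stop spending compute
              return probs, depth
      return probs, depth                    # hard tokens use every layer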

AlexCoventry · 47m ago
No, the words are meaningful to it. It's effectively using the CoT text as a "scratch space" for intermediate steps it can't calculate on one iteration through the transformer. These papers give examples of how it works:

- https://physics.allen-zhu.com/part-2-grade-school-math/part-...

- https://physics.allen-zhu.com/part-3-knowledge/part-3-3
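
To make the "scratch space" point concrete (a toy example I'm making up, not one from those papers): a product like 23 * 47 is hard to do in a single forward pass, but each written-out intermediate step is easy, and every later step can condition on the earlier results because they're sitting in the context.

  Q: 23 * 47 = ?
  CoT: 23 * 40 = 920; 23 * 7 = 161; 920 + 161 = 1081
  A: 1081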

modeless · 29m ago
I mean, this theory is directly contradicted by the paper under discussion. If you want to assert this then you need to be arguing why the paper is wrong.
trehalose · 4h ago
> We've probably all had the experience where the model is going off the rails during the thinking process but somehow spits out the right answer at the end. Apparently the reasoning doesn't even need to be correct during training?

How do we know if the reasoning was correct or not? Do we have more information about what the model was thinking besides just what it says it was thinking?

rickyhatespeas · 3h ago
It's definitely not explicitly writing out everything it's "thinking": if you consider all the dimensions of the latent space that are involved, that can't really be expressed in a sentence.

CoT builds on existing prompt-engineering techniques by adding reinforcement learning, essentially forcing the models to build their own CoT prompt. So it's not literally what the model is thinking, but all indications are that it does guide the reasoning abilities of LLMs through the output distribution.

kelseyfrog · 4h ago
> I'm surprised nobody has come up with a good scheme for adaptive compute per token yet.

I have one, I just don't have the time or money to research it :(

golol · 3h ago
Post it let's go.
istjohn · 3h ago
Uh... hmmm... uhhh... ummm...
theptip · 1h ago
So is the interpretation here something like "CoT tokens are actually neuralese"? They do boost performance, so the model must be stashing some intermediate reasoning outputs there. But perhaps not using the literal human meaning of those tokens?
timhigins · 4h ago
This paper seems to focus on highly algorithmic/puzzle-like problems, which are not the typical application domain of LLMs, using a <500M parameter model. So my hunch is "reasoning" works much better for math, coding, factual recall, and writing tasks that most LLMs actually deal with.
throwawaymaths · 3h ago
why is it unreasonable that giving the LLM a spot to think, collate long-range attention, and summarize, without the pressure of building a meaningful next token so quickly, would result in higher effectiveness?


nullc · 5h ago
Even when you train AI on human language, the tokens can have "subtext" that is only legible to the AI. And, unfortunately, it's not even legible to the AI in ways that it could ever explain to us.

It's no different than how in English we can signal that a statement is related to a kind of politics or that it's about sex through particular word and phrase choice.

Training for reasoning should be expected to amplify the subtext, since any random noise in the selection that by chance is correlated with the right results will get amplified.

Perhaps you could try to dampen this by training two distinct models for a while, then swapping their reasoning for a while before going back. But sadly, distinct models may still end up with similar subtexts due to correlations in their training data. Maybe ones with very distinct tokenizations would be less likely to do so.

nihakue · 4h ago
This is such a bonkers line of thinking, I'm so intrigued. So a particular model will have an entire 'culture' only available or understandable to itself. Seems kind of lonely. Like some symbols might activate together for reasons that are totally incomprehensible to us, but make perfect sense to the model. I wonder if an approach like the one in https://www.anthropic.com/research/tracing-thoughts-language... could ever give us insight into any 'inside jokes' present in the model.

I hope that research into understanding LLM qualia eventually allows us to understand e.g. what it's like to [be a bat](https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F)

nullc · 3h ago
In some sense it's more human than a model trained with no RL and which has absolutely no exposure to its own output.

We have our own personal 'culture' too; it's just less obvious because it's tied up with our own hidden state. If you go back and read old essays that you wrote you might notice some of it: ideas and feelings (maybe smells?) that are absolutely not explicitly in the text immediately come back to you, stuff that no one, or maybe only a spouse or very close friend, might think of.

I think it may be very hard to explore hidden subtext because the signals may be almost arbitrarily weak and context dependent. The bare model may need only a little nudge to get to the right answer, and then you have this big wall of "reasoning" where each token could carry a very small amount of subtext that cumulatively adds up to a lot and pushes things in the right direction.

candiddevmike · 4h ago
IMO this is why natural language will always be a terrible _interface_--because English is a terrible _language_ where words can have wildly different meanings that change over time. There's no ambiguity of intention with traditional UX (or even programming languages).
nullc · 2h ago
It can happen more or less no matter what language the model uses, so long as it's reinforcement trained. It's just that in English we have the illusion of thinking we understand the meaning.

An example of this is toki pona, a minimalist constructed human language that is designed to only express "positive thinking". Yet it is extremely easy to insult people in toki pona: e.g. sina toki li pona pona pona pona. (you are speaking very very very very well).

To be free of a potential subtext side channel, there can be essentially no equivalent outputs.

pona-a · 1h ago
Can't you just say "sina toki ike suli a." (you are speaking very bad <exclamation>)? Just because it doesn't have official swearwords like most natural languages doesn't mean you can only express "positive thinking".
naasking · 3h ago
I wonder if this finding would hold for something like Meta's Large Concept Models.