I’m starting to think “The Bitter Lesson” is a clever-sounding way to throw shade at people who failed to nail it on their first attempt. Usually engineers build much more technology than they actually end up needing; the extras shed off with time and experience (and often you end up building it again from scratch). It’s not clear to me that starting with “just build something that scales with compute” would get you closer to the perfect solution, even if, as you get closer to it, you do indeed make it possible to throw more compute at it.
That said, the hand-coded nature of tokenization certainly seems in dire need of a better solution, something that can be learned end to end. And it looks like we are getting closer with every iteration.
jetrink · 36m ago
The Bitter Lesson is specifically about AI. The lesson restated is that over the long run, methods that leverage general computation (brute-force search and learning) consistently outperform systems built with extensive human-crafted knowledge. Examples: Chess, Go, speech recognition, computer vision, machine translation, and on and on.
AndrewKemendo · 1m ago
This is correct; however, I’d add that it’s not just “AI” colloquially - it’s a statement about any two optimization systems that are trying to scale.
So any system that predicts the optimization with a general solver can scale better than heuristic or constrained-space solvers.
Up till recently there have been no general solvers at that scale.
RodgerTheGreat · 59m ago
The bitter lesson says more about medium-term success at publishable results than it does about genuine scientific progress or even success in the market.
QuesnayJr · 36m ago
I'm starting to think that half the commenters here don't actually know what "The Bitter Lesson" is. It's purely a statement about the history of AI research, in a very short essay by Rich Sutton: http://www.incompleteideas.net/IncIdeas/BitterLesson.html It's not some general statement about software engineering for all domains, but a very specific statement about AI applications. It's an observation that the previous generation's careful algorithmic work to solve an AI problem ends up being obsoleted by this generation's brute force approach using more computing power. It's something that's happened over and over again in AI, and has happened several times even since 2019 when Sutton wrote the essay.
tantalor · 23m ago
That essay is actually linked in the lead:
> As it's been pointed out countless times - if the trend of ML research could be summarised, it'd be the adherence to The Bitter Lesson - opt for general-purpose methods that leverage large amounts of compute and data over crafted methods by domain experts
But we're only 1 sentence in, and this is already a failure of science communication at several levels.
1. The sentence structure and grammar are simply horrible
2. This is condescending: "pointed out countless times" - has it?
3. The reference to Sutton's essay is oblique, easy to miss
4. Outside of AI circles, "Bitter Lesson" is not very well known. If you didn't already know about it, this doesn't help.
cheesecompiler · 5h ago
The reverse is possible too: throwing massive compute at a problem can mask the existence of a simpler, more general solution. General-purpose methods tend to win out over time—but how can we be sure they’re truly the most general if we commit so hard to one paradigm (e.g. LLMs) that we stop exploring the underlying structure?
falcor84 · 4h ago
The way I see this, from the explore-exploit point of view, it's pretty rational to put the vast majority of your effort into the one action that has shown itself to bring the most reward, while spending a small amount of effort exploring other ones. Then, if and when that one action is no longer as fruitful compared to the others, you switch more effort to exploring, now having obtained significant resources from that earlier exploration, to help you explore faster.
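The textbook framing of that trade-off is the multi-armed bandit. Here is a minimal epsilon-greedy sketch in Python (the arm names, payoffs, and the 10% exploration rate are invented purely for illustration):

    import random

    true_payoff = {"exploit_current_paradigm": 1.0, "explore_alternatives": 0.2}  # assumed averages
    estimate = {arm: 0.0 for arm in true_payoff}
    pulls = {arm: 0 for arm in true_payoff}
    epsilon = 0.1  # spend ~10% of effort exploring

    for _ in range(1000):
        if random.random() < epsilon:
            arm = random.choice(list(true_payoff))        # explore
        else:
            arm = max(estimate, key=estimate.get)         # exploit the best-looking arm so far
        reward = true_payoff[arm] + random.gauss(0, 0.1)  # noisy observed payoff
        pulls[arm] += 1
        estimate[arm] += (reward - estimate[arm]) / pulls[arm]  # running mean update

    print(estimate, pulls)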
api · 3h ago
CS is full of trivial examples of this. You can use an optimized parallel SIMD merge sort to sort a huge list of ten trillion records, or you can sort it just as fast with a bubble sort if you throw more hardware at it.
The real bitter lesson in AI is that we don't really know what we're doing. We're hacking on models looking for architectures that train well but we don't fully understand why they work. Because we don't fully understand it, we can't design anything optimal or know how good a solution can possibly get.
dan-robertson · 1h ago
Do you have a good reference for SIMD merge sort? The only examples I found are pairwise-merging large numbers of streams, but it seems pretty hard to optimise the late steps where you only have a few streams. I guess you can do some binary-search-in-binary-search to change a merge of 2 similarly sized arrays into two merges of similarly sized arrays into sequential outputs, and so on.
More precisely, I think producing a good fast merge of ca. 5 lists was a problem I didn’t have good answers for, but maybe I was too fixated on a streaming solution and didn’t apply enough tricks.
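For what it's worth, the textbook answer for a handful of sorted runs is a heap-based k-way merge; Python's heapq.merge is a streaming version of that structure (just an illustration of the idea, nothing SIMD about it, and the lists are made up):

    import heapq

    runs = [[1, 4, 9], [2, 3, 10], [0, 5, 6], [7, 8], [11, 12]]  # ~5 sorted runs
    # heapq.merge keeps a small heap of "current head of each run" and yields lazily.
    merged = list(heapq.merge(*runs))
    print(merged)  # [0, 1, 2, ..., 12]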
xg15 · 2h ago
> You can use an optimized parallel SIMD merge sort to sort a huge list of ten trillion records, or you can sort it just as fast with a bubble sort if you throw more hardware at it.
Well, technically, that's not true: The entire idea behind complexity theory is that there are some tasks that you can't throw more hardware at - at least not for interesting problem sizes or remotely feasible amounts of hardware.
I wonder if we'll reach a similar situation in AI where "throw more context/layers/training data at the problem" won't help anymore and people will be forced to care more about understanding again.
svachalek · 1h ago
I think it can be argued that ChatGPT 4.5 was that situation.
jimbokun · 1h ago
And whether that understanding will be done by humans or the AIs themselves.
logicchains · 4h ago
We can be sure via analysis based on computational theory, e.g. https://arxiv.org/abs/2503.03961 and https://arxiv.org/abs/2310.07923 . This lets us know what classes of problems a model is able to solve, and sufficiently deep transformers with chain of thought have been shown to be theoretically capable of solving a very large class of problems.
dsr_ · 4h ago
A random number generator is guaranteed to produce a correct solution to any problem, but runtime usually does not meet usability standards.
Also, solution testing is mandatory. Luckily, you can ask an RNG for that, too, as long as you have tests for the testers already written.
yorwba · 2h ago
Keep in mind that proofs of transformers being able to solve all problems in some complexity class work by taking a known universal algorithm for that complexity class and encoding it as a transformer. In every such case, you'd be better off using the universal algorithm you started with in the first place.
Maybe the hope is that you won't have to manually map the universal algorithm to your specific problem and can just train the transformer to figure it out instead, but there are few proofs that transformers can solve all problems in some complexity class through training instead of manual construction.
cheesecompiler · 4h ago
But this uses the transformer model to justify its own reasoning strength, which might be a blind spot; that was my original point. All the above shows is that transformers can simulate solving a certain set of problems. It doesn't show that they are the best tool for the job.
smeeth · 3h ago
The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.
I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
https://arxiv.org/abs/2402.14903
You tokenize digits right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
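To make that concrete, here is a rough Python sketch of the right-to-left grouping idea (the regex and helper below are illustrative, not the actual pre-tokenization patterns from those papers):

    import re

    def group_digits_right_to_left(text: str) -> str:
        # Split each digit run into groups of up to 3, anchored from the right,
        # so "1234567" becomes "1 234 567" instead of "123 456 7".
        def split_run(m: re.Match) -> str:
            digits = m.group(0)
            groups = []
            i = len(digits)
            while i > 0:
                groups.append(digits[max(0, i - 3):i])
                i -= 3
            return " ".join(reversed(groups))
        return re.sub(r"\d+", split_run, text)

    print(group_digits_right_to_left("1234567"))  # -> "1 234 567"
    print(group_digits_right_to_left("x = 42"))   # -> "x = 42"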
It's a non-deterministic language model; shouldn't we expect mediocre performance in math? It seems like the wrong tool for the job...
rictic · 2h ago
Models are deterministic: they're a mathematical function from sequences of tokens to probability distributions over the next token.
Then a system samples from that distribution, typically with randomness, and there are some optimizations in running them that introduce randomness, but it's important to understand that the models themselves are not random.
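A minimal sketch of that split (the logits and temperature below are made-up numbers): the forward pass deterministically produces a distribution, and randomness only enters when something samples from it.

    import numpy as np

    logits = np.array([2.0, 1.0, 0.2, -1.0])   # deterministic model output for some prefix
    temperature = 0.8
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                       # softmax: still fully deterministic

    rng = np.random.default_rng()              # the stochastic part lives in the sampler
    next_token = rng.choice(len(probs), p=probs)
    # Greedy decoding (argmax over probs) would make the whole pipeline deterministic.
    print(probs, next_token)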
geysersam · 38m ago
The LLMs are deterministic but they only return a probability distribution over following tokens. The tokens the user sees in the response are selected by some typically stochastic sampling procedure.
mgraczyk · 1h ago
This is only ideally true. From the perspective of the user of a large closed LLM, this isn't quite right because of non-associativity, experiments, unversioned changes, etc.
It's best to assume that the relationship between input and output of an LLM is not deterministic, similar to something like using a Google search API.
ijk · 1h ago
And even on open LLMs, GPU instability can cause non-determinism. For performance reasons, determinism is seldom guaranteed in LLMs in general.
drdeca · 3h ago
Deterministic is a special case of not-necessarily-deterministic.
CamperBob2 · 2h ago
We passed 'mediocre' a long time ago, but yes, it would be surprising if the same vocabulary representation is optimal for both verbal language and mathematical reasoning and computing.
To the extent we've already found that to be the case, it's perhaps the weirdest part of this whole "paradigm shift."
resters · 1h ago
Tokenization as a form of preprocessing has the problems the authors mention. But it is also a useful way to think about data vs metadata, and about moving beyond text/image IO into other domains. Ultimately we need symbolic representations of things; sure, they are all ultimately bytes which the model could learn to self-organize, but symbolic representations can be useful when humans interact with the data directly. In a sense, tokens make more aspects of LLM internals "human readable", and models should also be able to learn to overcome the limitations of a particular tokenization scheme.
marcosdumay · 4h ago
Yeah, make the network deeper.
When all you have is a hammer... It makes a lot of sense that a transformation layer that makes the tokens more semantically relevant will help optimize the entire network after it and increase the effective size of your context window. And one of the main immediate obstacles stopping those models from being intelligent is context window size.
On the other hand, the current models already cost something on the order of the median country GDP to train, and they are nowhere close to that in value. The saying that "if brute force didn't solve your problem, you didn't apply enough force" is meant to be taken as a joke.
jagraff · 3h ago
I think the median country GDP is something like $100 billion: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)
Models are expensive, but they're not that expensive.
marcosdumay · 6m ago
$100 billion is the best estimate around of how much OpenAI took in investment to build ChatGPT.
telotortium · 3h ago
LLM model training costs arise primarily from commodity costs (GPUs and other compute as well as electricity), not locally-provided services, so PPP is not the right statistic to use here. You should use nominal GDP for this instead. According to Wikipedia[0], the median country's nominal GDP (Cyprus) is more like $39B. Still much larger than training costs, but much lower than your PPP GDP number.
[0] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nomi...
Maybe it checks out if you don't use 1 year as your timeframe for GDP but the number of days required for training.
kordlessagain · 3h ago
The median country GDP is approximately $48.8 billion, which corresponds to Uganda at position 90 with $48.769 billion.
The largest economy (US) has a GDP of $27.7 trillion.
The smallest economy (Tuvalu) has a GDP of $62.3 million.
The 48 billion number represents the middle point where half of all countries have larger GDPs and half have smaller GDPs.
Nicook · 2h ago
Does anyone even have good estimates for model training costs?
whiplash451 · 3h ago
I get your point, but do we have evidence behind “something on the order of the median country GDP to train”?
Is this really true?
robrenaud · 3h ago
It's not even close.
qoez · 4h ago
The counterargument is that the theoretical minimum is a few McDonald's meals a day's worth of energy, even for the highest-ranked human pure mathematician.
tempodox · 4h ago
It's just that no human would live long on McDonald's meals.
Cheeseburgers are a pretty balanced meal. Low fiber though.
bravetraveler · 3h ago
President in the distance, cursing
andy99 · 3h ago
> inability to detect the number of r's in :strawberry: meme
Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each was one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, that got picked up?
krackers · 2h ago
Until I see evidence that an LLM trained at e.g. the character level _CAN_ successfully "count Rs" then I don't trust this explanation over any other hypothesis. I am not familiar with the literature so I don't know if this has been done, but I couldn't find anything with a quick search. Surely if someone did successfully do it they would have published it.
andy99 · 2m ago
Seems like one could fairly trivially manually tokenize some words into letters instead of the tokens BPE would default to, and then run them through a model with open weights and see if it makes a difference.
If nobody has done that I will try and find time to.
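Something along these lines might work as a starting point, assuming the Hugging Face transformers API; gpt2 is just a placeholder open-weights model, the prompt wording is arbitrary, and a model this small won't answer reliably. The point is the setup:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; any open-weights causal LM works the same way
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def ask(word_ids):
        # Wrap the (possibly re-tokenized) word in a fixed prompt.
        prefix = tok.encode("How many times does the letter r appear in ", add_special_tokens=False)
        suffix = tok.encode("? Answer:", add_special_tokens=False)
        ids = torch.tensor([prefix + word_ids + suffix])
        out = model.generate(ids, max_new_tokens=5, do_sample=False)
        return tok.decode(out[0][ids.shape[1]:])

    default_ids = tok.encode("strawberry", add_special_tokens=False)  # whatever BPE picks
    char_ids = sum((tok.encode(c, add_special_tokens=False) for c in "strawberry"), [])  # one token per letter

    print("default tokenization: ", ask(default_ids))
    print("character tokenization:", ask(char_ids))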
ijk · 1h ago
The math tokenization research is probably closest.
GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )
More recent research:
https://huggingface.co/spaces/huggingface/number-tokenizatio...
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903
https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
https://twitter.com/yuntiandeng/status/1836114401213989366
If anything, I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring use of CoT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained in the right way can do arithmetic, so where are the papers solving the "count the Rs in strawberry" problem?
meroes · 1h ago
I don't buy the token explanation because RLHF work is/was filled with so many "count the number of ___" prompts. There's just no way AI companies pay so much $$$ for RLHF of these prompts when the error is purely in tokenization.
IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves better tokenization meant the problem would be solved. Meanwhile RLHF'ers were/are en masse paid to solve the problem by correcting thousands of these "counting"/perfect-syntax prompts and problems. To me, since RLHF work was being paid to tackle these problems, it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fixing would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and heavy syntax-based problems).
No, the reason models are better at these problems now is RLHF. And before you say "well, now models have learned how to count in general", I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own, and it's not how humans process information, but it can still be useful.
ijk · 3h ago
Well, which is easier:
Count the number of Rs in this sequence: [496, 675, 15717]
Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
Human: Which is the easier of these formulas?
1. x = SQRT(4)
2. x = SQRT(123567889.987654321)
Computer: They're both the same.
[496, 675, 15717] is the GPT-4 representation of the tokens. In order to determine which letters the token represents, it needs to learn the relationship between "str" and [496]. It can learn the representation (since it can spell it out as "S-T-R" or "1. S, 2. T, 3. R" or whatever) but it adds an extra step.
The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?
It seems like the longer context length makes the trade off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when it doesn't have access to individual digits (early Llama math results, for example). Once they changed the digit tokenization, the math performance improved.
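For reference, you can inspect that representation directly with the tiktoken package (the IDs are whatever cl100k_base assigns, matching the GPT-4 example above; other tokenizers split the word differently):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                             # the integer IDs the model actually sees
    print([enc.decode([i]) for i in ids])  # the sub-word chunks they stand for; no letter-level view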
drdeca · 3h ago
Depending on the data types and what the hardware supports, the latter may be harder (in the sense of requiring more operations)? And for a general algorithm bigger numbers would take more steps.
zachooz · 2h ago
A sequence of characters is grouped into a "token." The set of all such possible sequences forms a vocabulary. Without loss of generality, consider the example: strawberry -> straw | ber | ry -> 3940, 3231, 1029 -> [vector for each token]. The raw input to the model is not a sequence of characters, but a sequence of token embeddings each representing a learned vector for a specific chunk of characters. These embeddings contain no explicit information about the individual characters within the token. As a result, if the model needs to reason about characters, for example, to count the number of letters in a word, it must memorize the character composition of each token. Given that large models like GPT-4 use vocabularies with 100k–200k tokens, it's not surprising that the model hasn't memorized the full character breakdown of every token. I can't imagine that many "character level" questions exist in the training data.
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
I'm not sure about what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.
saurik · 2h ago
It isn't at all obvious to me that the LLM can decide to blur their vision, so to speak, and see the tokens as tokens: they don't get to run a program on this data in some raw format, and even if they do attempt to write a program and run it in a sandbox they would have to "remember" what they were given and then regenerate it (well, I guess a tool could give them access to the history of their input, but at that point that tool likely sees characters), rather than to copy it. I am 100% with andy99 on this: it isn't anywhere near as simple as you are making it out to be.
zachooz · 2h ago
If each character were represented by its own token, there would be no need to "blur" anything, since the model would receive a 1:1 mapping between input vectors and individual characters. I never claimed that character-level reasoning is easy or simple for the model; I only said that it becomes theoretically possible to generalize ("potentially learn") without memorizing the character makeup of every token, which is required when using subword tokenization.
Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize vs memorize.
hackinthebochs · 1h ago
Tokens are the most basic input unit of an LLM. But tokens don't generally correspond to whole words, rather sub-word sequences. So Strawberry might be broken up into two tokens 'straw' and 'berry'. It has trouble distinguishing features that are "sub-token" like specific letter sequences because it doesn't see letter sequences but just the token as a single atomic unit. The basic input into a system is how one input state is distinguished from another. But to recognize identity between input states, those states must be identical. It's a bit unintuitive, but identity between individual letters and the letters within a token fails due to the specifics of tokenization. 'Straw' and 'r' are two tokens but an LLM is entirely blind to the fact that 'straw' has one 'r' in it. Tokens are the basic units of distinction; 'straw' is not represented as a sequence of s-t-r-a-w tokens but is its own thing entirely, so they are not considered equal or even partially equal.
As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.
Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:
Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.
When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
svachalek · 1h ago
That explanation is pretty freaky, as it implies a form of consciousness I don't believe LLMs have, I've never seen this explanation before so I'm not sure it's from training, and yet it's probably a fairly accurate description of what's going on.
roywiggins · 55s ago
LLMs will write out explanations that are entirely post-hoc:
> Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
https://www.anthropic.com/news/tracing-thoughts-language-mod...
Didn't tokenization already have one bitter lesson: that it's better to let simple statistics guide the splitting, rather than expert morphology models? Would this technically be a more bitter lesson?
kingstnap · 29m ago
Simple statistics aren't the be-all and end-all. There was a huge improvement in Python coding from fixing the tokenization of indentation in Python code.
Specifically, they made tokens for runs of 4, 8, 12, and 16 spaces (or something like that).
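You can see that kind of indent-aware vocabulary from the outside with tiktoken (cl100k_base is used here only as an example; exactly how space runs get grouped varies by tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for n in (4, 8, 12, 16):
        print(n, enc.encode(" " * n))  # long indentation runs collapse into very few tokens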
empiko · 1h ago
Agreed completely. There is a ton of research into how to represent text, and these simple tokenizers consistently perform at SOTA levels. The bitter lesson is that you should not worry about it that much.
perching_aix · 1h ago
Can't wait for models to struggle with adhering to UTF-8.
Scene_Cast2 · 5h ago
I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.
Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies that we have a maximum 1k degrees of freedom (or rank) on our output. The model is able to pick any single of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
blackbear_ · 5h ago
While the theoretical bottleneck is there, it is far less restrictive than what you are describing, because the number of almost orthogonal vectors grows exponentially with ambient dimensionality. And orthogonality is what matters to differentiate between different vectors: since any distribution can be expressed as a mixture of Gaussians, the number of separate concepts that you can encode with such a mixture also grows exponentially
Scene_Cast2 · 3h ago
I agree that you can encode any single concept and that the encoding space of a single top pick grows exponentially.
However, I'm talking about the probability distribution of tokens.
molf · 4h ago
The key insight is that you can represent different features by vectors that aren't exactly perpendicular, just nearly perpendicular (for example between 85 and 95 degrees apart). If you tolerate such noise then the number of vectors you can fit grows exponentially relative to the number of dimensions.
12288 dimensions (GPT3 size) can fit more than 40 billion nearly perpendicular vectors.
[1]: https://www.3blue1brown.com/lessons/mlp#superposition
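A quick numerical check of that claim (a sketch with a fixed seed and only 200 random vectors, not billions; it illustrates the concentration near 90 degrees rather than proving the bound):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 12288, 200                                  # GPT-3-sized embedding dimension
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # random unit vectors
    cos = v @ v.T                                      # pairwise cosine similarities
    angles = np.degrees(np.arccos(cos[~np.eye(n, dtype=bool)]))
    print(angles.min(), angles.max())                  # typically within a few degrees of 90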
Detecting and preventing unargmaxable outputs in bottlenecked neural networks, Andreas Grivas (2024)
unoti · 5h ago
I imagine there’s actually combinatorial power in there though. If we imagine embedding something with only 2 dimensions x and y, we can actually encode an unlimited number of concepts because we can imagine distinct separate clusters or neighborhoods spread out over a large 2d map. It’s of course much more possible with more dimensions.
incognito124 · 2h ago
(I left academia a while ago, this might be nonsense)
If I remember correctly, that's not true because of the nonlinearities which provide the model with more expressivity. Transformation from 15k to 1k is rarely an affine map, it's usually highly non-linear.
kevingadd · 4h ago
It seems like you're assuming that models are trying to predict the next token. Is that really how they work? I would have assumed that tokenization is an input-only measure, so you have perhaps up to 50k unique input tokens available, but output is raw text or synthesized speech or an image. The output is not tokens so there are no limitations on the output.
anonymoushn · 4h ago
yes, in typical architectures for models dealing with text, the output is a token from the same vocabulary as the input.
citizenpaul · 2h ago
The best general argument I've heard against the bitter lesson is: if the bitter lesson is true, how come we spend so many million man-hours a year tweaking and optimizing software systems all day long? Surely it's easier and cheaper to just buy a rack of servers.
Maybe if you have infinite compute you don't worry about software design. Meanwhile, in the real world...
Not only that, but where did all these compute-optimized solutions come from? Oh yeah, millions of man-hours of optimizing and testing algorithmic solutions. So unless you are some head-in-the-clouds tenured professor, just keep on doing your optimizations and job as usual.
Uehreka · 2h ago
Because the Even Bitterer Lesson is that The Bitter Lesson is true but not actionable. You still have to build the inefficient “clever” system today because The Bitter Lesson only tells you that your system will be obliterated, it doesn’t tell you when. Some systems built today will last for years, others will last for weeks, others will be obsoleted before release, and we don’t know which are which.
I’m hoping someday that dude releases an essay called The Cold Comfort. But it’s impossible to predict when or who it will help, so don’t wait for it.
citizenpaul · 1h ago
Yeah, I get it. I just don't like that it's always sorta framed as a "can't win, don't try" message.
QuesnayJr · 30m ago
The solution to the puzzle is that "the bitter lesson" is about AI software systems, not arbitrary software systems. If you're writing a compiler, you're better off worrying about algorithms, etc. AI problems have an inherent vagueness to them that makes it hard to write explicit rules, and any explicit rules you write will end up being obsolete as soon as we have more compute.
This is all explained in the original essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html