Can modern LLMs count the number of b's in "blueberry"?

32 points | minimaxir | 77 comments | 8/12/2025, 4:12:02 PM | minimaxir.com

Comments (77)

Kwpolska · 22h ago
The Internet has been poking fun at LLMs failing to count letters for many months. Are the AI companies really living in an echo chamber? They should have implemented a thing to count letters (and to search for palindromes, and…), and just have the LLM pass the request to the count_letters function, as they do for many other tasks already…
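A minimal sketch of the kind of count_letters tool being described, in Python. The function name and the schema below are illustrative, not any vendor's actual API; the point is just that the model delegates the counting to deterministic code instead of guessing from tokens.

    def count_letters(word: str, letter: str) -> int:
        """Count case-insensitive occurrences of `letter` in `word`."""
        return word.lower().count(letter.lower())

    # A JSON-schema style tool description the LLM could be handed so it can
    # route "how many b's are in blueberry?" to the function above.
    COUNT_LETTERS_TOOL = {
        "name": "count_letters",
        "description": "Count how many times a letter appears in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["word", "letter"],
        },
    }

    print(count_letters("blueberry", "b"))  # 2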
lgl · 22h ago
So... The LLM only goes into effect after 10000 "old school" if statements?
shakow · 22h ago
Expert systems with extra steps!
andrewla · 22h ago
Why on earth would they do this? This is not a fundamentally useful task; it serves as a measure of the LLM's ability to generalize to tasks outside of its training data and that strain the limits of what it can express.
mingus88 · 21h ago
Because optics matter. They are all ultimately fundraising and competing, and this is terrible PR.

Ask Jeeves from 1997 could answer this question, so tell me why we need to devote a nation-state amount of compute power to feed an “AI” that confidently gets kindergarten-level questions dead-ass wrong?

I have the same kind of question when I watch the AI summary on Google output tokens one by one to give me information less useful than what is right there in the first search result, from Wikipedia (fully sourced, too).

mcphage · 15h ago
If you’re advertising that your new LLM is like a PhD in your pocket, and it fails on a task that a first grader can do, it makes it hard to take your other claims seriously.
DougBTX · 19h ago
> Are the AI companies really living in an echo chamber?

The author tested 12 models, and only one was consistently wrong. More than half were correct 100% of the time.

A better conclusion would be that there’s something in particular wrong with GPT-5 Chat, all the other GPT 5 variants are OK. I wonder what’s different?

Kwpolska · 16h ago
This “one model” was supposed to be the state-of-the-art model, from the company that had a model named “strawberry” in honor of the same problem. And other models struggle with this extremely basic question too.
eulgro · 16h ago
"If you are asked to count things, write a python program to do it". Problem solved.
bachittle · 22h ago
OpenAI definitely tarnished the name of GPT-5 by allowing these issues to occur. It's clearly a smaller model optimized for cost and speed. Compare it to GPT-4.5, which didn't have these errors but was "too expensive for them".

This is why Anthropic's naming system of Haiku, Sonnet, and Opus to represent size is really nice. It prevents this confusion.

NoahZuniga · 22h ago
> This is why Anthropic's naming system of Haiku, Sonnet, and Opus to represent size is really nice. It prevents this confusion.

In contrast to GPT-5, GPT-5 mini and GPT-5 nano?

prophesi · 22h ago
I think it's a valid complaint that the naming scheme for the various GPT-4 models was very confusing. GPT-5 just launched, and doesn't (yet?) have a GPT-5 Turbo or GPT-o5 mini to muddy the waters.
tempodox · 20h ago
In marketing, confusion is a feature, not a bug.
Taek · 22h ago
The problem is that GPT-5 is a smaller model than its predecessors.
csallen · 22h ago
But there's nothing in Claude's naming scheme stopping Claude 5 from being smaller than its predecessors.
hnlmorg · 22h ago
Yeah, one of the main reasons I switched my tooling over to default to Anthropic models, despite starting out with OpenAI for months prior, was that I often switch between different model sizes depending on the complexity of the prompt vs. how quickly I want the result.

I would frequently spend time going back to OpenAI's site to remind myself of their different models. There's no consistency there whatsoever. But with Anthropic it was easy.

If I have to spend 5 minutes picking a model then I might as well do the task myself. So Claude became a natural solution for me.

andrewla · 22h ago
> OpenAI definitely tarnished the name of GPT-5 by allowing these issues to occur

For a certain class of customer maybe that is true.

But the reality is that the fact that this occurs is very encouraging -- they are not micro-optimizing to solve cosmetic problems that serve no functional purpose. They are instead letting these phenomena serve as external benchmarks of a sort to evaluate how well the LLM can work on tasks that are outside of its training data, and outside of what one would expect the capabilities to be.

radicality · 16h ago
Oh wow, I stare at those model names every day, and I only just now after reading your comment realized what “haiku”, “sonnet”, and “opus” imply about the models! Seems super obvious in retrospect but never thought about it!
rootnod3 · 22h ago
I mean yeah, but to many non-native speakers, sonnet and opus don't immediately convey size or complexity of the models.
csallen · 22h ago
I'm a well-educated native English speaker and "haiku", "sonnet", and "opus" don't immediately make me think of their size differences.
rootnod3 · 22h ago
Exactly. That doesn't mean OpenAI has better or worse naming. None of them convey anything out of the gate.

4.large, 4.medium, 4.fast, 4.reasoning, etc., or something similar would probably be better.

hnlmorg · 22h ago
OpenAI easily has worse naming.

Anthropic model names might not immediately conjure up their size and performance, but the names are at least internally consistent. Once you know what Anthropic calls "medium", you know what it is for all model releases.

Whereas OpenAI's naming convention, if you can even call it a "convention", feels absolutely random even to those in the industry.

I do like your proposed naming convention though. It doesn’t sound “cool” so I can’t see any product managers approving it within the AI tech firms. But it’s definitely the best naming convention for models I’ve seen suggested for a while.

hnlmorg · 22h ago
I agree it’s not perfect. But it’s just 3 terms those non-English speakers need to learn. Which is a lot easier than having to remember every OpenAI model name and how it compares to every other one.
rootnod3 · 22h ago
Sure. I wasn't arguing that OpenAI's naming is better. It is way worse. But Anthropic doesn't have a sure-fire naming scheme either.
hnlmorg · 22h ago
But it’s still better. Which is the point myself and the GP are making.

It might not be perfect, but it’s still a hell of a lot better.

rootnod3 · 21h ago
So, three arcane words, barely used in daily conversation, are better than OpenAI's 4, 4o, 5, etc.?
hnlmorg · 21h ago
Yes, because 5 is smaller than 4, and 4o isn't even a number.

Also, some ChatGPT models include “gpt” in the name. Others do not.

I cannot guess what model string I need to pass. Whereas with Anthropic I can. And if I have to look it up each time on OpenAI's website, then it's clearly garbage.

Also, the "arcane, barely used" part of your post is entirely subjective. I get that you want to argue Anthropic's naming is poor to support your point about OpenAI, but you're exaggerating there.

rootnod3 · 21h ago
But wouldn't the same hold true for Anthropic? Claude 5 Sonnet wouldn't imply that it is larger than Claude 4 Sonnet; the numbering there doesn't mean anything. WITHIN a given number, Sonnet, Opus, etc. mean things, maybe, but it doesn't help beyond that.
NegativeLatency · 22h ago
What's so wrong with small, medium, and large?
hnlmorg · 22h ago
What makes you think that I think there’s anything wrong with s/m/l?
iLoveOncall · 22h ago
I think non-native speakers have the ability to remember that one word equals big and another equals medium.

If anything it's a lot less confusing than the awful naming convention from OpenAI up until 5.

rootnod3 · 22h ago
How about just calling it 4.large, 4.medium, etc.? Is it that difficult?

Sure, an opus is supposed to be large, but a sonnet is not defined by size; it's a style of poem. So sonnet and opus mean nothing when compared to each other.

mullingitover · 22h ago
The hilarious thing is that LLMs will happily explain the token problem to you, and will even list out exactly which words will have these problems. Then, in the same breath, when you ask them about letter counts, they will screw up.
Mindless2112 · 22h ago
It seems like this could be easily solved in models that support tool calling by providing them with a tool that takes a token and returns the individual graphemes.

It doesn't seem valuable for the model to memorize the graphemes in each of its tokens.
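A sketch of the kind of tool being suggested; the name split_graphemes is made up here. For plain ASCII words this is just list(); proper grapheme-cluster splitting (emoji, combining marks) would need something like the third-party regex module's \X pattern.

    def split_graphemes(text: str) -> list[str]:
        # Good enough for ASCII words; not a full Unicode grapheme segmenter.
        return list(text)

    print(split_graphemes("blueberry"))
    # ['b', 'l', 'u', 'e', 'b', 'e', 'r', 'r', 'y']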

jandrese · 22h ago
Yes, but are you going to special-case all of these pain points? The whole point of these LLMs is that they learn from training data, not from people coding logic directly. If you do this, people will come up with a dozen new ways in which the models fail; they are really not hard to find. Basically, asking them to do anything novel is at risk of complete failure. The interesting bit is that LLMs tend to work best at "medium difficulty" problems: homework questions, implementing documented APIs, and things like that. Asking them to do anything completely novel tends to fail, as does asking them to do something so trivial that normal humans won't bother even writing it down.
BobbyJo · 22h ago
It makes sense when users ask for information not available in the tokenized values, though. In the abstract, a tool that changes the tokenization of certain context contents when a prompt references them is probably necessary to solve this issue (if you consider it worth solving).
jandrese · 20h ago
It's a fool's errand. The kinds of problems you end up coding for are the ones that are blatantly obvious and ultimately useless except as gotchas for the AI engines. All you're doing is papering over a deficiency of the model without actually solving the problem.
BobbyJo · 15h ago
This is less a deficiency of the model, and more of a deficiency of the encoder IMO. You can consider the encoder part of the model, but I think the semantics of our conversation require differentiating between the two.
Mindless2112 · 22h ago
Tokenization is an inherent weakness of current LLM design, so it makes sense to compensate for it. Hopefully some day tokenization will no longer be necessary.
poemxo · 22h ago
That takes away from the notion that LLMs have emergent intelligent abilities. Right now it doesn't seem valuable for a model to count letters, even though it is a very basic measure of understanding. Will this continue in other domains? Will we be doing tool-calling for every task that's not just summarizing text?
mjr00 · 22h ago
> Will we be doing tool-calling for every task that's not just summarizing text?

spoiler: Yes. This has already become standard for production use cases where the LLM is an external-facing interface: you use an LLM to translate the user's human-language request into a machine-ready, well-defined schema (e.g. a protobuf RPC), do the bulk of the actual work with actual, deterministic code, then (optionally) use an LLM to generate a text result to display to the user. The LLM only acts as a user interface layer.
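A rough sketch of that pattern in Python. Everything here is illustrative: call_llm stands in for whatever client you use, and the schema is a made-up example rather than any real service definition.

    import json

    # Hypothetical machine-ready schema the first LLM call must produce.
    REQUEST_SCHEMA = {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["count_letters"]},
            "word": {"type": "string"},
            "letter": {"type": "string"},
        },
        "required": ["action", "word", "letter"],
    }

    def handle_user_message(message: str, call_llm) -> str:
        # 1. LLM as input interface: free text -> well-defined request.
        structured = json.loads(
            call_llm(f"Convert to JSON matching {json.dumps(REQUEST_SCHEMA)}: {message}")
        )
        # 2. Deterministic code does the actual work.
        if structured["action"] == "count_letters":
            result = structured["word"].lower().count(structured["letter"].lower())
        else:
            raise ValueError(f"unsupported action: {structured['action']}")
        # 3. (Optional) LLM as output interface: result -> prose for the user.
        return call_llm(f"Tell the user the answer is {result}.")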

strbean · 22h ago
How is counting letters a measure of understanding, rather than a rote process?

The reason LLMs struggle with this is that they literally aren't thinking in English. Their input is tokenized before it reaches them. It's like asking a Chinese speaker, "How many Rs are there in the word 草莓?"

poemxo · 20h ago
It shows understanding that words are made up of letters and that they can be counted.

Since tokens are atomic (which I didn't realize earlier), maybe it's still intelligent if it can realize it can extract the result by writing len([b for b in word if b == my_letter]) and decide on its own to return that value.

xigoi · 16h ago
But why doesn’t the LLM reply “I can’t solve this task because I see text as tokens”, rather than give a wrong answer?
strbean · 22h ago
We're up to a gazillion parameters already; maybe the next step is to just ditch the tokenization step and let the LLMs encode the tokenization process internally?
tocs3 · 21h ago
A while back I encoded the message "Can you read this?" in rot13 (or something very similar) and sent it to ChatGPT. It answered back (yes) with a sentence in rot13. Why could it do that but not count letters?
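For reference, rot13 just shifts each ASCII letter by 13 places, so the standard library reproduces the encoding described here:

    import codecs

    encoded = codecs.encode("Can you read this?", "rot_13")
    print(encoded)                           # Pna lbh ernq guvf?
    print(codecs.decode(encoded, "rot_13"))  # Can you read this?

One plausible (unverified) explanation for the asymmetry: rot13 letter pairs are heavily represented in training data, while counting letters requires information the tokenizer has already collapsed.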
AaronAPU · 21h ago
At the VERY beginning I asked it to reverse a hash, and was stunned that it was able to. This was in that first moment when the transformer burst on the scene.

Turned out it just knew that particular string because it was on the internet. I don’t recall exactly but it was probably “hello world” or something like that.

Anyway to your specific question, many of the models call Python tools behind the scenes now.

crtasm · 22h ago
dang · 17h ago
Thanks! Macroexpanded:

GPT-5: "How many times does the letter b appear in blueberry?" - https://news.ycombinator.com/item?id=44832908 - Aug 2025 (321 comments)

jerf · 22h ago
It might help a bit to expand this test to a short phrase. With such a small test the model can be right for the wrong reasons; opening up a bit of space to be wrong in might sharpen the differences.

(My one-off test of the default ChatGPT model, whatever that is, got 'How many b's are there in "Billy Bob beat the record for bounding?"' correct first try, with correct reasoning given.)

dvrj101 · 22h ago
I tried with three different models:

- hf.co/lmstudio-community/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
- qwen3:4b-instruct-2507-fp16
- hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16

ollama run qwen3:4b-instruct-2507-fp16
>>> how many b’s are there in blueberry?

Let's break down the word *"blueberry"* step by step to count how many *b's* are in it.

Step 1: Write out the word clearly: *b l u e b e r r y*

Step 2: Go through each letter one by one:

- b → yes, a *b*
- l → no
- u → no
- e → no
- b → yes, another *b*
- e → no
- r → no
- r → no
- y → no

Step 3: Count the *b's*:

- First *b* at the beginning
- Second *b* after "blue"

So, there are *two b's*.

Final answer: *2* b's in "blueberry".

ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16
>>> how many b’s are there in blueberry?

The word "blueberry" contains two 'b's.

(Fastest, lol - granite models are pretty underrated.)

The r1-distill output was similar to the qwen instruct one, but it double-checked in its thinking part.

Zenst · 22h ago
I gave it a puzzle recently to test it out: "An ex-cop lost his home, his car and his girlfriend. What did he lose first?"

The AI thought it over and concluded that he had lost his job first, until I pointed out that it was not the first thing he had lost - the first thing was his umbilical cord, a far better answer, in the AI's opinion.

Which raises many questions - can an AI disagree with you? Will AI develop solid out-of-the-box thinking as well as in-the-box thinking? Will it grasp applying both for through-the-box thinking and solutions...

After all, we have yet to perfect the teaching of children, so the training of AI has a long way to go; it will come down to quality over quantity, and to deciding what is quality and what is not. Garbage in, garbage out is probably more important today than it ever has been in the history of technology.

wslh · 22h ago
Beyond the hype, and even the recent breakthroughs in LLMs, it would be very valuable to start compiling and summarizing their specific limitations.
fantasizr · 22h ago
Perhaps apocryphal, but: "A single lie discovered is enough to create doubt in every truth expressed." Why trust these tools for the hard things when we can't for the 'easy' ones?
ChrisArchitect · 21h ago
Related:

GPT-5: "How many times does the letter b appear in blueberry?"

https://news.ycombinator.com/item?id=44832908

ninetyninenine · 22h ago
>This is because the transformers are not able to take in or output actual text efficiently. Instead, the text is converted into numerical representations of itself, which is then contextualized to help the AI come up with a logical response. In other words, the AI might know that the tokens “straw” and “berry” make up “strawberry,” but it may not understand that “strawberry” is composed of the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific order. Thus, it cannot tell you how many letters — let alone how many “r”s — appear in the word “strawberry.”

This is a great example. The LLM doesn't know something, but it makes something up in its place. Just because it made something up doesn't mean it's incapable of reasoning.

The thing with LLMs is that they can reason. There's evidence for that. But they can also be creative. And the line between reasoning and creativity at a low level is a bit of a blur, as reasoning is a form of inference, but so is creativity. So when an LLM reasons, gets creative, or hallucinates, it's ultimately doing the same type of thing: inference.

We have mechanisms in our brains that allow us to tell the difference most of the time. The LLM does not. That's the fundamental line. And I feel that because of this we are really close to AGI. A lot of people argue the opposite: they think reasoning is core to intelligence and a separate concept from creativity, and that LLMs lack reasoning entirely. I disagree.

In fact, humans ourselves have trouble separating hallucination from reasoning. Look at religion: it permeates our culture, but it's basically all hallucination that we ultimately mistake for reasoning. Right? Ask any Christian or Muslim - the religion makes rational sense to them! They can't tell the difference.

So the key is to give the LLM the ability to know the difference.

Is there some way to build into the transformer some way to quantify whether something is fact or fiction? Say the answer to a prompt produced an inferred data point that's very far away from any cluster of data. From that, could we derive some metric that quantifies how likely the response is to be grounded in evidence?

Right? The whole thing is one big multidimensional mathematical curve. If the inferred point on the curve is right next to existing data, then it must be more likely to be true. If it's far away in some nether region of the curve, then it's more likely to be false.

If the LLM can be made more self-aware - if we can build this quantitative metric into the network and then use reinforcement learning to make the network less sure about an answer when it's far away from a cluster of training data points - we can likely improve the hallucination problem quite a bit.

Of course, I'm sure this is a blunt instrument, as even false inferences can be very close to existing training data. But at least this gives the LLM some level of self-awareness about how reliable its own answer is.
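A toy numpy sketch of the idea being floated here, purely illustrative and not an established technique from the article or the models discussed: embed the candidate answer, measure its distance to its nearest training embeddings, and treat a large distance as low confidence. The embeddings, dimensions, and threshold below are all made up.

    import numpy as np

    rng = np.random.default_rng(0)
    training_embeddings = rng.normal(size=(10_000, 64))  # stand-in for training data

    def novelty_score(answer_embedding: np.ndarray, k: int = 10) -> float:
        """Mean distance to the k nearest training embeddings; higher means
        further from anything seen in training, i.e. less supporting evidence."""
        dists = np.linalg.norm(training_embeddings - answer_embedding, axis=1)
        return float(np.sort(dists)[:k].mean())

    answer_embedding = rng.normal(size=64)  # stand-in for an embedded response
    if novelty_score(answer_embedding) > 12.0:  # arbitrary threshold for the sketch
        print("low confidence: answer lies far from the training data")
    else:
        print("answer lies near known training data")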

parpfish · 22h ago
Project manager: “great news! Our model can count Rs in strawberry!”

Dev: “What about Bs in blueberry?”

PM: “you’ll need to open a new jira ticket”

bwfan123 · 22h ago
This is likely what literally happens at these companies, i.e. they have teams that monitor Twitter/social media for fails and fix them with data patches.
bigstrat2003 · 22h ago
Which is why I don't trust any of the benchmarks LLM enthusiasts point to when they say "see the model is getting better". I have zero confidence that the AI companies are trying to make the system better, rather than using the measure as a target.
SpaceNoodled · 22h ago
That reminds me of the time I found thread-safety-breaking changes in Intel's custom Android framework that were clearly designed to cheat benchmarks.
qafy · 21h ago
tl;dr: yes, except for GPT-5
rwmj · 22h ago
Can modern humans count the number of tokens in "blueberry"? (You're not allowed to cheat and use https://platform.openai.com/tokenizer)
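For the curious, OpenAI also publishes its tokenizers as the tiktoken library, so you can check locally (assuming tiktoken is installed; cl100k_base is the GPT-4-era encoding, and other models use other encodings):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("blueberry")
    print(len(tokens), [enc.decode([t]) for t in tokens])
    # The count and the split depend on the encoding, which is the point:
    # humans can't eyeball it either.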
ryanmcbride · 22h ago
I see this argument every time this particular failing is brought up, and regardless of the reasoning, it's still completely failing to do something a second grader should be able to do. Until it stops failing in that way, saying "um, technically it's a lot harder than you think" isn't going to win over AI skeptics.
strbean · 22h ago
A Chinese second grader would probably not be able to answer "How many Rs are in the word 草莓?" It's a valid critique.
ryanmcbride · 22h ago
You're right, they probably wouldn't be able to do it, but that child isn't being marketed as an invaluable business tool to assist doctors in diagnosing cancer, or whatever.
strbean · 21h ago
Sure.

A non-English speaking adult wouldn't be able to answer the question either, after the question was translated to their language. Maybe you wouldn't want a non-English speaker helping you to write an acrostic in English. Luckily nobody is marketing LLMs as "great for designing word puzzles" though.

ryanmcbride · 20h ago
You can make comparisons all you want but you can't change the fact that it just plain looks bad.
xigoi · 16h ago
No, but I can recognize my inability to do so instead of making the answer up.
bgwalter · 22h ago
One is an unambiguous question; the other depends on the tokenizing algorithm.
dgfitz · 22h ago
Uh, who cares?

That is literally useless knowledge.

strbean · 22h ago
When's the last time you needed to know the number of Rs in "blueberry"?
crtasm · 22h ago
When's the last time you needed to count the number of things inside a larger thing?
strbean · 21h ago
What's your point? Parent said "That's literally useless knowledge" in response to not knowing the number of tokens in "blueberry". Are we talking about the specific question or the general notion of counting?

The point is that LLMs don't speak / think in English. Asking them about spelling is like asking a Chinese speaker, during a text chat with translation, about English spelling. We can give the Chinese speaker access to an app to translate back to English so they can answer these questions. But they (the LLM) don't currently have access to that.

dgfitz · 21h ago
> The point is that LLMs don't speak / think in English.

They don't think or speak anything at all. They use a statistical model to predict the next most likely token to display in response to a prompt.

strbean · 21h ago
Distinction without a difference.

s/speak \/ think/operate/

If that makes you happy.

dgfitz · 20h ago
There is an enormous difference between claiming an LLM can think or use statistics to generate a token. All of the difference.