AI LLMs can't count lines in a file
19 points | sha-69 | 34 comments | 6/4/2025, 12:31:34 AM
Was starting to mess around with the latest LLMs and found that they're not great at counting lines in files.
I gave Gemini 2.5 Flash a Python script and asked it to tell me what was at line 27, and it consistently got it wrong. I tried repeatedly to prompt it the right way, but had no luck.
https://g.co/gemini/share/0276a6c7ef20
Is this something that LLM bots are still not good at? I thought they had gotten past the "strawberry" counting problems.
Here's the raw file: https://pastebin.com/FBxhZi6G
Imagine you spoke perfect English, but you learned how to write English using Mandarin characters, basically using the closest-sounding Mandarin characters to write in English. Then someone asks you how many letter o's are in the sentence "Hello how are you?". Well, you don't read using English characters, you read using Mandarin characters, so you read it as "哈咯,好阿优?" because, using Mandarin characters, that's the closest-sounding way to spell "Hello how are you?"
So now if someone asks you how many letter o's are in "哈咯,好阿优?", you don't really know... you are familiar conceptually that the letter o exists, you know that if you spelled the sentence in English it would contain the letter o, and you can maybe make an educated guess about how many letter o's there are, but you can't actually count out how many letter o's there are because you've never seen actual English letters before.
The same thing goes for an LLM: it doesn't see characters, it only sees tokens. It is aware that characters exist, and it can reason about their existence, but it can't see them, so it can't really count them out either.
Be careful to structure your query so that each "hello" sits in its own token, because you could inadvertently write it so that the first or last "hello" gets chunked together with the text just before or just after it. You can check how your text gets split with OpenAI's tokenizer [1].
[1] https://platform.openai.com/tokenizer
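You can also see the effect programmatically with OpenAI's tiktoken library. A minimal sketch (the exact token boundaries depend on which encoding/model you pick):

    # Minimal sketch: how a sentence looks to the model after tokenization.
    # Requires the tiktoken package; boundaries differ between encodings.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Hello how are you?"

    tokens = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens]

    print(pieces)                   # e.g. ['Hello', ' how', ' are', ' you', '?']
    print("tokens:", len(tokens))   # a handful of token IDs...
    print("o's:", text.count("o"))  # ...with the character-level structure hidden

The model receives the token IDs, not the characters inside them, which is why character and line counting is guesswork unless it offloads the job to a tool.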
I find there's a lot of low-hanging fruit and claims about LLMs that are easily testable, but for which no benchmarks exist. E.g. the common claim that LLMs are "unable" to multiply isn't fully accurate: someone did a proper benchmark and found a gradual decline in accuracy as digit length increases past 10 digits by 10 digits. I can't find the specific paper, but I also remember there was a way of training a model on increasingly hard problems at the "frontier" (GRPO-esque?) that fixed this issue, giving very high accuracy up to 20 digits by 20 digits.
I know there's a lot of theoretical CS work on deriving upper bounds on these models from a circuit-complexity point of view, but as architectures are revised all the time, it's hard to tell how much is still relevant. Nothing beats having a concrete, working example of a model that correctly parses CFGs as a rebuttal to the claim that models just repeat their training data.
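A benchmark like that is cheap to throw together. A sketch of the idea (ask_model is a hypothetical placeholder for whatever API you would actually call; the digit sizes and trial counts are arbitrary):

    # Measure multiplication accuracy as operand length grows.
    # ask_model is a hypothetical stand-in for a real LLM API call.
    import random

    def ask_model(prompt):
        raise NotImplementedError("plug in your LLM API call here")

    def accuracy(n_digits, trials=50):
        correct = 0
        for _ in range(trials):
            a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
            b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
            reply = ask_model(f"Compute {a} * {b}. Reply with only the number.")
            if reply.strip().replace(",", "") == str(a * b):
                correct += 1
        return correct / trials

    # for n in (2, 5, 10, 15, 20):
    #     print(n, accuracy(n))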
Someone where I work was trying to get an LLM to evaluate responses to an internal multiple-choice quiz (A, B or C), putting people into different buckets based on a combination of the total number of correct responses and having answered specific questions correctly. They spent a week "prompt engineering" it back and forth, with subtle changes to their instructions on how the scoring should work, with no appreciable effect on accuracy or consistency.
That's another scenario where I felt someone was asking for something with no mechanical sympathy [1] for how it was supposed to happen. Maybe a "thinking" model (why do "AI" companies always abuse terms like this? rhetorical question) would have been able to get enough into its context to get closer to a better outcome, but I took their prompt and asked the model to write code instead. It translated their instructions into some overly-commented but simple-enough code that does the job perfectly every time, including a comment noting that the instructions they'd provided had a gap: people answering with a certain combination of answers wouldn't fall into any bucket.
[1] https://www.youtube.com/watch?v=7xTGNNLPyMI
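For comparison, the deterministic version of that kind of scoring is only a few lines. A sketch (the answer key, bucket names, and thresholds below are invented for illustration, not the ones from that internal quiz):

    # Deterministic bucketing: count correct answers, check specific questions.
    # ANSWER_KEY, the bucket names and the thresholds are made up for this example.
    ANSWER_KEY = {1: "A", 2: "C", 3: "B", 4: "A", 5: "C"}

    def bucket(responses):
        score = sum(1 for q, ans in responses.items() if ANSWER_KEY.get(q) == ans)
        if score >= 4 and responses.get(5) == ANSWER_KEY[5]:
            return "advanced"
        if score >= 2:
            return "intermediate"
        if score >= 1:
            return "beginner"
        # If the prose rules leave some combinations unassigned, code makes
        # that gap impossible to ignore.
        return "unclassified"

    print(bucket({1: "A", 2: "C", 3: "B", 4: "A", 5: "B"}))  # intermediate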
That does not mean weights derived from a pile of books will do such a thing.
Tools like Claude Code work around this by feeding code into the LLM with explicit line numbers. There's a demo of that here: https://static.simonwillison.net/static/2025/log-2025-06-02-... (expand some of the "tool result" panels until you see it), and more notes on where I got that trace from here: https://simonwillison.net/2025/Jun/2/claude-trace/
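You can replicate the trick yourself before pasting a file into a prompt. A minimal sketch of the idea (not Claude Code's actual tool output format):

    # Prefix each line with its 1-based number so "what's on line 27?" becomes
    # a lookup in the prompt rather than an exercise in counting tokens.
    def numbered(path):
        with open(path) as f:
            return "\n".join(f"{i}: {line.rstrip()}" for i, line in enumerate(f, start=1))

    print(numbered("script.py"))  # "script.py" is just a placeholder path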
My understanding is that early LLMs were bad at math (for similar reasons) but got better once the models were hooked up to a calculator, and later a code interpreter, behind the scenes.
E.g. ask one to find the 100th prime and it will write a Python script and then run that.
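Roughly the kind of script a code-interpreter-equipped model writes and executes for that prompt (a sketch, not any particular model's actual output):

    # Find the 100th prime by trial division instead of recalling it from text.
    def is_prime(n):
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    count, n = 0, 1
    while count < 100:
        n += 1
        if is_prime(n):
            count += 1

    print(n)  # 541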
It's not something they can regurgitate from previously seen text. Models like Claude with background code execution might get around that.
5 Brazil liziarB
https://chatgpt.com/share/6840a944-3bac-8010-9694-2a8b0a9c35...
Even o4-mini-high got it wrong though (Indonesia)
https://chatgpt.com/share/6840a9aa-1260-8010-ba3f-bd99fff721...
I don't understand why Gemini insists that it can count the lines itself, instead of falling back to its Python tool [1].
[1] https://github.com/elder-plinius/CL4R1T4S/blob/main/GOOGLE/G...
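Counting lines (or fetching line 27) is trivial for that tool. A sketch of the kind of thing the sandboxed Python call could run (the filename is just a placeholder):

    # The trivial check the Python tool could run instead of the model guessing.
    with open("script.py") as f:   # placeholder filename
        lines = f.readlines()

    print(len(lines))           # total number of lines
    print(lines[26].rstrip())   # line 27, 1-based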
https://chatgpt.com/share/683f9f73-42d8-8010-9cbc-27ad396a55...
ChatGPT 4o (the product, not the LLM) got it right with a little additional prompting
https://chatgpt.com/share/683f9fd4-e61c-8010-99be-81d25264ba...
But use Flash so you can get a wrong answer sooner?