I am sympathetic to the reasoning as to why LLMs should not be used to help some programmers right now. But I get a little frustrated seeing many of these kinds of posts that talk about fundamental limitations of LLMs vs. humans on the grounds that they cannot "logically reason" like a human does. These are limitations of the current approach to training and objectives; internally, we have no clue what is going on.
> it’s “just a statistical model” that generates “language” based on a chain of “what is most likely to follow the previous phrase”
Humans are statistical models too in an appropriate sense. The question is whether we try to execute phrase by phrase or not, or whether it even matters what humans do in the long term.
> The only way ChatGPT will stop spreading that nonsense is if there is a significant mass of humans talking online about the lack of ZSTD support.
Or you can change the implicit bias in the model by being more clever with your training procedure. This is basic stats here, not everything is about data.
> They don’t know anything, they don’t think, they don’t learn, they don’t deduct. They generate real-looking text based on what is most likely based on the information it has been trained on.
This may be comforting to think, but it's just wrong. It would make my job so much easier if it were true. If you take the time to define "know", "think", and "deduct", you will find it difficult to argue current LLMs do not do these things. "Learn" is the exception here, and is a bit more complex, not only because of memory and bandwidth issues, but also because "understand" is difficult to define.
raincole · 1h ago
While the normal distribution meme is notoriously overused, I think it fits the scenario here.
LLMs know so much (when you just use ChatGPT for the first time like it's an Oracle machine) -> LLMs don't know anything (when you understand how machine learning works) -> LLMs know so much (when you actually think about what 'know' means)
libraryofbabel · 1h ago
Yeah. The empty “it’s just a statistical model” critique (or the dressed-up “stochastic parrots” version of it) is almost a sign at this point that the person using it formed their opinions about AI back when ChatGPT first came out, and hasn’t really bothered to engage with it much since then.
If in 2022 I’d tried to convince AI skeptics that in three years we might have tools on the level of Claude Code, I’m sure I’d have heard everyone say it would be impossible because “it’s just a statistical model.” But it turned out that there was a lot more potential in the architecture for encoding structured knowledge, complex reasoning, etc., despite that architecture being probabilistic. (Don’t bet against the Bitter Lesson.)
LLMs have a lot of problems, hallucination still being one of them. I’d be the first to advocate for a skeptical hype-free approach to deploying them in software engineering. But at this point we need careful informed engagement with where the models are at now rather than cherry-picked examples and rants.
seba_dos1 · 1h ago
Unless what you work on is very simple and mostly mindless, using tools like Claude Code is the exact opposite of how to make the current SotA LLMs useful for coding. The models can help and boost your productivity, but it doesn't happen by letting them do more stuff autonomously. Quite the contrary.
And when what you usually work on actually is very simple and mostly mindless, you'd probably benefit more from doing it yourself, so you can progress above the junior stuff one day.
vidarh · 1h ago
People repeating the "stochastic parrot" meme in all kinds of variations, if anything, appear to be more like stochastic parrots than the typical LLM is.
efilife · 1h ago
> it cannot "logically reason" like a human does
Reason? Maybe. But there's one limitation that we currently have no idea how to overcome: LLMs don't know how much they know. If they tell you they don't know something, it may be a lie. If they tell you they do, that may be a lie too. I, a human, certainly know what I know and what I don't, and can recall where I know the information from.
vidarh · 1h ago
I have never met a human who has a good grasp of what they know and don't know. They may have a better grasp of it than an LLM, but humans are awfully bad at understanding the limits of our own knowledge, and will argue very strongly in favour of knowing more than we demonstrably do in all kinds of contexts.
ModernMech · 24m ago
LLMs are not humans, they are ostensibly tools. Tools are supposed to come with a list of things they can do. LLMs don’t and are therefore bad at being tools, so we anthropomorphize them. But they are also not good at being people, so LLMs are left in this weird in-between state where half the people say they’re useful and half the people say they cause more problems than they solve.
hodgehog11 · 31m ago
You are judging this based on what the LLM outputs, not on its internals. When we peer into its internals, it seems that LLMs actually have a pretty good representation of what they do and don't know; this just isn't reflected in the output because the relevant information is lost in future context.
AaronAPU · 1h ago
I’m afraid that sense of knowing what you know is very much illusory for humans as well. Everyone is just slowly having to come to terms with that.
mrcartmeneses · 32m ago
Socrates would beg to differ
lblume · 1h ago
Do you really know what you don't know? This would rule out unknown unknowns entirely.
add-sub-mul-div · 53m ago
Yes, it's not that people know specifically what they don't know, it's that they develop the wisdom to know those boundaries and anticipate them and reduce their likelihood and impact.
For example, if I use the language of my expertise for a familiar project then the boundaries where the challenges might lie are known. If I start learning a new language for the project I won't know which areas might produce unknowns.
The LLM will happily give you code in a language it's not trained well on. With the same confidence as using any other language.
gallerdude · 1h ago
> OpenAI researcher Noam Brown on hallucination with the new IMO reasoning model:
> Mathematicians used to comb through model solutions because earlier systems would quietly flip an inequality or tuck in a wrong step, creating hallucinated answers.
> Brown says the updated IMO reasoning model now tends to say “I’m not sure” whenever it lacks a valid proof, which sharply cuts down on those hidden errors.
> TLDR, the model shows a clear shift away from hallucinations and toward reliable, self‑aware reasoning.
Source: https://x.com/chatgpt21/status/1950606890758476264
>Humans are statistical models too in an appropriate sense.
No, we aren't, and I'm getting tired of this question-begging and completely wrong statement. Human beings are capable of what Kant in fancy words called "transcendental apperception": we're already bringing our faculties to bear on experience, without which the world would make no sense to us.
What that means in practical terms for programming problems of this kind is that we can say "I don't know", which the LLM can't, because there's no "I" in the LLM, no unified subject that can distinguish what it knows and what it doesn't, what's within its domain of knowledge or outside it.
>If you take the time to define "know", "think", and "deduct", you will find it difficult to argue current LLMs do not do these things
No, you'd only make such a statement if you don't spend the time to think about what knowledge is. What enables knowledge, which is not raw data but synthesized, structured cognition, is the faculties of the mind, the a priori categories we bring to bear on data.
That's why these systems are about as useless as a monkey with a typewriter when you try to have them work on manual memory management in C: that's less of a task in autocompletion, and it requires you to have in your mind a working model of the machine.
lblume · 1h ago
The position of Kant does not align with the direction modern neuroscience is heading. Current evidence seems to prefer decentralized theories of consciousness like Dennett's multiple drafts model[1], suggesting there is no central point where everything comes together to form conscious experience, but instead that it is itself constituted by collaborative processes that have multiple realizations.
[1]: https://en.wikipedia.org/wiki/Multiple_drafts_model
>Current evidence seems to prefer decentralized theories of consciousness like Dennett
There is no such thing as consciousness in Dennett's theory; his position is that it doesn't exist, he is an eliminativist. This is of course an absurd position with no evidence for it, as people like Chalmers have pointed out (including in that Wikipedia article), and it might be the most comical and ideological position of the last 200 years.
hodgehog11 · 39m ago
This is interesting philosophy, and others have better critiques here in that regard. I'm a mathematician, so I can only work with what I can define symbolically. Humans most certainly ARE statistical models by that definition: without invoking the precise terminology, we take input, yield output, and plausibly involve uncertain elements. One can argue whether this is the correct language or not, but I prefer to think this way, as the arrogance of human thinking has otherwise failed us in making good predictions about AI.
If you can come up with a symbolic description of a deficiency in how LLMs approach problems, that's fantastic, because we can use that to alter how these models are trained, and how we approach problems too!
> What that means in practical terms for programming problems of this kind is that we can say "I don't know", which the LLM can't, because there's no "I" in the LLM, no unified subject that can distinguish what it knows and what it doesn't, what's within its domain of knowledge or outside it.
We seriously don't know whether there is an "I" that is comprehended or not; I've seen arguments either way. But otherwise, this seems to refer to poor internal calibration of uncertainty, correct? This is an important problem! (It's also a problem with humans too, but I digress.) LLMs aren't nearly as bad at this as you might think, and there are a lot of things you can do (that the big tech companies do not do) to better tune a model's own self-confidence (as reflected in its logits). I'm not aware of anything that uses this information as part of the context, so that might be a great idea. But on the other hand, maybe this actually isn't as important as we think it is.
HDThoreaun · 52m ago
Kant was a dualist; of course he didn't think humans were statistical models. It just turns out he was (probably) wrong.
bwfan123 · 1h ago
Humans build theories of how things work. LLMs don't. Theories are deterministic symbolic representations of the chaotic worlds of meaning. Take, for example, the Turing machine as a theory of computation in general, Euclidean geometry as a theory of space, and Newtonian mechanics as a theory of motion.
A theory gives 100% correct predictions, although the theory itself may not model the world accurately. Such feedback between the theory and its application in the world causes iterations to the theory: from Newtonian mechanics to relativity, etc.
Long story short, the LLM is a long way away from any of this. And to be fair to LLMs, the average human is not creating theories; it takes some genius to create them (Newton, Turing, etc.).
Understanding something == knowing the theory of it.
hodgehog11 · 1h ago
> Humans build theories of how things work. LLMs don't. Theories are deterministic symbolic representations of the chaotic worlds of meaning
What made you believe this is true? Like it or not, yes, they do (at least to the best extent of our definitions of what you've said). There is a big body of literature exploring this question, and the general consensus is that all performant deep learning models adopt an internal representation that can be extracted as a symbolic representation.
bwfan123 · 1h ago
> What made you believe this is true?
I am yet to see a theory coming out of an LLM that is sufficiently interesting. My comment was answering your question of what it means to "understand something". My answer to that is: understanding something is knowing the theory of it.
Now, that begs the question of what a theory is. To answer that, a theory comprises building-block symbols and a set of rules to combine them. For example, the building blocks for space (and geometry) could be points, lines, etc. The key point in all of this is symbolism as abstractions to represent things in some world.
hodgehog11 · 25m ago
The "sufficiently interesting" part is the most important qualifier here. My response was talking about theories and representations that we already know, either instinctively from near-birth, or from learned experience. We have not seen anything unique from LLMs because they do not appear to have an internal understanding (in the same sense that I was talking about) that is as broad as an adult human. But that doesn't mean it lacks any understanding.
> The key point in all of this is symbolism as abstractions to represent things in some world.
The difficulty is understanding how to extract this information from the model, since the output of the LLM is actually a very poor representation of its internal state.
simonw · 1h ago
> This concludes all the testing for GPT5 I have to do. If a tool is able to actively mislead me this easy, which potentially results in me wasting significant amounts of time in trying to make something work that is guaranteed to never work, it’s a useless tool.
Yeah, except it isn't. You can get enormous value out of LLMs if you get over this weird science fiction requirement that they never make mistakes.
And yeah, their confidence is frustrating. Treat them like an over-confident twenty-something intern who doesn't like to admit when they get stuff wrong.
You have to put the effort in to learn how to use them with a skeptical eye. I've been getting value as a developer from LLMs since the GPT-3 era, and those models sucked.
> The only way ChatGPT will stop spreading that nonsense is if there is a significant mass of humans talking online about the lack of ZSTD support.
We actually have a robust solution for this exact problem now: run the prompt through a coding agent of some sort (Claude Code, Codex CLI, Cursor etc) that has access to the Swift compiler.
That way it can write code with the hallucinated COMPRESSION_ZSTD thing in it, observe that it doesn't compile and iterate further to figure out what does work.
Or the simpler version of the above: LLM writes code. You try and compile it. You get an error message and you paste that back into the LLM and let it have another go. That's been the main way I've worked with LLMs for almost three years now.
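In loop form, that's roughly the sketch below. It's only an illustration: askLLM stands in for whatever model call you use (it is not a real API), swiftc is assumed to be on your PATH on a Mac dev machine, and the retry cap is arbitrary.

    import Foundation

    // Placeholder for your model call of choice; not a real API.
    func askLLM(_ prompt: String) -> String {
        // ...send the prompt to a model and return the code it proposes...
        return ""
    }

    // Type-check a candidate source file with swiftc and return the compiler's
    // stderr output, or nil if it compiled cleanly.
    func compileErrors(of source: String) throws -> String? {
        let file = FileManager.default.temporaryDirectory.appendingPathComponent("attempt.swift")
        try source.write(to: file, atomically: true, encoding: .utf8)

        let process = Process()
        process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
        process.arguments = ["swiftc", "-typecheck", file.path]
        let stderrPipe = Pipe()
        process.standardError = stderrPipe
        try process.run()
        process.waitUntilExit()

        guard process.terminationStatus != 0 else { return nil }
        let data = stderrPipe.fileHandleForReading.readDataToEndOfFile()
        return String(data: data, encoding: .utf8)
    }

    // Write, compile, paste the error back, repeat (with a cap so it can't loop forever).
    do {
        var code = askLLM("Compress a Data stream with zstd in Swift, no third-party deps.")
        for _ in 0..<3 {
            guard let errors = try compileErrors(of: code) else { break } // it compiled
            code = askLLM("That didn't compile:\n\(errors)\nPlease fix it.")
        }
        print(code)
    } catch {
        print("compiler invocation failed: \(error)")
    }

A hallucinated COMPRESSION_ZSTD constant like the one in the post would surface here as a type-check error, which is exactly the text you'd otherwise paste back by hand.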
jpc0 · 1h ago
If that same intern, when asked something, responded that they had checked, gave you a link to a document they claim has the proof/answer but which in fact does not, and kept doing that, they wouldn't be an intern very long. But somehow this is acceptable behaviour for an AI?
I use AI for sure, but only on things that I can easily verify are correct (run a test or some code), because I have had the AI give me functions in an API with links to online documentation for those functions; the document exists, the function is not in it, and when called out, instead of doing a basic tool call, the AI will double down that it is correct and you, the human, are wrong. That would get an intern fired, but here you are standing on the intern's side.
I wrote a note about that here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#s...
> Don’t fall into the trap of anthropomorphizing LLMs and assuming that failures which would discredit a human should discredit the machine in the same way.
jpc0 · 34m ago
> Or the simpler version of the above: LLM writes code. You try and compile it. You get an error message and you paste that back into the LLM and let it have another go. That's been the main way I've worked with LLMs for almost three years now.
I’m going to comment here about this, but it’s a follow-on to the other comment: this is exactly the workflow I was following. I had given it the compiler error and it blamed an environment issue; I confirmed the environment is as it claims it should be, and it linked to documentation that doesn’t state what it claims is stated.
In a coding agent this would have been an endless feedback loop that eats millions of tokens.
This is the reason why I do not use coding agents: I can catch hallucinations and stop the feedback loop from ever happening in the first place, without needing to watch an AI agent try to convince itself that it is correct and the compiler must be wrong.
jpc0 · 44m ago
> And yeah, their confidence is frustrating. Treat them like an over-confident twenty-something intern who doesn't like to admit when they get stuff wrong.
I was explicitly calling out this comment: that intern would get fired if, when explicitly called out, they not only don’t admit they are wrong but vehemently disagree.
The interaction was “Implement X”. It gave an implementation, I responded “function y does not exist, use a different method”, and instead of following that instruction it gave me a link to the documentation for the library that it claims contains that function and told me I am wrong.
I said the documentation it linked does not contain that function and to do something different and yet it still refused to follow instructions and pushed back.
At that point I “fired” it and wrote the code myself.
raincole · 1h ago
Wow... people unironically anthropomorphize AI to the point that they expect it to work exactly like a human intern, otherwise it's unacceptable...
bfioca · 2h ago
>...it’s a useless tool. I don’t like collaborating with chronic liars who aren’t able to openly point out knowledge gaps...
I think a more correct take here might be "it's a tool that I don't trust enough to use without checking," or at the very least, "it's a useless tool for my purposes." I understand your point, but I got a little caught up on the above line because it's very far out of alignment with my own experience using it to save enormous amounts of time.
libraryofbabel · 53m ago
and as others have pointed out, this issue of “how much should I check” is really just a subset of an old general problem in trust and knowledge (“epistemology” or what have you) that people have recognized since at least the scientific revolution. The Royal Society’s motto on its founding in the 1660s was “take no man’s word for it.”
Coding agents have now got pretty good at checking themselves against reality, at least for things where they can run unit tests or a compiler to surface errors. That would catch the error in TFA. Of course there is still more checking to do down the line, in code reviews etc, but that goes for humans too. (This is not to say that humans and LLMs should be treated the same here, but nor do I treat an intern’s code and a staff engineer’s code the same.) It’s a complex issue that we can’t really collapse into “LLMs are useless because they get things wrong sometimes.”
lazide · 2h ago
It’s a tool that fundamentally can’t be used reliably without double-checking everything it produces. That is rather different from how you’re presenting it.
vidarh · 1h ago
We double check human work too in all kinds of contexts.
A whole lot of my schooling involved listening to teachers repeating over and over to us how we should check our work, because we can't even trust ourselves.
(heck, I had to double-check and fix typos in this comment)
mhh__ · 1h ago
Checking is usually faster than writing from scratch, so this is still +EV.
efilife · 1h ago
What does +EV mean? I'm looking but can't find a definition
dcrazy · 1h ago
Positive expected value. In other words, it’s likely using a LLM saves you time relative to performing the same task without using an LLM.
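As a toy illustration with made-up numbers (the probability and minute costs below are assumptions, not measurements):

    // Illustrative only: invented numbers to show what "+EV" means in this context.
    let pUsable = 0.7          // assumed fraction of LLM drafts that survive review
    let checkMinutes = 5.0     // assumed cost of reviewing a draft
    let writeMinutes = 20.0    // assumed cost of writing it yourself from scratch

    // If the draft fails review, you pay for the check *and* the rewrite.
    let expectedWithLLM = pUsable * checkMinutes + (1 - pUsable) * (checkMinutes + writeMinutes)
    print(expectedWithLLM)     // ≈ 11 minutes on average, vs. 20 writing from scratch

Flip the numbers (say drafts only survive review 20% of the time, or checking is as slow as writing) and the expected value drops below just doing it yourself, which is essentially the disagreement in this thread.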
rcxdude · 1h ago
"Positive expected value", i.e. it'll pay off on average.
lblume · 1h ago
Expected Value?
exe34 · 1h ago
> Checking is usually faster than writing from scratch
Famous last words. Checking trivial code for trivial bugs, yes. In science, you can have very subtle bugs that bias your results in ways that aren't obvious for a while until suddenly you find yourself retracting papers.
I've used LLMs to write tedious code (that should probably have been easier if the right API had been thought through), but when it comes to the important stuff, I'll probably always write an obviously correct version first and then let the LLM try to make a faster/more capable version, that I can check against the correct version.
dcrazy · 1h ago
Humans are notoriously bad at checking for meticulous detail—hence why copyediting was a specialized role. We’ve seen what’s happened since the newspapers eliminated it for cost savings.
I only used an LLM for the first time recently, to rewrite a YouTube transcript into a recipe. It was excellent at the overall restructuring, but it made a crucial and subtle mistake. The recipe called for dividing 150g of sugar, adding 30g of cornstarch to one half, and blanching eggs in that mixture. ChatGPT rewrote it so that you blanched the eggs in the other half, without the cornstarch. This left me with a boiling custard that wasn’t setting up.
I did confirm that the YouTube transcript explicitly said to use the sugar and cornstarch mixture. But I didn’t do a side by side comparison because the whole reason for doing the rewrite is that transcripts are difficult to read!
libraryofbabel · 1h ago
> I'll probably always write an obviously correct version first
I’m not usually so confident in my own infallibility, so I prefer to think of it as “I might get this wrong, the LLM might get this wrong, our failure modes are probably not very correlated, so the best thing is for us both to do it and compare.”
Agree it is always better for the human engineer to try writing the critical code first, since they are susceptible to being biased by seeing the LLM’s attempt. Whereas you can easily hide your solution from the LLM.
tmnvdb · 1h ago
So, similar to Wikipedia.
simianwords · 1h ago
Similar to anything really. Can I really trust anything without verifying? Scientific journals?
lblume · 1h ago
It seems that on some level you have to, in order not to be constantly reflecting upon your thoughts and researching facts. Whether you trust a given source should surely depend upon its reputation regarding the validity of its claims.
simianwords · 1h ago
I agree, and by reputation you mean accuracy. We implicitly know not to judge anything as 100% true and implicitly apply skepticism towards sources - the skepticism is decided by our past experience with those sources.
Think of LLMs as the less accurate version of scientific journals.
lblume · 1h ago
Accuracy certainly does play a role, but this in itself is not sufficient for preventing an infinite regress – how does one determine the accuracy of a source if not by evaluating claims about the source, which themselves have sources that need to be checked for accuracy? Empirical inquiries are optimal but often very impractical. Reputation is accuracy as imperfectly valued by society or specific social groups collectively.
oidar · 7m ago
If you have an Apple docs MCP, it does let you know and offers alternatives. So this is another “you’re holding it wrong” problem.
gdsys · 2h ago
"Based on my research, zstd compression is not natively supported by iOS or Apple's frameworks, which means you cannot use zstd compression without adding some form of external code to your project"
Most likely the key here is web search. Later I will try the post’s example with GPT-5 with search. I would be surprised if it didn’t say the same thing.
From a product point of view, it seems clear that just as they have worked to get the model to dynamically decide to use reasoning when it would help, they have to do the same with web search.
Maro · 1h ago
I'm currently working as a hands-off VP, so I don't use LLMs for coding at work, only for emails and document editing. I do use it for my hobby weekend coding stuff, which usually involves writing short 100-1000 LOC Python toy programs for my own education/entertainment. My way-of-working is quite primitive, I have zero integrations, nothing agentic, I just copy/paste with ChatGPT.
For this use-case it's been very useful, it can usually generate close-to-complete solutions, as long as it's one of the major programming languages, and it's a reasonably standard problem. So in general I'm always surprised when people say that LLMs are completely useless for coding --- this is just not true, and I feel sorry for people who shut themselves off from a useful tool.
But even at this small scale, even the best (o3) models sometimes totally fail. Recently I started a series of posts on distributed algorithms [1], and when I was working on the post/code for the Byzantine Generals / Consensus algorithm, o3 --- to my honest surprise --- just totally failed. I tried about 10 different times (both from scratch and describing the incorrect behaviour of its code), also showing it the original Lamport paper, and it just couldn't get it right... even though the toy implementation is just ~100 LOC, the actual algorithm portion is maybe 25 LOC. My hypothesis is that there are very few implementations online, and additionally I find the descriptions of the algorithm a bit vague (they interleave the message cascade and the decision logic).
[1] https://bytepawn.com/tag/distributed.html
ChatGPT 5> Without adding third-party dependencies, how can I compress a Data stream with zstd in Swift on an iPhone?
(It auto-routed to its thinking model, took 25s)
> Short answer: you can’t. iOS doesn’t ship a public zstd encoder/decoder. The system Compression and AppleArchive frameworks only expose lzfse, lz4, zlib/deflate, and lzma. So with only Apple SDKs, there’s no way to produce zstd-compressed bytes.
>
> Two practical paths:
> 1. If zstd isn’t mandatory, use a built-in codec (lzfse or lz4 are fastest) via Compression’s streaming API. Example (pure Swift, no deps) that compresses an arbitrary byte stream:
>
> <code example>
> 2. If zstd is mandatory, you must embed zstd yourself. That can mean:
> - Vendoring the official C sources (libzstd) directly into your project and calling them via a tiny Swift wrapper/bridging header
> - Statically linking a prebuilt libzstd.a you compile for iOS
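The built-in-codec path it describes (option 1) boils down to something like the minimal sketch below. This is not the model's elided example; it uses Foundation's NSData bridge to the Compression framework (iOS 13+ / macOS 10.15+), and for genuinely streaming data you'd reach for the compression_stream C API instead.

    import Foundation

    // Minimal sketch of "use a built-in codec instead of zstd".
    // NSData.compressed(using:) / decompressed(using:) wrap the Compression framework.
    func lzfseCompress(_ data: Data) throws -> Data {
        try (data as NSData).compressed(using: .lzfse) as Data
    }

    func lzfseDecompress(_ data: Data) throws -> Data {
        try (data as NSData).decompressed(using: .lzfse) as Data
    }

    do {
        let original = Data(repeating: 0x41, count: 100_000)   // highly compressible input
        let packed = try lzfseCompress(original)
        let restored = try lzfseDecompress(packed)
        print("round trip ok: \(restored == original), \(original.count) -> \(packed.count) bytes")
    } catch {
        print("compression failed: \(error)")
    }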
djeastm · 1h ago
Here's what I got with ChatGPT (GPT-5); it must not have thought about it, because the answer was near-instantaneous:
>On iOS, you can use Apple’s built-in Zstandard (zstd) compression API from the Compression framework — no third-party dependencies required.
>Here’s how you can compress a Data stream with zstd:
>...
https://chatgpt.com/share/68976c8f-7ae0-8012-b7a8-58e016246d...
I think the useful takeaway here is that top-1 operation (just taking the single most likely answer) is generally not a good idea, especially not for making judgements. This doesn't address the main points of the blog, though.
jmkni · 2h ago
What's funny is that newer models will now be trained on the exact question, "Without adding third-party dependencies, how can I compress a Data stream with zstd in Swift on an iPhone?", and questions similar to it, because of this post.
Maybe the key to training future LLMs is to write angry blog posts about the things they aren't good at and get them to the front page of HN?
nikolayasdf123 · 2h ago
Good point. Nobody knows you are a dog on the internet anyway.
quantum_state · 2h ago
An implication of Peter Naur's 1985 paper on programming as theory building is that current LLM coding tools would be very effective at generating technical debt even when they work ... use at your own risk.
"Short answer: you can’t. iOS doesn’t ship a Zstandard (zstd) encoder/decoder in any first-party framework. Apple’s built-in Compression framework supports LZFSE, LZ4, zlib/deflate, and LZMA—not zstd."
tptacek · 1h ago
LLMs can be a useful tool, maybe. But don’t anthropomorphize them.
(but, earlier)
If a tool is able to actively mislead me this easy, which potentially results in me wasting significant amounts of time in trying to make something work that is guaranteed to never work, it’s a useless tool. I don’t like collaborating with chronic liars.
lblume · 1h ago
With "chronic liar" being a severe anthropomorphization by itself, due to assuming some level of intent by the LLM, correct?
tptacek · 28m ago
Yes. They had it right at the end of the piece, when they said "don't anthropomorphize". LLMs don't "lie".
nikolayasdf123 · 2h ago
possible solution: "reality checks"
I see that GitHub Copilot actually runs code, writes simple exploratory programs, and iteratively tests its hypotheses. It is astoundingly effective and fast.
Same here. Nothing stops this AI from actually trying to implement whatever it suggested, compiling it, and seeing if it actually works.
Grounding in reality at inference time, so to speak.
nikolayasdf123 · 2h ago
> “Not having an answer” is not a possibility in this system - there’s always “a most likely response”, even if that makes no sense.
Simple fix: a probability cutoff. But in all seriousness, this is something that will be fixed; I don't see a fundamental reason why not.
And I myself have seen such hallucinations (about compression too, actually).
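A toy version of what I mean by a cutoff, with everything hedged: it assumes you can get per-token log-probabilities out of whatever model you use (many APIs expose them), the threshold is made up, and it only helps to the extent those probabilities are calibrated in the first place.

    // Abstain when the model's own average token confidence is below a cutoff.
    func answerOrAbstain(answer: String,
                         tokenLogProbs: [Double],
                         threshold: Double = -1.5) -> String {
        guard !tokenLogProbs.isEmpty else { return "I'm not sure." }
        let meanLogProb = tokenLogProbs.reduce(0, +) / Double(tokenLogProbs.count)
        return meanLogProb >= threshold ? answer : "I'm not sure."
    }

    // Confident tokens (log-probs near 0) pass through; a shaky completion is replaced.
    print(answerOrAbstain(answer: "lzfse, lz4, zlib, lzma", tokenLogProbs: [-0.1, -0.3, -0.2]))
    print(answerOrAbstain(answer: "COMPRESSION_ZSTD", tokenLogProbs: [-2.8, -3.5, -1.9]))

The obvious caveat is that token probability measures how expected the text is, not whether the claim is true.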
Seb-C · 1h ago
Hallucinations are not a bug or an exception, but a feature. Everything output by LLMs is 100% made up, with a heavy bias towards what was fed to them in the first place (human-written content).
The fundamental reason why it cannot be fixed is because the model does not know anything about the reality, there is simply no such concept here.
To make a "probability cutoff" you first need a probability about what the reality/facts/truth is, and we have no such reliable and absolute data (and probably never will).
simianwords · 1h ago
>To make a "probability cutoff" you first need a probability about what the reality/facts/truth is, and we have no such reliable and absolute data (and probably never will).
Can a human give a probability estimate to their predictions?
tmnvdb · 1h ago
You use a lot of anthropomorphisms: doesn't "know" anything (does your hard drive know things? Is it relevant?), and "making things up" is even more linked to conscious intent. Unless you believe the LLMs are sentient, this is a strange choice of words.
Seb-C · 55m ago
I originally put quotes around "know" and somehow lost it in an edit.
I'm precisely trying to criticize the claims of AGI and intelligence. English is not my native language, so nuances might be wrong.
I used the word "makes-up" in the sense of "builds" or "constructs" and did not mean any intelligence there.
nikolayasdf123 · 1h ago
Have you seen the Iris flower dataset? It is fairly simple to find cutoffs to classify the flowers.
Or are you claiming in general that there is no objective truth in reality, in a philosophical sense? Well, you can go down that more philosophical side of the road, or you can get more pragmatic. Things just work, regardless of how we talk about them.
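Something like this toy sketch: one hand-picked cutoff on petal length separates Iris setosa from the other two species in Fisher's dataset (the three sample values below are typical measurements, not the full data).

    struct IrisSample { let petalLength: Double; let species: String }

    // In the classic dataset no setosa petal exceeds ~1.9 cm and no
    // versicolor/virginica petal is below ~3.0 cm, so 2.5 cm separates them cleanly.
    func looksLikeSetosa(_ s: IrisSample) -> Bool {
        s.petalLength < 2.5
    }

    let samples = [
        IrisSample(petalLength: 1.4, species: "setosa"),
        IrisSample(petalLength: 4.5, species: "versicolor"),
        IrisSample(petalLength: 6.0, species: "virginica"),
    ]
    for s in samples {
        print(s.species, looksLikeSetosa(s) ? "-> classified setosa" : "-> classified not setosa")
    }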
Seb-C · 1h ago
I don't mean it in a philosophical sense, more in a rigorous scientific one.
Yes, we do have reliable datasets as in your example, but those are for specific topics and are not based on natural language. What I would call "classical" machine learning is already a useful technology where it's applied.
Jumping from separate datasets focused on specific topics to a single dataset describing "everything" at once is not something we are even close to doing, if it's even possible. Hence the claim of having a single AI able to answer anything is unreasonable.
The second issue is that even if we had such a hypothetical dataset, ultimately if you want a formal response from it, you need a formal question and a formal language (probably something between maths and programming?) in all the steps of the workflow.
LLMs are only statistical models of natural languages, so they're the antithesis of this very idea. Achieving that would require a completely different technology that has yet to even be theorized.
zhivota · 1h ago
I call this genre of article "I'm too lazy to think about how to deal with the fact that LLMs are sometimes wrong, especially when they aren't using thinking and are given short, leading prompts."
Ok great! For those of us who aren't too lazy for it, LLMs are providing a lot of value right now.
add-sub-mul-div · 46m ago
I'm not sure that the people who still prefer to do their own work rather than circuitously delegate it are the ones that seem lazy.
blizdiddy · 1h ago
Finding a useless way to use a tool is neither interesting nor novel. I hit my thumb with a hammer; useless tool.
overgard · 1h ago
My issue with LLMs isn't really the tech itself. They're occasionally useful, although HOW useful seems to depend on a person's skill set.
My main beef with the AI hype is that it's allowing a lot of idiots to significantly devalue our profession, in a really noxious and irritating way, to people who generally don't understand what we do but would like to pay us less or employ fewer of us. I'm annoyed at other software developers who don't seem to see how harmful this will be for us when the insane investment bubble bursts and AI becomes a lot more expensive to use. We will probably have lost a generation of junior developers who have become dependent on a suddenly expensive tool. And execs will just think the seniors need to pick up the slack. And expectations on AI will be a lot higher when the subscription is more like 200 or 2000 a month.
And that's just for coding! I'd be furious if I was an artist and generative AI was being trained on my portfolio to plagiarize my work. (Badly)
What I never see justified is why any of this is good for society. At best it lets billionaires save some money by getting rid of jobs, or vibe coders pretend they can build a product until they hit a wall where real understanding is necessary. If you follow the trail of who is supposed to benefit from these things, it's not many of us. If AI were to disappear today, I don't think my life would be any worse.
viach · 1h ago
I think people often misread AGI as Artificial God Intelligence.
johnfn · 1h ago
I find it hard to empathize with these sorts of articles. I think, in the spate of GPT-5-related content recently, I've been seeing a lot of articles that boil down to "I tried GPT-5 on a single hard question, and it gave a wrong answer. This proves that all LLMs are useless." And I don't think I'm distorting the author's viewpoint. He goes on to say "This concludes all the testing for GPT5 I have to do" after conducting a single test.
This seems like particularly harsh criteria; what would happen if I applied this to other tools?
- I used TypeScript, but it missed a bug that crashed prod, so it is "absolute horseshit"
- I used Rust, but one of my developers added an unsafe block, so it's trash.
cantor_S_drug · 2h ago
There was a writer who, in order to get ideas to write about, used to cut up words from newspaper headlines and then rearrange them.
In one rearrangement, he got "Son sues father for xyz". That headline came true 2 years later.
lblume · 1h ago
How is this relevant? Modern LLMs don't rearrange words at all in a meaningful sense of the word, and are certainly better than just using random chance to sample tokens.
cantor_S_drug · 1h ago
Indeed it is exactly that process. They cut up words, then categorize them based on some metric of nearness (not random), then link them up. Obviously this process is much more sophisticated than what I have described here.
techpineapple · 4h ago
I wonder if one reason new versions of GPT appear to get better - say, at coding tasks - is just that they have new knowledge.
When ChatGPT 4 comes out, newer versions of APIs will have fewer blog posts / examples / documentation in its training data. So ChatGPT 5 comes out and seems to solve all the problems that ChatGPT 4 had, but then of course fails on newer libraries. Rinse and repeat.
its-kostya · 3h ago
> ... just because they have new knowledge.
This means there is a future where AI is training on data it itself generated, and I worry that might not be sustainable.
jgalt212 · 2h ago
A software-based Habsburg jaw, if you will.
lazide · 2h ago
This is already occurring, is not sustainable, and produces an effect known as Model Collapse.
techpineapple · 3h ago
I’ve heard of this idea of training on synthetic data. I wonder what that data is, and does it increase or decrease hallucinations? Is the goal of training on synthetic data to better wear certain paths, or to increase the amount of knowledge / types of data?
Because the second seems vaguely impossible to do.