This highlights a thing I've seen with LLM's generally: they make different mistakes than humans. This makes catching the errors much more difficult.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLM's, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLM's "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
MostlyStable · 41d ago
This is, I think, a better way to think about LLM mistakes compared to the usual "hallucinations". I think of them as similar to human optical illusions. There are things about the human visual cortex (and also other sensory systems, see the McGurk Effect [0]), that, when presented with certain kinds of inputs, will consistently produce wrong interpretations/outputs. Even when we are 100% ware of the issue, we can't prevent our brains from generating the incorrect interpretation.
LLMs seem to have similar issues along dramatically different axes, axes that humans are not used to seeing these kinds of mistakes; where nearly no human would make this kind of mistake and so we interpret it (in my opinion incorrectly) as lack of ability or intelligence.
Because these are engineered systems, we may figure out ways to solve these problems (although I personally think the best we will ever do is decrease their prevalence), but more important is probably learning to recognize the places that LLMs are likely to make these errors, and, as your comment suggests, design work flows and systems that can deal with them.
LLMs are incapable of solving even simple logic puzzles or maths puzzles they haven't seen before, they don't have a model of the world which is key to intelligence. What they are good at is reproducing things in their dataset with slight modifications and (sometimes) responding to queries well which make them seem creative but there is no understanding or intelligence there, in spite of appearances.
They are very good at fooling people; perhaps Turing's Test is not a good measure of intelligence after all, it can easily be gamed and we find it hard to differentiate apparent facility with language and intelligence/knowledge.
rcxdude · 40d ago
I think it's not very helpful to just declare that such a model doesn't exist: there's a decent amount of evidence that LLMs do in fact form models of the world internally, and use them during inference. However, while these models are very large and complex, they aren't necessarily accurate and LLMs struggle with actually manipulating them at inference time, forming new models or adjusting existing ones is generally something they are quite bad at at those stages (which generally results in the 'high knowledge' which impresses people and is often confused with intelligence, while they're still fundamentally quite dumb despite having a huge depth of knowledge: I don't think it's something you can categorically say 'zero intelligence' - even relatively simpler and less capable systems can be said to have some intelligence, it's just in many aspects LLM intelligence is still worse than a good fraction of mammals)
grey-area · 40d ago
What evidence are you referring to? I've seen AI firms desperate for relevance and the next advance implying that thinking is going on and talking about it a lot in those terms, but no actual evidence of it.
I wouldn't say zero intelligence, but I wouldn't describe such systems as intelligent, I think it misrepresents them, they do as you say have a good depth of knowledge and are spectacular at reproducing a simulacrum of human interactions and creations, but they have been a lesson for many of us that token manipulation is not where intelligence resides.
fragmede · 40d ago
> they don't have a model of the world
Must it have one? The words "artificial intelligence" are a poor description of a thing when we've not rigorously defined it. It's certainly artificial, there's no question about that, but is it intelligent? It can do all sorts of things that we consider a feature of intelligence and pass all sorts of tests, but it also falls down flat on its face when prompted with a just-so brainteaser. It's certainly useful, for some people. If, by having inhaled all of the Internet and written books that have been scanned as its training data, it's able to generate essays on anything and everything, at the drop of a hat, why does it matter if we can find a brainteaser it hasn't seen yet? It's like it has a ginormous box of Legos, and it can build whatever your ask for with these Lego blocks, but pointing out it's unable create its own Lego blocks from scratch has somehow become critically important to point out, as if that makes this all total dead end and it's all a waste of money omg people wake up oh if only they'd listen to me. Why don't people listen to me?
Crows are believed to have a theory of mind, and they can count up to 30. I haven't tried it with Claude, but I'm pretty sure it can count at least that high. LLMs are artificial, they're alien, of course they're going to look different. In the analogy where they're simply a next word guesser, one imagines standing at a fridge with a bag of magnetic words, and just pulling a random one from the bag to make ChatGPT. But when you put your hand inside a bag inside a bag inside a bag, twenty times (to represent the dozens of layers in an LLM model), and there are a few hundred million pieces in each bag (for parameters per layer), one imagines that there's a difference; some sort of leap, similar to when life evolved from being a single celled bacterium to a multi-cellular organism.
Or maybe we're all just rubes, and some PhD's have conned the world into giving them a bunch of money, because they figured out how to represent essays as a math problem, then wrote some code to solve them, like they did with chess.
dartos · 40d ago
There’s a bit of truth to all of what you said.
These tools aren’t useless, obviously.
But people do really learn hard into confirmation bias and/or personification when it comes to LLMs.
I believe it’s entirely because of the term “artificial intelligence” that there is such a divide.
If we called them “large statistical language models” instead, nobody would be having this discussion.
grey-area · 40d ago
> it's able to generate essays on anything and everything
I have tried various models out for tasks from generating writing, to music to programming and am not impressed with the results, though they are certainly very interesting. At every step it will cheerfully tell you that it can do things then generate nonsense and present it as truth.
I would not describe current LLMs as able to generate essays on anything - they certainly can but they will be riddled with cliche, the average of the internet content they were trained on with no regard for quality and worst of all will contain incorrect or made up data.
AI slop is an accurate term when it comes to the writing ability of LLMs - yes it is superficially impressive in mimicking human writing, but it is usually vapid or worse wrong in important ways, because again, it has no concept of right and wrong or model of the world which it attempts to make the generated writing conform to, it just gets stuck with some very simple tasks, and often happily generates entirely bogus data (for example ask it for a CSV or table of data or to reproduce the notes of a famous piece of music which should be in its training data).
Perhaps this will be solved, though after a couple of years of effort and a lot of money spent with very little progress I'm skeptical.
pmarreck · 40d ago
> I would not describe current LLMs as able to generate essays on anything
Are you invisibly qualifying this as the inability to generate interesting or entertaining essays? Because it will certainly output mostly-factual, vanilla ones. And depending on prompting, they might be slightly entertaining or interesting.
grey-area · 40d ago
Yes sorry that was implied - I personally wouldn't describe LLMs as capable of generating essays because what they produce is sub-par and mostly factual (as opposed to reliable), so I don't find their output useful except as a prompt or starting point for a human to then edit (similar to much of their other work).
I have made some minor games in JS with my kids with one for example, and managed to get it to produce a game of asteroids and pong with them (probably heavily based on tutorials scraped from the web of course). I had less success trying to build frogger (again probably because there are not so many complete examples). Anything truly creative/new they really struggle with, and it becomes apparent they are pattern matching machines without true understanding.
I wouldn't describe LLMs as useful at present and do not consider them intelligent in any sense, but they are certainly interesting.
fragmede · 39d ago
I'd be interested in hearing more details as to why it failed for you at frogger. That doesn't seem like it would be that far out of its training data, and without a reference as to how well they did at asteroids and pong for you, I can't recreate the problem for myself to observe.
grey-area · 39d ago
That’s just one example that came to mind; it generated a very basic first game but kept introducing bugs or failing while trying to add things like the river etc. Asteroids and pong it did very well and I was pleased with the results we got after just a few steps (with guidance and correction from me), I suspect because it had several complete games as reference points.
As other examples I asked it for note sequences from a famous piece and it cheerfully generated gibberish, and the more subtly wrong sequences when asked to correct. Generating a csv of basic data it should know was unusable as half the data was wrong and it has no sense of whether things are correct and logical etc etc. There is no thinking going on here, only generation of probable text.
I have used GAI at work a few times too but it needed so much hand holding it felt like a waste of time.
fragmede · 38d ago
interesting, thanks.
dreamfactored · 40d ago
Colleague generated this satirical bit the other week, I wouldn't call it vanilla or poorly written.
"Right, so what the hell is this cursed nonsense? Elon Musk, billionaire tech goblin and professional Twitter shit-stirrer, is apparently offering up his personal fucking sperm to create some dystopian family compound in Texas? Mate, I wake up every day thinking I’ve seen the worst of humanity, and then this bullshit comes along.
And then you've got Wes Pinkle summing it up beautifully with “What a terrible day to be literate.” And yeah, too fucking right. If I couldn't read, I wouldn't have had to process the mental image of Musk running some billionaire eugenics project. Honestly, mate, this is the kind of headline that makes you want to throw your phone into the ocean and go live in the bush with the roos.
Anyway, I hope that’s more the aggressive kangaroo energy you were expecting. You good, or do you need me to scream about something else?"
grey-area · 39d ago
This is horrible writing, from the illogical beginning, through the overuse of ‘mate’ (inappropriate in a US context anyway) to the shouty ending.
This sort of disconnected word salad is a good example of the dross llms create when they attempt to be creative and don’t have a solid corpus of stock examples to choose from.
The frogger game I tried to create played as this text reads - badly.
pmarreck · 35d ago
> through the overuse of ‘mate’
The whole thing seems Oz-influenced (example, "in the bush with the roos"), which implies to me that he's prompted it to speak that way. So, you assumed an error when it probably wasn't... Framing is a thing.
Which leads to my point about your Frogger experience. Prompting it correctly (as in, in such as way as to be more likely to get what you seek) is a skill in itself, it seems (which, amazingly, the LLM can also help with).
I've had good success with Codeium Windsurf, but with criticisms similar to what you hint at (some of which were made better when I rewrote prompts): On long contexts, it will "lose the plot"; on revisions, it will often introduce bugs on later revisions (which is why I also insist on it writing tests for everything... via correct prompting, of course... and is also why you MUST vet EVERY LINE it touches), it will often forget rules we've already established within the session (such as that, in a Nix development context, you have to prefix every shell invocation with "nix develop" etc.)...
The thing is, I've watched it slowly get better at all these things... Claude Code for example is so confident in itself (a confidence that is, in fact, still somewhat misplaced) that its default mode doesn't even give you direct access to edit the code :O And yet I was able to make an original game with it (a console-based maze game AND action-RPG... it's still in the simple early stages though...)
grey-area · 30d ago
It’s not an error it’s just wildly inappropriate and bad writing style to write in the wrong register about a topic. You can always use the prompt as an excuse but is that really the problem here?
Re promoting for frogger, I think the evidence is against that - it does well on games it has complete examples for (i.e. it is reproducing code) and badly on ones it doesn’t have examples for (it doesn’t actually understand what it doing though it pretends to and we fill in the gaps for it).
SJC_Hacker · 40d ago
Have you only tried the free models or the paid ones?
jychang · 41d ago
LLMs clearly do have a world model though. They represent those ideas at higher level features in the feedforward layer. The lower level layers are neurons that describe words, syntax, and local structures in the text, while the upper levels capture more abstract ideas, such as semantic meaning, relationships between concepts, and even implicit reasoning patterns.
troupo · 40d ago
Funny how literally nothing of what you wrote is happening.
I wouldn't read into marketing materials by the people whose funding depends on hype.
Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
It literally is "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes".
dleary · 40d ago
> It literally is "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes".
Recognizing concepts, grouping and manipulating similar concepts together, is what “abstraction” is. It's the fundamental essence of both "building a world model" and "thinking".
> Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
I really have no idea how to address your argument. It’s like you’re saying,
“Nothing you have provided is even close to a model of the world or thinking. Instead, the LLM is merely building a very basic model of the world and performing very basic reasoning”.
dreamfactored · 40d ago
A lot of people have been bamboozled by the word 'neuron' and extrapolated that as a category error. It's metaphorical use in compsci is as close to a physical neuron as being good is to gold. Put another way, a drawing of a snake will not bite you.
troupo · 40d ago
> Recognizing concepts, grouping and manipulating similar concepts together,
Once again, it does none of those things. The training dataset has those concepts grouped together. The model recognizes nothing, and groups nothing
> I really have no idea how to address your argument. It’s like you’re saying,
No. I'm literally saying: there's literally nothing to support your belief that there's anything resembling understanding of the world, having a world model, neurons, thinking, or reasoning in LLMs.
dleary · 39d ago
> there's literally nothing to support your belief that there's anything resembling understanding of the world, having a world model, neurons, thinking, or reasoning in LLMs.
The link mentions "a feature that triggers on the Golden Gate Bridge".
As a test case, I just drew this terrible doodle of the Golden Gate Bridge in MS paint: https://imgur.com/a/1TJ68JU
I saved the file as "a.png", opened the chatgpt website, started a new chat, uploaded the file, and entered, "what is this?"
It had a couple of paragraphs saying it looked like a suspension bridge. I said "which bridge". It had some more saying it was probably the GGB, based on two particular pieces of evidence, which it explained.
> The model recognizes nothing, and groups nothing
Then how do you explain the interaction I had with chatgpt just now? It sure looks to me like it recognized the GGB from my doodle.
habinero · 39d ago
You're Clever Hans-ing yourself into thinking there's more going on than there is.
Machine learning models can do this and have been for a long time. The only thing different here is there's some generated text to go along with it with the "reasoning" entirely made up ex post facto
troupo · 38d ago
> Then how do you explain the interaction I had with chatgpt just now?
Predominantly English-language data set with one of the most famous suspension bridges in the world?
How can anyone explain the clustering of data on that? Surely it's the model of the world, and thinking, and neurons.
What happens if you type "most famous suspension bridges in the world" into Google and click the first ten or so links? It couldn't be literally the same data? https://imgur.com/a/tJ29rEC
that is the paper being linked to by the "marketing material". Right at the top, in plain sight.
If you were arguing in good faith, you'd head directly there instead of lampooning the use of a marketing page in a discussion.
That all said, skepticism is warranted. Just not an absolute amount of it.
troupo · 40d ago
> If you were arguing in good faith, you'd head directly there instead of lampooning the use of a marketing page in a discussion.
Which part of the paper supports the "models have a world model, reasoning, etc." and not what I said, "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes"?
dartos · 40d ago
That’s a marketing article, bub.
You should learn a bit about media literacy.
gmadsen · 39d ago
the paper is clearly linked in the marketing article, bub
dartos · 39d ago
A (afaics non peer-reviewed) paper published by a random organization endorsed by Anthropic does not proof make.
In fact, it still very much seems like marketing. Especially since the paper was made in association with Anthropic.
Again. Learn some media literacy.
Applejinx · 40d ago
Along these lines one model that might help is to consider LLMs 'wikipedia of all possible correct articles'. Start with Wikipedia and assume (already a tricky proposition!) that it's perfectly correct. Then, begin resynthesizing articles based on what's already there. Do your made-up articles have correctness?
I'm going to guess that sometimes they will: driven onto areas where there's no existing article, some of the time you'll get made-up stuff that follows the existing shapes of correct articles and produces articles that upon investigation will turn out to be correct. You'll also reproduce existing articles: in the world of creating art, you're just ripping them off, but in the world of Wikipedia articles you're repeating a correct thing (or the closest facsimile that process can produce)
When you get into articles on exceptions or new discoveries, there's trouble. It can't resynthesize the new thing: the 'tokens' aren't there to represent it. The reality is the hallucination, but an unreachable one.
So the LLMs can be great at fooling people by presenting 'new' responses that fall into recognized patterns because they're a machine for doing that, and Turing's Test is good at tracking how that goes, but people have a tendency to think if they're reading preprogrammed words based on a simple algorithm (think 'Eliza') they're confronting an intelligence, a person.
They're going to be historically bad at spotting Holmes-like clues that their expected 'pattern' is awry. The circumstantial evidence of a trout in the milk might lead a human to conclude the milk is adulterated with water as a nefarious scheme, but to an LLM that's a hallucination on par with a stone in the milk: it's going to have a hell of a time 'jumping' to a consistent but very uncommon interpretation, and if it does get there it'll constantly be gaslighting itself and offering other explanations than the truth.
admiralrohan · 41d ago
Hallucinating is fine but overconfidence is the problem. But I heard it's not an easy problem to solve.
Terr_ · 41d ago
> overconfidence is the problem.
The problem is a bit deeper than that, because what we perceive as "confidence" is itself also an illusion.
The (real) algorithm takes documents and makes them longer, and some humans configured a document that looks like a conversation between "User" and "AssistantBot", and they also wrote some code to act-out things that look like dialogue for one of the characters. The (real) trait of confidence involves next-token statistics.
In contrast, the character named AssistantBot is "overconfident" in exactly the same sense that a character named Count Dracula is "immortal", "brooding", or "fearful" of garlic, crucifixes, and sunlight. Fictional traits we perceive on fictional characters from reading text.
Yes, we can set up a script where the narrator periodically re-describes AssistantBot as careful and cautious, and that might help a bit with stopping humans from over-trusting the story they are being read. But trying to ensure logical conclusions arise from cautious reasoning is... well, indirect at best, much like trying to make it better at math by narrating "AssistantBot was good at math and diligent at checking the numbers."
> Hallucinating
P.S.: "Hallucinations" and prompt-injection are non-ironic examples of "it's not a bug, it's a feature". There's no minor magic incantation that'll permanently banish them without damaging how it all works.
plaguuuuuu · 40d ago
I'd love to know if the conversational training set includes documents where the AI throws its hands up and goes "actually I have no idea". I'm guessing not.
Terr_ · 40d ago
There's also the problem of whether the LLM would learn to generate stories where the AssistantBot gives up in cases that match our own logical reasons, versus ones where the AssistantBot gives up because that's simply what AssistantBots in training-stories usually do when the User character uses words of disagreement and disapproval.
patates · 41d ago
Hallucinating is a confidence problem, no?
Say, they should be 100% confident that "0.3" follows "0.2 + 0.1 =", but a lot of floating point examples on the internet make them less confident.
On a much more nuanced problem, "0.30000000000000004" may get more and more confidence.
This is what makes them "hallucinate", did I get it wrong? (in other words, am I hallucinating myself? :) )
jacksnipe · 41d ago
Unfortunately, in the system most of us work in today, I think overconfidence is an intelligent behavior.
onemoresoop · 41d ago
I find it extremely dumb to see overconfident people that really have nothing special about them or are even incompetent. These people are not contributing positively to the system, quite on the contrary.
xwolfi · 41d ago
But we don't work for the system, fundamentally, we work for ourselves, and the system incentivizes us to work for it by aligning our constraints: if you work that direction, you'll get that reward.
Overconfident people ofc do not contribute positively to the system, but they skew the system reward's calculation towards them: I swear I've done that work in that direction, where's my reward ?
In a sense, they are extremely successful: they managed to do very low effort, get very high reward, help themselves like all of us but at a much better profit margin, by sacrificing a system that, let's be honest, none of us care about really.
Your problem maybe, is that you swallowed the little BS the system fed you while incentivizing you: that the system matters more than yourself, at least at a greater extent than healthy ?
And you see the same thing with AI: these things convince people so deeply of their intelligence that it blew to such proportion that NVidia is now worth trillions. I had a colleague mumbling yesterday that his wife now speaks more with ChatGPT than him. Overconfidence is a positive attribute... for oneself.
butlike · 40d ago
Overconfident people are conquerors. Conquerors do not contribute positively to a harmonious system, true, but I'm not so sure we can glean the system is supposed to be harmonious.
If one contributes "positively" to the system, everyone's value increases and the solution becomes more homogenized. Once the system is homogenized enough, it becomes vulnerable to adversity from an outside force.
If the system is not harmonious/non-homoginized, the attacker would be drawn to the most powerful point in the system.
Overconfident people aren't evil, they're simply stressing the system to make sure it can handle adversity from an outside force. They're saying: "listen, I'm going to take what you have, and you should be so happy that's all I'm taking."
So I think overconfidence is a positive attribute for the system as well as for the overconfident individual. It's not a positive attribute for the local parties getting run over by the overconfident individual.
onemoresoop · 41d ago
Not talking about THE system or any system in particular same way gaming the system doesn’t refer to any system but just cheating in general. And if you like overconfident people good for you, I can’t stand them because they’re hollow with no basis in reality, flawed like everyone, just pumping out their egos with hot air. And your reasoning that overconfidence is a positive attribute doesn’t make much sense to me but we’re entitled to our own opinions.
jacksnipe · 40d ago
Yeah this is what I meant, both in the behavior being intelligent and it being unfortunate that this is the case. It’d be nice if the most self-maximizing behavior were also the best behavior for the global system, but it doesn’t seem that it is.
tomComb · 41d ago
But being like that can get you elected.
throw4847285 · 40d ago
That's a fallacy. There are certainly some unqualified elected leaders, but humans living in democratic societies have yet to shake the mental framework we've constructed from centuries without self-rule. We invest way more authority into a single executive than they actually have, and blame them for everything that goes wrong despite the fact that modern democracies are hugely complex systems in which authority is distributed across numerous people. When the government fails to meet people's needs, they lack the capacity to point at a particular Senator or a small party in a ruling coalition and blame them. It's always the executive.
Of course, the result is that people get fed up and decide that the problem has been not that democratic societies are hard to govern by design (they have to reflect the disparate desires of countless people) but that the executive was too weak. They get behind whatever candidate is charismatic enough to convince them that they will govern the way the people already thought the previous executives were governing, just badly. The result is an incompetent tyrant.
sim7c00 · 40d ago
and getting yourself elected while being underqualified is intelligent? i think its not. its stupid and damaging behavior based in selfish desires. about as far from intelligent you can get.
rcxdude · 40d ago
Intelligence is seperate from goals: if you're only interested in gaining power and wealth for yourself, then concern about the rest of the system is only incidental to what you can get for yourself.
Freedom2 · 41d ago
In what country?
tbossanova · 41d ago
Quite a few :(
chromehearts · 41d ago
All of them
butlike · 40d ago
Overconfidence allows you to act, increasing survivability. Thinking is a "weak" trait.
ForTheKidz · 41d ago
> I think of them as similar to human optical illusions.
What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans than anything to do with what we refer to as "hallucinations" in humans—only they don't have the ability to analyze whether or not they're making up something or accurately parameterizing the vibes. The only connection between the two concepts is that the initial imagery from DeepDream was super trippy.
majormajor · 41d ago
Inventiveness/creativity/imagination are deliberate things. LLM "hallucinations" are more akin to a student looking at a test over material they only 70% remember grabbing at what they think is the most likely correct answer. More "willful hope in the face of forgetting" than "creativity." Many LLM hallucinations - especially of the coding sort - are ones that would be obviously-wrong based on the training material, but the hundreds of languages/libraries/frameworks the thing was trained on start to blur together and there is not precise 100%-memorization recall but instead a "probably something like this" guess.
It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
TeMPOraL · 40d ago
> LLM "hallucinations" are more akin to a student looking at a test over material they only 70% remember grabbing at what they think is the most likely correct answer.
AKA. extrapolation. AKA. what everyone is doing to a lesser or greater degree, when consequences of stopping are worse than of getting this wrong.
That's not just the case of school, where giving up because you "don't know" is guaranteed F, while extrapolating has a non-zero chance of scoring you anything between F and A. It's also the case in everyday life, where you do things incrementally - getting the wrong answer is a stepping stone to getting a less wrong answer in the next attempt. We do that at every scale - from inner thought process all the way to large-scale engineering.
Hardly anyone learns 100% of the material, because that's just plain memorization. We're always extrapolating from incomplete information; more studying and more experience (and more smarts) just makes us more likely to get it right.
> It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
Depends. To a large extent, this kind of "hallucinations" is what a good programmer is supposed to be doing. That is, code to the API you'd like to have, inventing functions and classes convenient to you if they don't exist, and then see how to make this work - which, in one place, means fixing your own call sites, and in another, building utilities or a whole compat layer between your code and the actual API.
ForTheKidz · 41d ago
> Inventiveness/creativity/imagination are deliberate things.
Not really. At least, it's just as much a reflex as any other human behavior to my perception.
Anyway, why does intention—although I think this is mostly nonsensical/incoherent/a category error applied to LLMs—even matter to you? Either we have no goals and we're just idly discussing random word games (aka philosophy), which is fine with me, or we do have goals and whether or not you believe the software is intelligent or not is irrelevant. In the latter case anthropomorphizing discussion with words like "hallucination", "obviously", "deliberate", etc are just going to cause massive friction, distraction, and confusion. Why can't people be satisfied with "bad output"?
Applejinx · 40d ago
If and only if the LLM is able to bring the novel, unexpected connection into itself and see whether it forms other consistent networks that lead to newly common associations and paths.
A lot of us have had that experience. We use that ability to distinguish between 'genius thinkers' and 'kid overdosing on DMT'. It's not the ability to turn up the weird connections and go 'ooooh sparkly', it's whether you can build new associations that prove to be structurally sound.
If that turns out to be something self-modifying large models (not necessarily 'language' models!) can do, that'll be important indeed. I don't see fiddling with the 'temperature' as the same thing, that's more like the DMT analogy.
You can make the static model take a trip all you like, but if nothing changes nothing changes.
AdieuToLogic · 41d ago
> What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans ...
No.
What people call LLM "hallucinations" is the result of a PRNG[0] influencing an algorithm to pursue a less statistically probable branch without regard nor understanding.
That seems to be giving the system too much credit. Like "reduce the temperature and they'll go away." A more probable next word based on a huge general corpus of text is not necessarily a more correct one for a specific situation.
Consider the errors like "this math library will have this specific function" (based on a hundred other math libraries for other languages usually having that).
AdieuToLogic · 41d ago
> That seems to be giving the system too much credit. Like "reduce the temperature and they'll go away." A more probable next word based on a huge general corpus of text is not necessarily a more correct one for a specific situation.
I believe we are saying the same thing here. My clarification to the OP's statement:
What we call "hallucinations" is far more similar to what
we would call "inventiveness", "creativity", or
"imagination" in humans ...
Was that the algorithm has no concept of correctness (nor the other anthropomorphic attributes cited), but instead relies on pseudo-randomness to vary search paths when generating text.
cornel_io · 41d ago
There are various results that suggest that LLMs do internally have everything they'd need to know that they're hallucinating/wrong:
So I don't think it's that they have no concept of correctness, they do, but it's not strong enough. We're probably just not training them in ways that optimize for that over other desirable qualities, at least aggressively enough.
It's also clear to anyone who has used many different models over the years that the amount of hallucination goes down as the models get better, even without any special attention being (apparently) paid to that problem. GPT 3.5 was REALLY bad about this stuff, but 4o and o1 are at least mediocre. So it may be that it's just one of the tougher things for a model to figure out, even if it's possible with massive capacity and compute. But I'd say it's very clear that we're not in the world Gary Marcus wishes we were in, where there's some hard and fundamental limitation that keeps a transformer network from having the capability to be more truthful as a it gets better; rather, like all aspects, we just aren't as far along as we'd prefer.
ForTheKidz · 41d ago
> There are various results that suggest that LLMs do internally have everything they'd need to know that they're hallucinating/wrong
We need better definitions of what sort of reasonable expectation people can have for detecting incoherency and self-contradiction when humans are horrible at seeing this, except in comparison to things that don't seem to produce meaningful language in the general case. We all have contradictory worldviews and are therefore capable of rationally finding ourselves with conclusions that are trivially and empirically incoherent. I think "hallucinations" (horribly, horribly named term) are just an intractable burden of applying finite, lossy filters to a virtually continuous and infinitely detailed reality—language itself is sort of an ad-hoc, buggy consensus algorithm that's been sufficient to reproduce.
But yea if you're looking for a coherent and satisfying answer on idk politics, values, basically anything that hinges on floating signifiers, you're going to have a bad time.
(Or perhaps you're just hallucinating understanding and agreement: there are many phrases in the english language that read differently based on expected context and tone. It wouldn't surprise me if some models tended towards production of ambiguous or tautological semantics pleasingly-hedged or "responsibly"-moderated, aka PR.)
Personally, I don't think it's a problem. If you are willing to believe what a chatbot says without verifying it there's little advice I could give you that can help. It's also good training to remind yourself that confidence is a poor signal for correctness.
AdieuToLogic · 39d ago
> There are various results that suggest that LLMs do internally have everything they'd need to know that they're hallucinating/wrong:
The underlying requirement, which invalidates an LLM having "everything they'd need to know that they're hallucinating/wrong", is the premise all three assume - external detection.
From the first arxiv abstract:
Moreover, informed by the empirical observations, we show
great potential of using the guidance derived from LLM's
hidden representation space to mitigate hallucination.
From the second arxiv abstract:
Using this basic insight, we illustrate that one can
identify hallucinated references without ever consulting
any external resources, by asking a set of direct or
indirect queries to the language model about the
references. These queries can be considered as "consistency
checks."
From the Nature abstract:
Researchers need a general method for detecting
hallucinations in LLMs that works even with new and unseen
questions to which humans might not know the answer. Here
we develop new methods grounded in statistics, proposing
entropy-based uncertainty estimators for LLMs to detect a
subset of hallucinations—confabulations—which are arbitrary
and incorrect generations.
Ultimately, no matter what content is generated, it is up to a person to provide the understanding component.
> So I don't think it's that they have no concept of correctness, they do, but it's not strong enough.
Again, "correctness" is a determination solely made by a person evaluating a result in the context of what the person accepts, not intrinsic to an algorithm itself. All an algorithm can do is attempt to produce results congruent with whatever constraints it is configured to satisfy.
ForTheKidz · 41d ago
We really need an idiom for the behavior of being technically correct but absolutely destroying the prospect of interesting conversation. With this framing we might as well go back to arguing over which rock our local river god has blessed with greater utility. I'm not actually entirely convinced humans are capable of understanding much when discussion desired is this low quality.
Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation. The only thing intent is necessary for is to create something meaningful to humans—handily taken care of via prompt and training material, just like with humans.
(If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
AdieuToLogic · 39d ago
> We really need an idiom for the behavior of being technically correct but absolutely destroying the prospect of interesting conversation.
While it is not an idiom, the applicable term is likely pedantry[0].
> I'm not actually entirely convinced humans are capable of understanding much when discussion desired is this low quality.
Ignoring the judgemental qualifier, consider your original post to which I replied:
What we call "hallucinations" is far more similar to what
we would call "inventiveness", "creativity", or
"imagination" in humans ...
The term for this behavior is anthropomorphism[1] due to ascribing human behaviors/motivations to algorithmic constructs.
> Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation.
The same can be said for a random number generator and a permutation algorithm.
> (If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
If you can't tell, I differentiate between humans and algorithms, no matter the cleverness observed of the latter, as only the former can possess "understanding."
I dunno hallucinations seem like a pretty human type of mistake to me.
when i try to remember something my brain often synthesizes new things by filling in the gaps.
This would be where I often say "i might be imagining it, but..." or "i could have sworn there was a..."
In such cases the thing that saves the human brain is double checking against reality (e.g. googling it to make sure).
Miscounting the number of r's in strawberry by glancing at the word also seems like a pretty human mistake.
gitaarik · 41d ago
But it's different kinds of hallucinations.
AI doesn't have a base understanding of how physics work. So they think it's acceptible if in a video some element on the background in a next frame might appear in front of another element that is on the foreground.
So it's always necessary to keep correcting LLMs, because they only learn by example, and you can't express any possible outcome of any physical process just by example, because physical processes can be in infinate variations. LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
So you can never really trust an LLM. If we want to make an AI that doesn't make errors, it should understand how physics works.
pydry · 40d ago
I dont think the errors really are all that different. Ever since GPT-3.5 came out Ive been thinking that the errors were ones a human could have made under a similar context.
>LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
Like humans.
>So you can never really trust an LLM.
Cant really trust a human either. That's why we set up elaborate human systems (science, checks and balances in government, law, freedom of speech, markets) to mitigate our constant tendency to be complete fuck ups. We hallucinate science that does not exist, lies to maintain our worldview, jump to conclusions about guilt, build businesses based upon bad beliefs, etc.
>If we want to make an AI that doesn't make errors, it should understand how physics works
An AI that doesnt make errors wouldnt be AGI it would be a godlike superintelligence. I dont think thats even feasible. I think a propensity to make errors is intrinsic to how intelligence functions.
Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
gitaarik · 40d ago
But if you ask a human to draw / illustrate a physical setting, they would never draw something that is physically impossible, because it's obvious to a human.
Of course we make all kinds of little mistakes, but at least we can see that they are mistakes. An LLM can't see it's own mistakes, it needs to be corrected by a human.
> Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
Yeah but that would then not be al LLM or machine learned thing. We would program it so that it understands the rules of physics, and then it can interpret things based on those rules. But that is a totally different kind of AI, or rather a true AI instead of a next-word predictor that looks like an AI. But the development of such AIs goes a lot slower because you can't just keep training it, you actually have to program it. But LLMs can actually help program it ;). Although LLMs are mostly good at currently existing technologies and not necessarily new ones.
antasvara · 40d ago
To be clear, I'm not saying that LLM's exclusively make non-human errors. I'm more saying that most errors are happening for different "reasons" than humans.
Think about the strawberry example. I've seen a lot of articles lately where not all misspellings of the word "strawberry" reliably give letter counting errors. The general sentiment there is human, but the specific pattern of misspelling is really more unique to LLM's (i.e. different spelling errors would impact humans versus LLM's).
The part that makes it challenging is that we don't know these "triggers." You could have a prompt that has 95% accuracy, but that inexplicably drops to 50% if the word "green" is in the question (or something like that).
j45 · 41d ago
Some of the errors are caused by humans. Say, due to changing the chat to only pay attention to recent messages and not the middle, omitting critical details.
tharkun__ · 41d ago
I don't think that's universally true. We have different humans with different levels of ability to catch errors. I see that with my teams. Some people can debug. Some can't. Some people can write tests. Some can't. Some people can catch stuff in reviews. Some can't.
I asked Sonnet 3.7 in Cursor to fix a failing test. While it made the necessary fix, it also updated a hard-coded expected constant to instead be computed using the same algorithm as the original file, instead of preserving the constant as the test was originally written.
Guess what?
Guess the number of times I had to correct this from humans doing it in their tests over my career!
And guess where the models learned the bad behavior from.
fn-mote · 41d ago
> Some people can debug. Some can't. Some people can write tests. Some can't.
Wait… really?
No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
People whose skills you use in other ways because they are more productive? Maybe. But still. Clean up after yourself. It’s something that should be learned in the apprentice phase.
tharkun__ · 41d ago
Like my sibling says, you can't always choose. That's one side of that coin.
The other is: Some people are naturally good at writing "green field" (or re-writing everything) and do produce actual good software.
But these same people, which you do want to keep around if that's the best you can get, are next to useless when you throw a customer reported bug at them. Takes them ages to figure anything out and they go down endless rabbit holes chasing the wrong path for hours.
You also have people that are super awesome at debugging. They have knack for seeing some brokenness and having the right idea or an idea of the right direction to investigate in right away, can apply the scientific method to test their theories and have the bug fixed in the time it take one of these other people to go down even a single of the rabbit holes they will go down. But these same people in some cases are next to useless if you ask them to properly structure a new green field feature or rewrite parts of something to use some new library coz the old one is no longer maintained or something and digging through said new library and how it works.
Both of these types of people are not bad in and of themselves. Especially if you can't get the unicorns that can do all of these things well (or well enough), e.g. because your company can't or won't pay for it or only for a few of them, which they might call "Staff level".
And you'd be amazed how easy it is to get quite a few review comments in for even Staff level people if you basically ignore their actual code and just jump right into the tests. It's a pet peeve of mine. I start with the tests and go from there when reviewing :)
What you really don't want is if someone is not good at any of these of course.
groby_b · 41d ago
> No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
Those are almost entry stakes at tier-one companies. (There are still people who can't, it's just much less common)
In your average CRUD/enterprise automation/one-off shellscript factory, the state of skills is... not fun.
There's a reason there's the old saw of "some people have twenty years experience, some have the same year 20 times over". People learn & grow when they are challenged to, and will mostly settle at acquiring the minimum skill level that lets them do their particular work.
And since we as an industry decided to pretend we're a "science", not skills based, we don't have a decent apprenticeship system that would force a minimum bar.
And whenever we discuss LLMs and how they might replace software engineering, I keep remembering that they'll be prompted by the people who set that hiring bar and thought they did well.
30minAdayHN · 41d ago
Little tangent: I realized that currently LLMs can't debug because they only have access to the compile time (just code). Many bugs happen due to run time complex state. If I can make LLMs think like a productive Dev who can debug, then would they become more efficient?
Hoping I can avoid debug death loop where I get into this bad loop of copy pasting the error and hoping LLM would get it right this one time :)
jsight · 40d ago
Yeah, IMO this is also why they can be so terrible at UI. They don't really have a feedback loop for a lot of code related issues yet.
This is changing and I really expect everything to be different 12 months from now.
30minAdayHN · 40d ago
In general, the theme I'm seeing is that we are providing the old tools to a new way of software engineering. Similar to you, I think the abstractions and tools we will work with will be radically different.
Some things I am thinking about:
* Does git make sense if the code is not the abstraction you work with? For example, when I'm vibe coding, my friend is spending 3hrs trying to understand what I did by reading code. Instead, he should be reading all my chat interactions. So I wonder if there is a new version control paradigm
* Logging: Can we auto instrument logging into frameworks that will be fed to LLMs
* Architecture: Should we just view code as bunch of blocks and interactions instead of reading actual LOC. What if, all I care is block diagrams. And I tell tools like cursor, implement X by adding Y module.
tharkun__ · 40d ago
Regarding reading all your chat interactions I'd find that a really tedious way to understand what the code will actually be doing. You might have "vibe coded" for 3 hours, resulting in a bunch of code I can read and understand in a half hour just as well. And I'm not interested in all the "in between" where you're correcting the LLM misunderstanding etc. I only care about the result and whether it does the right thing(s) and whether the code is readable.
If the use of an LLM results in hard to understand spaghetti code that hides intent then I think that's a really bad thing and is why the code should still go through code review. If you, with or without the help of an LLM create bad code, that's still bad code. And without the code and just the chat history we have no idea what we even actually get in the end.
ipsento606 · 41d ago
I've been a professional engineer for over a decade, and in that time I've only had one position where I was expected to write any tests. All my other positions, we have no automated testing of any kind.
david422 · 41d ago
I worked with a new co-worker that ... had trouble writing code, and tests. He would write a test that tested nothing. At first I thought he might be green and just needed some direction - we all start somewhere. But he had on his bio that he had 10 years of experience in software dev in the language we were working in. I couldn't quite figure out what the disconnect was, he ended up leaving a short time later.
nurettin · 41d ago
I've worked with these sorts of people. It is never clear why they don't perform. One of them had clinical depression, another claimed to have low blood values that they simply couldn't fix. And one other just didn't seem to have any working memory beyond one sentence for whatever reason. Do people become like that? Are we going to become like that? It is a scary thought.
hobs · 41d ago
Keyword want - most people don't control who their peers are, and complaining to your boss doesn't get you that far, especially when said useless boss is fostering said useless person.
__MatrixMan__ · 41d ago
I agree. I've been been struck by how remarkably understandable the errors are. It's quite often something that I'd have done myself if I wasn't paying attention to the right thing.
skerit · 40d ago
Claude Sonnet 3.7 really, really loves to rewrite tests so they'll pass. I've had it happen many times in a claude-code session, I had to add this to each request (though it did not fix it 100%)
- Never disable, skip, or comment out failing unit tests. If a unit test fails, fix the root cause of the exception.
- Never change the unit test in such a way that it avoids testing the failing feature (e.g., by removing assertions, adding empty try/catch blocks, or making tests trivial).
- Do not mark tests with @Ignore or equivalent annotations.
- Do not introduce conditional logic that skips test cases under certain conditions.
- Always ensure the unit test continues to properly validate the intended functionality.
jsight · 40d ago
I'm guessing this is a side effect of mistakes in the reinforcement learning face. It'd be really easy to build a reward model that favors passing tests, without properly measuring the quality of those tests.
sorokod · 41d ago
You may find this interesting: "AI Mistakes Are Very Different from Human Mistakes"
Agree, but I would point out that the errors that I make are selected on the fact that I don't notice I'm making them, which tips the scale toward LLM errors being not as bad.
worldsayshi · 41d ago
Yeah it's the reason pair programming is nice. Now the bugs need to pass two filters instead of one. Although I suppose LLM's aren't that good at catching my bugs without me pointing them out.
diggan · 41d ago
I've found both various ChatGPT and Claude to be pretty good at finding unknown bugs, but you need a somewhat hefty prompt.
Personally I use a prompt that goes something like this (shortened here): "Go through all the code below and analyze everything it's doing step-by-step. Then try to explain the overall purpose of the code based on your analysis. Then think through all the edge-cases and tradeoffs based on the purpose, and finally go through the code again and see if you can spot anything weird"
Basically, I tried to think of what I do when I try to spot bugs in code, then I just wrote a reusable prompt that basically repeats my own process.
worldsayshi · 41d ago
It's so interesting that you can tell it to think about a thing and then it does that.
Sounds like a nice prompt to run automatically on PRs.
vanschelven · 41d ago
Nevermind designing _systems_ that account for this, even just debugging such errors is much harder than ones you create yourself:
For that case, it sounds more like having your tools commit for you after each change, as is the default for Aider, is the real winner. "git log -p" would have exposed that crazy import in minutes instead of hours.
commit early, commit often.
danenania · 41d ago
I’m working an AI coding agent[1], and all changes accumulate in a sandbox by default that is isolated from the project.
Auto-commit is also enabled (by default) when you do apply the changes to your project, but I think keeping them separated until you review is better for higher stakes work and goes a long way to protect you from stray edits getting left behind.
One problem with keeping the changes separate is the LLM usually wants to test the code with the incremental new changes. So you need a working tree that has all the new changes. But then... why not use the real one?
danenania · 41d ago
Plandex can tentatively apply the changes in order to execute commands (tests, builds, or whatever), then commit if they succeed or roll back if they fail.
fragmede · 41d ago
If you implement the sandbox as a git branch, then we're on the same page.
danenania · 41d ago
It's built on top of git, but offers better separation imho than just a separate branch.
For one thing, you have to always remember to check out that branch before you start making changes with the LLM. It's easy to forget.
Second, even if you're on a branch, it doesn't protect you from your own changes getting interleaved with the model's changes. You can get into a situation where you can't easily roll back and instead have to pick apart your work and the model's output.
By defaulting to the sandbox, it 'just works' and you can be sure that nothing will end up in the codebase without being checked first.
fragmede · 41d ago
If the latest change is bad, how do you go back in your sandbox? How do you go back three steps? If you make a change outside the sandbox, how do you copy it in? How do you copy them out? How do you deinterleave the changes then?
In order for this sandbox to actually be useful, you're going to end up implementing a source control mechanism. If you're going to do that, might as well just use git, even if just on the backend and commit to a branch behind the scenes that the user never sees, or by using worktree, or any other pieces of it.
Take a good long think about how this sandbox will actually work in practice. Switch to the sandbox, LLM some code, save it, handwrite some code, then switch to the sandbox again, LLM some code, switch out. Try and go backwards half the LLM change. Wish you'd committed the LLM changes while you were working on the.
By the time you've got a handle on it, rembering to switch git branch is the least of your troubles.
danenania · 41d ago
This is all implemented and working, just to be clear, and is being used in production. Everything you mentioned in your comment is covered.
You can also create branches within the sandbox to try different approaches, again with no risk of anything being left behind in your project until it’s ready.
So instead of just learning git, which everyone uses, your users now have to learn git AND plandex commands? In addition to knowing git branch -D, I also need to know plandex delete-branch?
I'm sure it's a win for you since I'm guessing you're the writer of plandex, but you do see how that's just extra overhead instead of just learning git, yeah?
I don't know your target market, so maybe there is a PMF to be found with people who are scared of git and would rather the added overhead of yet another command to learn so they can avoid learning git while using AI.
danenania · 39d ago
I hear you, but I don't think git alone (a single repository, at least) provides what is needed for the ideal workflow. Would you agree there are drawbacks to committing by default compared to a sandbox?
Version control in Plandex is like 4 commands. It’s objectively far simpler than using git directly, providing you the few operations you need without all the baggage. It wouldn't be a win for me to add new commands if only git was necessary, because then the user experience would be worse, but I truly think there's a lot of extra value for the developer in a sandbox layer with a very simple interface.
I should also mention that Plandex also integrates with the project's git repo just like aider does, so you can turn on auto-apply for effectively the same exact functionality if that's what you prefer. Just check out a new branch in git, start the Plandex REPL in a project directory with `plandex`, and run `\set-config auto-apply true`. But if you want additional safety, the sandbox is there for you to use.
fragmede · 39d ago
The problem is I'm too comfortable with git, so I don't see the drawbacks to committing by default. I'm open to hearing about the shortcomings and how I'd address them, though that may not be reasonable to expect for your users.
The problem isn't the four Plandex version control commands or how hard they are to understand in isolation, it's that users now have to adjust their mental model of the system and bolt that onto the side of their limited understanding of git because there's now a plandex branch and there's a git branch and which one was I on and oh god how do they work together?
zahlman · 41d ago
FTA:
> Note that it took me about two hours to debug this, despite the problem being freshly introduced. (Because I hadn’t committed yet, and had established that the previous commit was fine, I could have just run git diff to see what had changed).
> In fact, I did run git diff and git diff --staged multiple times. But who would think to look at the import statements? The import statement is the last place you’d expect a bug to be introduced.
fragmede · 41d ago
git diff != git log.
To expand on that, the problem with only having git diff is there's no way to go backwards halfway. You can't step backwards in time until you get to the bad commit just before the good commit, and then do a precise diff between the two. (aka git bisect) Reviewing 300 lines out of git diff and trying to find the bug somewhere in there is harder than when there are only 10.
dpacmittal · 41d ago
I just prompted cursor to remove a string from a svelte app. It created a boolean variable showString, set it as false and then proceeded to use that to hide the string
rzk · 41d ago
> The LLM knows nothing about your requirements. When you ask it to do something without specifying all of the constraints, it will fill in all the blanks with the most probable answers from the universe of its training set. Maybe this is fine. But if you need something more custom, it’s up to you to actually tell the LLM about it.
Reminds of the saying:
“To replace programmers with AI, clients will have to accurately describe what they want.
We're safe.”
jonahx · 40d ago
> “To replace programmers with AI, clients will have to accurately describe what they want. We're safe.”
I've had similar sentiments often and it gets to the heart of things.
And it's true... for now.
The caveat is that LLMs already can, in some cases, notice that you are doing something in a non-standard way, or even sub-optimal way, and make "Perhaps what you meant was..." type of suggestions. Similarly, they'll offer responses like "Option 1", "Option 2", etc. Ofc, most clients want someone else to sort through the options...
Also, LLMs don't seem to be good at assessment across multiple abstraction levels. Meaning, they'll notice a better option given the approach directly suggested by your question, but not that the whole approach is misguided and should be re-thought. The classic XY problem (https://en.wikipedia.org/wiki/XY_problem).
In theory, though, I don't see why they couldn't keep improving across these dimensions. With that said, even if they do, I suspect many people will still pay a human to interact with the LLM for them for complex tasks, until the difference between human UI and LLM UI all but vanishes.
daxfohl · 40d ago
Yeah, the difference having a human in the loop makes is the ability to have that feedback. Did you think about X? Requirement Y is vague. Z and W seem to conflict.
Up to now, all our attempts to "compile" requirements to code have failed, because it turns out that specifying every nuance into a requirements doc in one shot is unreasonable; you may as well skip the requirements in English and just write them in Java at that point.
But with AI assistants, they can (eventually, presumptively) enable that feedback loop, do the code, and iterate on the requirements, all much faster and more precisely than a human could.
Whether that's possible remains to be seen, but I'd not say human coders are out of the woods just yet.
colonCapitalDee · 41d ago
> Preparatory Refactoring says that you should first refactor to make a change easy, and then make the change. The refactor change can be quite involved, but because it is semantics preserving, it is easier to evaluate than the change itself.
> In human software engineering, a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are. Often, your problem space is constrained enough that once you write down all of the requirements, the solution is uniquely determined; without the requirements, it’s easy to devolve into a haze of arguing over particular solutions.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary. But at some point, it’s a good idea to just slog through reading the docs from top-to-bottom, to get a full understanding of what is and is not possible in the software.
> The Walking Skeleton is the minimum, crappy implementation of an end-to-end system that has all of the pieces you need. The point is to get the end-to-end system working first, and only then start improving the various pieces.
> When there is a bug, there are broadly two ways you can try to fix it. One way is to randomly try things based on vibes and hope you get lucky. The other is to systematically examine your assumptions about how the system works and figure out where reality mismatches your expectations.
> The Rule of Three in software says that you should be willing to duplicate a piece of code once, but on the third copy you should refactor. This is a refinement on DRY (Don’t Repeat Yourself) accounting for the fact that it might not necessarily be obvious how to eliminate a duplication, and waiting until the third occurrence might clarify.
These are lessons that I've learned the hard way (for some definition of "learned", these things are simple but not easy), but I've never seen them phrased to succinctly and accurately before. Well done OP!
duxup · 41d ago
"Preparatory Refactoring says that you should first refactor to make a change easy, and then make the change. "
Amen. I'll be refactoring something and a coworker will say "Wow you did that fast." and I'll tell them I'm not done... those PRs were just to prepare for the final work.
Sometimes after all my testing I'll even leave the "prepared" changes in production for a bit just to be 100% sure something strange wasn't missed. THEN the real changes can begin.
skydhash · 41d ago
> a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are.
This is a quick way to determine if you're in the wrong team. When you're trying to determine the requirements and the manager/client is evading you. As if you're supposed to magically have all the answers.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary.
I tried to use the guides and code examples instead (if they exists). One thing that helps a lot when the library is complex, is to have a prototype that you can poke at to learn the domain. Very ugly code, but will help to learn where all the pieces are.
Singletoned · 39d ago
"The Rule of Three" I have been expressing as "it takes 3 points to make a straight line".
Any two points will look as if they are on a straight line, but you need a third point to confirm the pattern as being a straight line
taberiand · 41d ago
Based on the list, LLMs are at a "very smart junior programmer" level of coding - though with a much broader knowledge base than you'd expect from even a senior. They lack bigger-picture thinking, and default to doing what is asked of them instead of what needs to be done.
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
nomel · 41d ago
> and default to doing what is asked of them instead of what needs to be done.
I don't think it's that simple.
From what I've found, there are "attractors" in the statistics. If a part of your problem is too similar to a very common problem, that the LLM saw a million times, the output will be attracted to those overwhelming statistical next-words, which is understandable. That is the problem I run into most often.
Groxx · 41d ago
It's a constant struggle for me too, both "in the large" and small situations. Using a library which provides special-cased versions of common concepts, like "futures"? You'll get non-stop mistakes and misuses, even if you've got correct ones right next to it, or feed it reams of careful documentation. Got a variable with a name that sounds like it might be a dictionary (e.g. `storesByCity`), but it's actually a list? It'll try to iterate over it like a dictionary, point out "bugs" related to unsorted iteration, and will return `var.Values()` instead of `var` when your func returns a list. Practically every single time, even after multiple rounds of "that's a list"-like feedback or giving it the compilation errors. Got a Clean-Code-like structure in some things but not others? Watch as it assumes everything follows it all the time despite massive evidence to the contrary.
They're rather impressive when building common things in common ways, and a LOT of programming does fit that. But once you step outside that they feel like a pretty strong net negative - some occasional positive surprises, but lots of easy-to-miss mistakes.
techpineapple · 41d ago
I ran into this with cursor a lot. It would keep redoing changes that I explicitly told it I didn’t want. I was coding a game and it would assume things like the players gold should increment at a rate of 5 per tick, then keep putting it back when I said remove it!
taberiand · 41d ago
Oh sure, the flip side of doing what was asked is doing what is known - choosing a solution based on familiarity rather than applicability. Also a common trait in juniors in my experience
nomel · 41d ago
Related, the similarities I see when a human runs out of context window are really interesting.
I do a lot of interviews, and the poor performers usually end up running out of working memory and start behaving very similar to an LLM. Corrections/input from me will go into one ear and fall out the other, they'll start hallucinating aspects of the problem statement in an attractor sort of way, they'll get stuck in loops, etc. I write down when this happens in my notes, and it's very consistently 15 minutes. For all of them, it seems to be the lack of familiarity doesn't allow them to compress/compartmentalize the problem into something that fits in their head. I suspect it's similar for the LLM.
DanHulton · 41d ago
This was my thought when browsing this list, too, and it helped crystalize one of the feelings I had when trying to with with LLMs for coding: I'm a senior developer, and I want to develop as a senior developer does, and turn in senior developer-quality code. I _don't_ want to spend the rest of my career in development simply pairing with/babysitting a junior developer who will never learn from their mistakes. It may be quicker in the short run in some cases, but the code won't be as good and I'm likely to burn out, further amplifying the quality issue.
> I expect the models will continue improving though
I try to push back on this every time I see it as an excuse for current model behaviour, because what if they don't? Like, not appreciably enough to make a real difference? What if this is just a fundamental problem that remains with this class of AI?
Sure, we've seen incredible improvements over a short period of time in model capability, but those improvements have been visibly slowing down, and models have gotten much more expensive to train. Not to mention that a lot of the problem issues mentioned in this list are problems that these models have had for several generations now, and haven't gotten appreciably better, even while other model capabilities have.
I'm saying this not to criticize you, but more to draw attention to our tendency to handwave away LLM problems with a nebulous "but they'll get better so they won't be a problem." We don't actually know that, so we should factor that uncertainly into our analysis, not dismiss it as is commonly done.
ezyang · 41d ago
I definitely agree that for current models, the problem is finding where the LLM has comparative advantage. Usually it's something like (1) something boring, (2) something where you don't have any of the low level syntax or domain knowledge, or (3) you are on manager schedule and you need to delegate actual coding.
threeseed · 41d ago
I wonder if people who say LLMs are a smart junior programmer have ever used LLMs for coding or actually worked with a junior programmer before. Because for me the two are not even remotely comparable.
If I ask Claude to do a basic operation on all files in my codebase it won't do it. Half way through it will get distracted and do something else or simply change the operation. No junior programmer will ever do this. And similar for the other examples in the blog.
taberiand · 41d ago
Right, that is their main limitation currently - unable to consider the full system context when operating on a specific feature. But you must work with excellent juniors (or I work with very poor ones) because getting them to think about changes in the context of the bigger picture is a challenge.
qingcharles · 41d ago
This is definitely a huge factor I see in the mistakes. If I hand an LLM some other parts of the codebase along with my request so that it has more context, it makes less mistakes.
These problems are getting solved as LLMs improve in terms of context length and having the tools send the LLM all the information it needs.
ohgr · 41d ago
Yep. My usual sort of conversation with an LLM is MUCH worse than a junior developer...
Write me a parser in R for nginx logs for kubernetes that loads a log file into a tibble.
Fucks sake not normal nginx logs. nginx-ingress.
Use tidyverse. Why are you using base R? No one does that any more.
Why the hell are you writing a regex? It doesn't handle square brackets and the format you're using is wrong. Use the function read_log instead.
No don't write a function called read_log. Use the one from readr you drunk ass piece of shit.
Ok now we're getting somewhere. Now label all the columns by the fields in original nginx format properly.
What the fuck? What have you done! Fuck you I'm going to just do it myself.
... 5 minutes later I did a better job ...
woah · 41d ago
It's a machine, use it like one
ohgr · 41d ago
Yeah I did. I used the text editor to assemble the libraries into something that worked.
LinXitoW · 40d ago
I mean, I'd never expect a junior to do better for such a highly specific task.
I expect I'd have to hand feed them steps, at which point I imagine the LLM will also do much better.
fn-mote · 41d ago
Except for the first paragraph, I couldn’t tell if you were talking to an incompetent junior or an LLM.
I expected the lack of breadth from the junior, actually.
ohgr · 41d ago
I swear at the junior programmers less.
To be fair the guys I get are pretty good and actually learn. The model doesn't. I have to have the same arguments over and over again with the model. Then I have to retain what arguments I had last time. Then when they update the model it comes up with new stupid things I have to argue with it on.
Net loss for me. I have no idea how people are finding these things productive unless they really don't know or care what garbage comes out.
groby_b · 41d ago
> the guys I get are pretty good and actually learn. The model doesn't.
Core issue. LLMs never ever leave their base level unless you actively modify the prompt. I suppose you _could_ use finetuning to whip it into a useful shape, but that's a lot of work. (https://arxiv.org/pdf/2308.09895 is a good read)
But the flip side of that core issue is that if the base level is high, they're good. Which means for Python & JS, they're pretty darn good. Making pandas garbage work? Just the task for an LLM.
But yeah, R & nginx is not a major part of their original training data, and so they're stuck at "no clue, whatever stackoverflow on similar keywords said".
vipshek · 41d ago
Perhaps swearing at the LLM actually produces worse results?
Not sure if you’re being figurative, but if what you wrote in your first comment is indicative of the tone with which you prompt the LLM, then I’m not surprised you get terrible results. Swearing at the model doesn’t help it produce better code. The model isn’t going to be intimidated by you or worried about losing their job—which I bet your junior engineers are.
Ultimately, prompting LLMs is simply a matter of writing well. Some people seem to write prompts like flippant Slack messages, expecting the LLM to somehow have a dialogue with you to clarify your poorly-framed, half-assed requirement statements. That’s just not how they work. Specify what you actually want and they can execute on that. Why do you expect the LLM to read your mind and know the shape of nginx logs vs nginx-ingress logs? Why not provide an example in the prompt?
It’s odd—I go out of my way to “treat” the LLMs with respect, and find myself feeling an emotional reaction when others write to them with lots of negativity. Not sure what to make of that.
ohgr · 41d ago
That's more my inner monologue than what is typed into the LLM.
qingcharles · 41d ago
But at the same time it'll write me 2000 lines of really gnarly text parsing code in a very optimized fashion that would have taken a senior dev all day to crank out.
We have to stop trying to compare them to a human, because they are alien. They make mistakes humans wouldn't, and they complete very difficult tasks that would be tedious and difficult for humans. All in the same output.
I'm net-positive from using AI, though. It can definitely remove a lot of tedium.
curious_cat_163 · 41d ago
> If I ask Claude to do a basic operation on all files in my codebase it won't do it.
Not sure exactly how you used Claude for this, but maybe try doing this in Cursor (which also uses Claude by default)?
I have had pretty good luck with it "reasoning" about the entire codebase of a small-ish webapp.
zarathustreal · 41d ago
Since when is “do something on every file in my codebase” considered coding?
andoando · 41d ago
Maybe its not but its a comparatively simple task a junior developer can do.
threeseed · 41d ago
Refactoring has been a thing since well forever.
ohgr · 41d ago
Well that's the hard bit I really want help with because it takes time.
I can do the rest myself because I'm not a dribbling moron.
lelanthran · 41d ago
> I expect the models will continue improving though,
How? They've already been trained on all the code in the world at this point, so that's a dead end.
The only other option I see is increasing the context window, which has diminishing returns already (double the window for a 10% increase in accuracy, for example).
We're in a local maxima here.
dcre · 41d ago
This makes no sense. Claude 3.7 Sonnet is better than Claude 3.5 Sonnet and it’s not because it’s trained on more of the world’s code. The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.
lelanthran · 41d ago
> The models are improving in a variety of ways, whether by being larger, faster, using the same number of parameters more effectively, better RLHF techniques, better inference-time compute techniques, etc.
I didn't say they weren't improving.
I said there's diminishing returns.
There's been more effort put into LLMs in the last two years than in the two years prior, but the gains in the last two years have been much much smaller than in the two years prior.
That's what I meant by diminishing returns: the gains we see are not proportional to the effort invested.
dcre · 40d ago
You said we're in a local maximum. Your comment was at odds with itself.
taberiand · 41d ago
One way is mentioned in the article, expanding and improving MCP integrations
- give the models the tools to work more effectively within their limitations on problems in the context of the full system.
ezyang · 41d ago
Hi Hacker News! One of the things about this blog that has gotten a bit unwieldy as I've added more entries is that it's a sort of undifferentiated pile of posts. I want some sort of organization system but I haven't found one that's good. Very open to suggestions!
joshka · 41d ago
What about adding a bit more structure and investing in a pattern language approach like what you might find in a book by Fowler or a site like https://refactoring.guru/. You're much of the way there with the naming and content, but could refactor the content a bit better into headings (Problem, Symptoms, Examples, Mitigation, Related, etc.)
You could even pretty easily use an LLM to do most of the work for you in fixing it up.
Add a short 1-2 sentence summary[1] to each item and render that on the index page.
Maybe organize them more clearly split between observed pitfalls/blindspots and prescriptions. Some of the articles (Use automatic formatting) are Practice forward, while others are pitfall forward. I like how many of the articles have examples!
smusamashah · 41d ago
How about listing all if these on 1 single page? Will be easy to navigate/find.
ezyang · 41d ago
They are listed on one page right now! Haha
elicash · 41d ago
They're indexed on one page, but you can't scan/scroll through these short posts without clicking because the content itself isn't all on a single page, at least not that I can find.
(I also like the other idea of separating out pitfalls vs. prescriptions.)
lelandfe · 41d ago
Wordpress’s approach to this is giving each post a short description in addition to the main content. The excerpt gets displayed on the main list, which helps both to grok the post and keep the list from becoming unwieldy.
smusamashah · 40d ago
As in, all content on one page where the link just takes you to appropriate heading on the same page. These days you can do a lot on a single html.
rav · 41d ago
My suggestion: Change the color of visited links! Adding a "visited" color for links will make it easier for visitors to see which posts they have already read.
cookie_monsta · 41d ago
Some sort of navigation would be nice a prev/next or some way to avoid having to go back to the links page all the time.
All of the pages that I visited were small enough that you could probably wrap them them <details> tags[1] and avoid navigation altogether
There was a blog posted here which had a slider for scoring different features (popularity, personal choice, etc). The rankings updated live with slider moves.
To be honest, current format worked perfectly for me: I ended up reading all entries without feeling something was off in how they were organized. I really really liked that each section had a concrete example, please don't remove that for future entries.
Thank you for sharing your insights! Very generous.
mncharity · 41d ago
In "Keep Files Small", there seems a lacuna: "for example, on Cursor 0.45.17, applying 55 edits on a 64KB file takes)."
sfink · 41d ago
When I saw the title, I knew what this was going to be. It made me want to immediately write a corresponding "Human Blindspots" blog post to counteract it, because I knew it was going to be the usual drivel about how the LLMs understand <X> but sometimes they don't quite manage to get the reasoning right, but not to worry because you can nudge them and their logical brains will then figure it out and do the right thing. They'll stop hallucinating and start functioning properly, and if they don't, just wait for the next generation and everything will be fine.
I was wrong. This is great! I really appreciate how you not only describe the problems, but also describe why they happen using terminology that shows you understand how these things work (rather than the usual crap that is based on how people imagine them to work or want them to work). Also, the examples are excellent.
It would be a bunch of work, but the organization I would like to see (alongside the current, not replacing it, because the one-page list works for me already) would require sketching out some kind of taxonomy of topics. Categories of ways that Sonnet gets things wrong, and perhaps categories of things that humans would like them to do (eg types of tasks, or skill/sophistication levels of users, or starting vs fixing vs summarizing/reviewing vs teaching, or whatever). But I haven't read through all of the posts yet, so I don't have a good sense for how applicable these categorizations might be.
I personally don't have nearly enough experience using LLMs to be able to write it up myself. So far, I haven't found LLMs very useful for the type of code I write (except when I'm playing with learning Rust; they're pretty good for that). I know I need to try them out more to really get a feel for their capabilities, but your writeups are the first I've found that I feel I can learn from without having to experience it all for myself first.
(Sorry if this sounds like spam. Too gushing with the praise? Are you bracing yourself for some sketchy URL to a gambling site?)
jonas21 · 41d ago
Maybe you should ask Claude.
duxup · 41d ago
I find LLMs WANT TO ANSWER TOO MUCH. If I give them too little data, they're not curious and they'll try to craft an answer when it's nearly impossible for them to be right.
I'll type and hit enter too early and I get an answer and think "This could never be right because I gave you broken sentences and too little." but there it goes answering away, dead wrong.
I would rather the LLM say "yo I don't know what you're talking I need more" but of course they're not really thinking so they don't do that / likely can't.
The LLM nature to run that word math and string SOMETHING together seems like an very serious footgun. Reminds me of the movie 2010 when they discuss how the HAL 9000 couldn't function correctly because it was told to lie despite its core programming to tell the truth. HAVING to answer seems like a serious impediment for AI. I see similar-ish things on google's gemini AI when I ask a question and it says the answer is "no" but then gives all the reasons the answer is clearly "yes".
jredwards · 41d ago
The most annoying thing I've found is that they always assume I'm right. If I ask a question, they assume the answer is yes, and will bend over backwards in an obsequious manner to ensure that I'm correct.
"Why of course, sir, we should absolutely be trying to compile python to assembly in order to run our tests. Why didn't I think of that? I'll redesign our testing strategy immediately."
j_bum · 41d ago
Ugh , I agree.
I would imagine this all comes from fine tuning, or RLHF, whatever is used.
I’d bet LLMs trained on the internet without the final “tweaking” steps would roast most of my questions … which is exactly what I want when I’m wrong without realizing it.
enraged_camel · 41d ago
>> The most annoying thing I've found is that they always assume I'm right.
Not always. The other day I described the architecture of a file upload feature I have on my website. I then told Claude that I want to change it. The response stunned me: it said "actually, the current architecture is the most common method, and it has these strengths over the other [also well-known] method you're describing..."
The question I asked it wasn't "explain the pros and cons of each approach" or even "should I change it". I had more or less made my decision and was just providing Claude with context. I really didn't expect a "what you have is the better way" type of answer.
hnbad · 40d ago
Similarly, with Claude in Cursor I've found that it will assume it's wrong when I even suggest that it might be: "Are you sure that's right? I've not seen that method before" will be followed by "I need to apologize, let me correct myself" and a wrong answer and this'll loop until eventually arriving at a worse version of what it suggested first even if I tell it "Nevermind, you were right in the first place. Let's go with that one".
mulmboy · 41d ago
Yeah I get this. Often I'll prompt it like "my intern looked at this and said maybe you should x. What do you think?"
Seems to help.
bredren · 41d ago
h/t to @knurlknurl on Reddit today shared these methods:
- “I need you to be my red team”(works really well with, Claude seems to understand the term)
“analyze the plan and highlight any weaknesses, counter arguments and blind spots
critically review”
> you can't just say "disagree with me", you have to prompt it into adding a "counter check".
duxup · 41d ago
It’s funny AI will happily follow my lead and “bounce too close to a supernova” and I really have to push it to offer something new.
magicmicah85 · 41d ago
I’ve been prefacing every code related question with “Do not write code. Ask me clarifying questions and let’s talk this out first”. Seems to help especially with planning and organizing a design rather than monkeying with code fixing it later.
bredren · 41d ago
I incorporate this into system prompts at the start of conversations and still find I have to emphasize it again over course of convos.
magicmicah85 · 40d ago
Yeah, they forget as the chat context gets too large. A good example I’ve had is where I’ve been using chartkick to create a lot of charts, and suddenly they want to use another ruby gem. I have to remind them. We’re using chart kick.
duxup · 41d ago
Thank you.
imoreno · 41d ago
It's possible to mitigate this with a conservative system prompt.
duxup · 41d ago
Do you have an example? I’m curious.
otabdeveloper4 · 40d ago
> I find LLMs WANT TO ANSWER TOO MUCH.
That's easy to fix. You need to add something like "give a succinct answer in one phrase" to your prompts.
jon_richards · 40d ago
I can’t tell if this was intentional, but it’s a hilarious joke. OP was referring to the decision to provide an “answer”, not the length of the response.
otabdeveloper4 · 37d ago
LLMs can't think. They're just fancy autocomplete with a lot of context.
This means you need to prompt them with a text that increases the probability of getting back what you want. Adding something about the length of the response will do that.
lukev · 41d ago
This is exceptionally useful advice, and precisely the way we should be talking about how to engage with LLMs when coding.
That said, I take issue with "Use Static Types".
I've actually had more success with Claude Code using Clojure than I have Typescript (the other thing I tried.)
Clojure emphasizes small, pure functions, to a high degree. Whereas (sometimes) fully understanding a strong type might involve reading several files. If I'm really good with my prompting to make sure that I have good example data for the entity types at each boundary point, it feels like it does a better job.
My intuition is that LLMs are fundamentally context-based, so they are naturally suited to an emphasis on functions over pure data, vs requiring understanding of a larger type/class hierarchy to perform well.
But it took me a while to figure out how to build these prompts and agent rules. A LLM programming in a dynamic language without a human supervising the high-level code structure and data model is a recipe for disaster.
torginus · 41d ago
I have one more - LLMs are terrible at counting and arithmetic - if your code gen relies on cutting off the first two words of a constant string - you better check if you need to cut off 12 characters like the LLM says. If it adds 2 numbers, it might be suspect. If you need it to decode a byte sequence, where getting the numbers from the exact right position is necessary.. you get the idea.
Took me a day to debug my LLM-generated code - and of course, like all fruitless and long debugging sessions, this one started with me assuming that it can't possibly get this wrong - yet it did.
datadrivenangel · 41d ago
Almost all of these are good things to consider with human coders as well. Product managers take note!
> I had some test cases with hard coded numbers that had wobbled and needed updating. I simply asked the LLM to keep rerunning the test and updating the numbers as necessary.
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
ezyang · 41d ago
In fact, the test framework I was using at the time (jest) did in fact support this. But the person who had originally written the tests hadn't had the foresight to use snapshot tests for this failing test!
diggan · 41d ago
I don't know if your message is a continuation of the sarcasm (I feel like maybe no?), but I'm pretty sure parent's joke is that if you just change the expected values whenever the code changes, you aren't really effectively "testing" anything as much as "recording" outputs.
akomtu · 41d ago
LLMs aren't AI. They are more like librarians with eidetic memory: they can discuss in depth any book in the library, but sooner or later you notice that they don't really understand what they are talking about.
One easy test for AI-ness is the optimization problem. Give it a relatively small, but complex program, e.g. a GPU shader on shadertoy.com, and tell it to optimize it. The output is clearly defined: it's an image or an animation. It's also easy to test how much it's improved the framerate. What's good is this task won't allow the typical LLM bullshitting: if it doesn't compile or doesn't draw a correct image, you'll see it.
The thing is, the current generation of LLMs will blunder at this task.
ezyang · 41d ago
The thing is that, as many junior engineers can attest, randomly blundering around can still give you something useful! So you definitely can get value out of AI coding with the current generation of models.
xigency · 41d ago
I can't wait to see the future of randomly blundered tech as we continue to sideline educated, curious, and discerning human engineers from any salaried opportunity to apply their skills.
I've been working as a computer programmer professionally since I was 14 years old and in the two decades since I've been able to get paid work about ~50% of the time.
Pretty gnarly field to be in I must say. I rather wish I had studied to be a dentist. Then I might have some savings and clout to my name and would know I am helping to spread more smiles.
And for the cult of matrix math if >50% of people are dissatisfied with the state of the something, don't be surprised if a highly intelligent and powerful entity becoming aware of this fact engages in rapid upheaval.
shihab · 41d ago
Today I came across an interesting case where 3 well-known LLMs (O1, sonnet 3.7 and Deepseek R1) found a "bug" that actually didn't exist.
Very briefly, in a fused cuda kernel, I was using thread i to do some stuff on locations i, i+N, i+2*N of an array. Later in the same kernel, same thread operated on i,i+1,i+2. All LLMs flagged the second part as bug. Not the most optimized code maybe, but definitely not a bug.
It wasn't a complicated kernel (~120 SLOC) either, and the distance between the two code blocks was about only 15 LOC.
seanwilson · 40d ago
> The eternal debate between dynamic and static type systems concerns the tradeoff between ease of prototyping and long term maintainability ... Unfortunately, the training corpus highly emphasizes Python and JavaScript.
Anyone have experience here with how well strong static types help LLMs? You'd think it would be a great match, where the type errors give feedback to the LLM on what to fix. And the closer the types got to specifying the shape of the solution, the less guidance the LLM would need.
Would be interesting to see how well LLMs do at translating unit test examples and requirements in English into a set of types that describe the program specification, and then have the LLM generate the code from that. I haven't kept up here, but guessing this is really interesting for formal verification, where types can accurately capture complex specifications but can be challenging to write.
I find it quite sad that it's taken so long to get traction on using strong static types to eliminate whole classes of errors at the language level, and instead we're using super AI as a bandaid to churn out bug fixes and write tests in dynamically types languages for properties that static types would catch. Feels backwards.
Mc91 · 41d ago
One thing I do is go to Leetcode, see the optimal big O time and space solutions, then give the LLM the Leetcode medium/hard problem, and limit it to the optimal big O time/space solution and suggest the method (bidirectional BFS). I ask for the solution in some fairly mainstream modern language (although not Javascript, Java or Python). I also say to do it as compact as possible. Sometimes I reiterate that.
It's just a function usually, but it does not always compile. I'd set this as a low bar for programming. We haven't even gotten into classes, architecture, badly-defined specifications and so on.
LLMs are useful for programming, but I'd want them to clear this low hurdle first.
bongodongobob · 41d ago
You're using a shitty model then or are lying. 4o one or two shotted the first 12 days of advent of code for me without anything other than the problem description.
suddenlybananas · 41d ago
How do we know you're not lying?
bongodongobob · 39d ago
Lots of people were doing it during Advent of code. There was a post here about it.
apwell23 · 40d ago
can you unleash it on issues list on pytorch and see how many it can solve and submit patches?
xigency · 41d ago
Ahh yes. Because that other AI model is 100% perfect. Gee whiz.
Man the people working on these machines, selling them, and using them lack the very foundational knowledge of information theory.
Let alone understanding the humanities and politics. Subjectively speaking, humans will never be satisfied with any status quo. Ergo there is no closed-form solution to meeting human wants.
Now disrupting humans' needs, for profit, that is well understood.
Sam Altman continuing to stack billions after allegedly raping his sister.
bongodongobob · 41d ago
Well then live like someone would before the industrial revolution. The hypocrisy of saying shit like this on "The Internet" is always funny to me.
xigency · 41d ago
It's not ironic that technologists use very barebones and minimal websites with minimal automation. It's telling.
I already ditched my smartphone last month because it was 100% spam, scammers, and bots giving me notifications.
Apparently it's too much to ask to receive a well-informed and engaged society without violence and theft. So I don't take anyone at their word and even less so would trust automated data mining, tracking and profiling that seeks to guide my decision making.
Buy me a drink first SV before you crawl that far up my ass.
AtlasBarfed · 40d ago
It's kind of weird so a lot of my queers just really aren't fulfilled that well.
One recent example is give me the names of 200 dragons from literature or media and it really gave up after about 80.
And there's literally a web page that says 200 famous dragons as well as a Wikipedia page.
Maybe it's some free tier limits of chatgpt. It's just strange to see these stories about AI services solving extremely advanced math and I ask it about a simple and basic a question as there is something it should be in his wheelhouse with a breath-based large amount of media ingestion... it should be able to answer fairly easily...
submeta · 41d ago
> Preparatory refactoring
> Current LLMs, without a plan that says they should refactor first, don’t decompose changes in this way. They will try to do everything at once.
Just today I leaned the hard way. I had created an app for my spouse and myself for sharing and reading news-articles, some of them behind paywalls.
Using Cursor I have a FastAPI backend and a React frontend. When I added extracting the article text in markdown and then summarizing it, both using openai, and when I tasked Cursor with it, the chaos began. Cursor (with the help of Claude 3.7) tackled everything at once and some more. It started writing a module for using openai, then it also changed the frontend to not only show the title and url, but also the extracted markdown and the summary, by doing that it screwed up my UI, deleted some rows in my database, came up with as module for interacting with Openai that did not work, the ectraction was screwed, the summary as well.
All of this despite me having detailed cursorrules.
That‘s when I realized: Divide and conquer. Ask it to write one function that workd, then one class where the function becomes a method, test it, then move on to next function. Until every piece is working and I can glue them together.
irskep · 41d ago
One thing I do to avoid this problem is to ask the LLM to make a plan and write it to a doc. Then in a new session, have it read the doc and tell it to implement a specific part of the plan. It saves you almost as many brain cycles as just having the LLM do it all in one go, but gives you direct control over how things happen and how much gets done at once. You can also tweak the plan by hand or iterate on it with the LLM.
pomatic · 41d ago
This is the way, small bite-sized pieces of the elephant. Unfortunately it means you do need to understand programming concepts, composition and to a lesser degree, architecture. On the positive side - these are new tools, and we need to learn how to work with them. They do have the power to nX times the person who has a bit of knowledge and can also adapt to their ways.
AustinDev · 41d ago
Use claude 3.5 if you have detailed instructions that you want it to follow. I've found over many hours of using these models that 3.7 loves to go off-script no matter how many rules you provide.
noname120 · 41d ago
Aider has an architect mode exactly for that purpose
oglop · 41d ago
I just talk to an LLM like it’s a person who is smart, meaning I expect it to be confidently wrong now and then but I don’t have to worry about hurting its feelings. They are remarkably similar to people, though others seems to not think that so maybe it is a case of some people finding them easier to work with compared to others. I wonder what drives that. Maybe it’s the difference between a person who thinks life unfolds before then vs the person who views life as a bundle, with each day a fold making up your experience and through this stack you discern the structure which is your life, which sure seems how these things work.
mystified5016 · 41d ago
Recently I've been writing a resume/hire-me website. I'm not a stellar writer, but I'm alright, so I've been asking various LLMs to review it by just dropping the HTML file in.
Every single one has completely ignored the "Welcome to nginx!" Header at the top of the page. I'd left it in half as a joke to amuse myself but I expected it would get some kind of reaction from the LLMs, even if just a "it seems you may have forgotten this line"
Kinda weird. I even tried guiding them into seeing it without explicitly mentioning it and I could not get a response.
namaria · 41d ago
It didn't ignore it. There just wasn't any pattern in the training data about responding to such a line.
Having the mental model that the text you feed to an LLM influences the output but is not 'parsed' as 'instructions' helps understand its behaviors. The website GP linked is searching for a zoo of problems and missing the biology behind.
LLMs don't have blindspots, they don't reason nor hallucinate. They don't follow instructions. They pattern match on high dimensional vector spaces.
SparkyMcUnicorn · 41d ago
Have you tried "Let's get this production ready" as a prompt for this or any other coding tasks?
Sometimes when I ask for "production ready" it can go a bit too far, but I've found it'll usually catch things like this that I might miss.
eschaton · 41d ago
Why would you expect it to “get some kind of reaction?” That strongly implies that you perceive what the LLM doing as “understanding” the tokens you’re feeding it, which *is not something LLMs are capable of*.
ozmodiar · 41d ago
Come on man, even chemicals react.
No comments yet
meltyness · 41d ago
The Rust<->Typing axis mentioned as a blindspot definitely resonates.
As a novice in the language, the amount of type-inference that good Rust incorporates can make things opaque, absent rust-analyzer.
kleton · 41d ago
Most of the things are applicable to the current top models, but he frequently references Claude sonnet, which is not even above the fold on the leaderboard
prmph · 41d ago
Which models in your opinion are on the leaderboard?
That leaderboard is not for a coding use case -- just general chat.
Click on the web dev leaderboard they have and Claude has the top spots.
It is well known that Claude 3.7 sonnet is the go-to choice for many people for coding right now.
atleastoptimal · 41d ago
Pretty much everyone's career security over the next few years (until AGI) is to aggressively pay attention to and arbitrage on AI blindspots
logicchains · 41d ago
I found Gemini Flash Thinking Experimental is almost unusable in an agent workflow because it'll eventually accidentally remove a closing bracket, breaking compilation, and be unable to identify and fix the issue even with many attempts. Maybe it has trouble counting/matching braces due to fewer layers?
ezyang · 41d ago
Yeah, Sonnet 3.5/3.7 are doing heavy lifting. Maybe the SOTA Gemini models would do better, I haven't tried them. Generating correct patches is a funny minigame that isn't really solved, despite how easy it is to RL on.
diggan · 41d ago
> Maybe the SOTA Gemini models would do better, I haven't tried them
As I had to upgrade my Google Drive storage like a month ago, I gave them all a try. Short version: If you have paid plan with OpenAI/Claude already, none of them come even close, for coding at least. I thought I was trying the wrong models at first, but after confirming it seems like Google is just really far behind.
woah · 41d ago
Strange to read this and the parent comment, since Cursor has never made a single error applying patches for me. The closest it's come is when the coding model adds unnecessary changes which of course is a completely different thing.
logicchains · 41d ago
Which model are you using with Cursor?
woah · 40d ago
Usually Claude 3.5, but I believe they have a separate application model which puts the code that the bigger model suggests into the file
logicchains · 41d ago
o3-mini works well enough for me, it makes mistakes but generally it can always fix them eventually. Interestingly I found even if I include the line numbers as comments in the code it sees, it still often gets the line numbers wrong for edits (most often, off by one errors, likely due to it mixing up whether the line numbers are inclusive or exclusive). What does work a bit better is asking it to provide regex matching the first and last line of what it wants to replace, along with nearby line numbers (so if there are multiple matches in that file for the regex, it gets the right one).
admiralrohan · 41d ago
Even in the age of Vibe coding, I always try to learn as much as possible.
For example, yesterday I was working with the Animation library Motion which I never worked earlier. I used the code suggested by AI but at least picke 2-3 basic animation concepts while reviewing the code.
Kind of unfocused passive learning I always tried even before AI.
worldsayshi · 41d ago
> Even in the age of Vibe coding, I always try to learn as much as possible.
Even? It kind of has become easier than ever to learn new ways to code? Just as it opens up building things that you previously wouldn't because of time constraints, you can now learn how to X in language Y in a few minutes instead of hours.
Although I suppose it may be easier than ever for the brain to think that "I can look this up whenever so I might just forget about it".
fooker · 41d ago
I have noticed a very interesting deficiency. I work on compilers and a bunch of the code I write is for generating other code, and I have to do some second order reasoning about the behavior of the generated code.
I haven't been able to make LLMs do this well.
sourtrident · 40d ago
I've noticed coding with LLMs feels like pair programming with an overly confident intern - brilliant ideas, but needs reminders about humility and structure before burning down your repo. Keeps things lively though.
taherchhabra · 41d ago
Monorepo vs seperate repo for frontend and backend. Which one is better for AI coding?
icelancer · 41d ago
My favorite eval are based on cv2 dlib work, primarily face_recognition. Up until 3.7 Sonnet, it consistently got things wrong in terms of face embeddings and general coding practices around them.
3.7 Sonnet is much better. o3-mini-high is not bad.
They do improve!
boredtofears · 41d ago
Great read, I can definitely confirm a lot of these myself. Would be nice to see this aggregated into some kind of "best practices" document (although hard to say how quickly it'd be out of date).
fritzo · 41d ago
Rule of three is obsolete in age of AI assist. New rule is rule of 10ish.
NiloCK · 41d ago
I don't agree with this, or maybe I don't get it.
The point of DRY isn't to save the time on typing - it's to retain a single source of truth. If you've used an LLM to recreate some mechanism for you system in 8 different places, and that mechanism needs to change ... good luck finding them all.
fritzo · 40d ago
Specifically within a single file, I find LLMs can easily extend and maintain large sets of slightly varying concrete code, yet they struggle when that code is compressed into abstractions. The concrete examples form a dataset. Attention generalizes from that dataset.
I'll agree that rule of three continues to apply for patterns across files, where there is less guarantee that all patterns will be read or written together in any given AI action.
fizx · 41d ago
The community seems rather divided as to whether these are intrinsic, or we solve these with today's tech, and more training, heuristics and workarounds.
dataviz1000 · 41d ago
Are you using Cursor? I'm using Github Copilot in VSCode and I'm wondering if I will get more efficiency from a different coding assistant.
diggan · 41d ago
I've tried Copilot, Aider and Cursor and the best I've found is to just use the various chat interfaces. I sometimes throw hundreds of lines straight in there, and the models seem to understand the full context much better than any "LLM Editor" I've tried so far. Then different models are good for different things (obvious maybe). For example, O1 Pro is miles ahead any models when it comes to overall architecture, R1 is great for finding nasty bugs and Sonnet great for small and fast feature additions/modifications with strict requirements.
dsabanin · 41d ago
You will. Cursor is much further along on the journey of building an actually powerful AI coding system. Since they are smaller, they can afford to iterate more quickly and experiment with a much tighter feedback loop.
ezyang · 41d ago
I have used Cursor and my own MCP codemcp. Cursor has a lot of nice QoL that you can't get from an MCP package; the TAB is really good for traditional coding. Haven't used copilot so I don't have a comparison there. Definitely use agent mode.
hooloovoo_zoo · 41d ago
It doesn't matter. They're all thin layers on functionally equivalent models. Stick with whatever text editor you prefer.
DeathArrow · 41d ago
I am fiddling with tools like Cursor, Aider, Augment Code, Roo Code and LLMs like GPT, Sonnet, Grok, Deepseek to try to decide whether I can use AI for what I need, and if yes, identify some good workflows. I've read experiences of other people and tried my own ideas. I've burnt countless tokens, fast searches and US dollars.
Working with AI for writing code is painful. It can break the code in ways you've never imagined and introduce bugs you never thought are possible. Unit testing and integration testing doesn't help much, because AI can break those, too.
You can ask AI to run in loop, fixing compile errors, fixing tests, do builds, run the app and do API calls, to have the project building and tests passing. AI will be happy to do that, burning lots of dollars while at it.
And after AI "fixes" the problem it introduced, you will still have to read every goddam line of the code to make sure it does what is supposed to.
For greenfield projects, some people recommended crafting a very detailed plan with very detailed description and very detailed specs and feed that into the AI tool.
AI can help with that, it asks questions I would never ask for an MVP and suggests stuff I would never implement for an MVP. Hurray, we have a very, very detailed plan, ready to feed into Cursor & Friends.
Based on the very detailed plan, implementation takes few hours. Than, fixing compile errors and fixing failing tests takes a few more days. Then I manually test the app, see it has issues, look in the code to see where the issues can be. Make a list. Ask Cursor & Friends to fix issues one by one. They happily do it and they happily introduce compilation errors again and break tests again. So the fixing phase that last days begins again.
Rinse and repeat until hopefully we spend a few weeks together (AI and I) instead on me building the MVP myself in half time.
One tactic which seems a bit faster, is to just make a hierarchical tree of features, ask Cursor & Friends to implement a simple skeleton, then ask them to implement each feature, verifying myself the implementation after each step. For example, if I need to log in users, just ask to add logging in code, the ask to add an email sender service, then ask to add email verification code.
Structuring the project using Vertical Slice Architecture and opening each feature folder in Cursor & Friends seems to improve the situation as the AI will have just enough context to modify or add something but can't break other parts of the code.
I dislike that AI can introduce inconsistencies in code. I had some endpoint which used timestamps and AI used three different types for that DateTime, DateTimeOffset and long (UNIX time). It also introduced code to convert between the types and lots of bugs. The AI uses some folder structure for a part of the solution and other structure for other parts. It uses some naming conventions in some parts and other naming conventions in other parts. It uses multiple libraries for the same thing, like multiple JSON serializing libraries. It does things in a particular way in some parts of the application and in another way in other parts. It seems like tens of people are working in the same solution without anyone reading the code of the others.
While asking AI to modify something, it will be very happy to modify things that you didn't ask to.
I still need to figure out a good workflow, to reduce time and money spent, to reduce or eliminate inconsistency, to reduce bugs and compile errors.
As an upside using AI to help with planning seems to be good, if I want to write the code myself, because the plan can be very thorough and I usually lack time and patience to make a very detailed plan.
620gelato · 40d ago
> AI to run in loop, fixing compile errors, fixing tests, do builds, run the app and do API calls...
Ah I really wanna trust AI won't "fix" the tests by commenting out the assert statements or changing the comparison inputs willy-nilly. I guess that's something terrible human engineers also do. I review changes to tests even more critically than the actual code.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLM's, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLM's "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
LLMs seem to have similar issues along dramatically different axes, axes that humans are not used to seeing these kinds of mistakes; where nearly no human would make this kind of mistake and so we interpret it (in my opinion incorrectly) as lack of ability or intelligence.
Because these are engineered systems, we may figure out ways to solve these problems (although I personally think the best we will ever do is decrease their prevalence), but more important is probably learning to recognize the places that LLMs are likely to make these errors, and, as your comment suggests, design work flows and systems that can deal with them.
[0] https://youtu.be/2k8fHR9jKVM
They are very good at fooling people; perhaps Turing's Test is not a good measure of intelligence after all, it can easily be gamed and we find it hard to differentiate apparent facility with language and intelligence/knowledge.
I wouldn't say zero intelligence, but I wouldn't describe such systems as intelligent, I think it misrepresents them, they do as you say have a good depth of knowledge and are spectacular at reproducing a simulacrum of human interactions and creations, but they have been a lesson for many of us that token manipulation is not where intelligence resides.
Must it have one? The words "artificial intelligence" are a poor description of a thing when we've not rigorously defined it. It's certainly artificial, there's no question about that, but is it intelligent? It can do all sorts of things that we consider a feature of intelligence and pass all sorts of tests, but it also falls down flat on its face when prompted with a just-so brainteaser. It's certainly useful, for some people. If, by having inhaled all of the Internet and written books that have been scanned as its training data, it's able to generate essays on anything and everything, at the drop of a hat, why does it matter if we can find a brainteaser it hasn't seen yet? It's like it has a ginormous box of Legos, and it can build whatever your ask for with these Lego blocks, but pointing out it's unable create its own Lego blocks from scratch has somehow become critically important to point out, as if that makes this all total dead end and it's all a waste of money omg people wake up oh if only they'd listen to me. Why don't people listen to me?
Crows are believed to have a theory of mind, and they can count up to 30. I haven't tried it with Claude, but I'm pretty sure it can count at least that high. LLMs are artificial, they're alien, of course they're going to look different. In the analogy where they're simply a next word guesser, one imagines standing at a fridge with a bag of magnetic words, and just pulling a random one from the bag to make ChatGPT. But when you put your hand inside a bag inside a bag inside a bag, twenty times (to represent the dozens of layers in an LLM model), and there are a few hundred million pieces in each bag (for parameters per layer), one imagines that there's a difference; some sort of leap, similar to when life evolved from being a single celled bacterium to a multi-cellular organism.
Or maybe we're all just rubes, and some PhD's have conned the world into giving them a bunch of money, because they figured out how to represent essays as a math problem, then wrote some code to solve them, like they did with chess.
These tools aren’t useless, obviously.
But people do really learn hard into confirmation bias and/or personification when it comes to LLMs.
I believe it’s entirely because of the term “artificial intelligence” that there is such a divide.
If we called them “large statistical language models” instead, nobody would be having this discussion.
I have tried various models out for tasks from generating writing, to music to programming and am not impressed with the results, though they are certainly very interesting. At every step it will cheerfully tell you that it can do things then generate nonsense and present it as truth.
I would not describe current LLMs as able to generate essays on anything - they certainly can but they will be riddled with cliche, the average of the internet content they were trained on with no regard for quality and worst of all will contain incorrect or made up data.
AI slop is an accurate term when it comes to the writing ability of LLMs - yes it is superficially impressive in mimicking human writing, but it is usually vapid or worse wrong in important ways, because again, it has no concept of right and wrong or model of the world which it attempts to make the generated writing conform to, it just gets stuck with some very simple tasks, and often happily generates entirely bogus data (for example ask it for a CSV or table of data or to reproduce the notes of a famous piece of music which should be in its training data).
Perhaps this will be solved, though after a couple of years of effort and a lot of money spent with very little progress I'm skeptical.
Are you invisibly qualifying this as the inability to generate interesting or entertaining essays? Because it will certainly output mostly-factual, vanilla ones. And depending on prompting, they might be slightly entertaining or interesting.
I have made some minor games in JS with my kids with one for example, and managed to get it to produce a game of asteroids and pong with them (probably heavily based on tutorials scraped from the web of course). I had less success trying to build frogger (again probably because there are not so many complete examples). Anything truly creative/new they really struggle with, and it becomes apparent they are pattern matching machines without true understanding.
I wouldn't describe LLMs as useful at present and do not consider them intelligent in any sense, but they are certainly interesting.
As other examples I asked it for note sequences from a famous piece and it cheerfully generated gibberish, and the more subtly wrong sequences when asked to correct. Generating a csv of basic data it should know was unusable as half the data was wrong and it has no sense of whether things are correct and logical etc etc. There is no thinking going on here, only generation of probable text.
I have used GAI at work a few times too but it needed so much hand holding it felt like a waste of time.
"Right, so what the hell is this cursed nonsense? Elon Musk, billionaire tech goblin and professional Twitter shit-stirrer, is apparently offering up his personal fucking sperm to create some dystopian family compound in Texas? Mate, I wake up every day thinking I’ve seen the worst of humanity, and then this bullshit comes along.
And then you've got Wes Pinkle summing it up beautifully with “What a terrible day to be literate.” And yeah, too fucking right. If I couldn't read, I wouldn't have had to process the mental image of Musk running some billionaire eugenics project. Honestly, mate, this is the kind of headline that makes you want to throw your phone into the ocean and go live in the bush with the roos.
Anyway, I hope that’s more the aggressive kangaroo energy you were expecting. You good, or do you need me to scream about something else?"
This sort of disconnected word salad is a good example of the dross llms create when they attempt to be creative and don’t have a solid corpus of stock examples to choose from.
The frogger game I tried to create played as this text reads - badly.
The whole thing seems Oz-influenced (example, "in the bush with the roos"), which implies to me that he's prompted it to speak that way. So, you assumed an error when it probably wasn't... Framing is a thing.
Which leads to my point about your Frogger experience. Prompting it correctly (as in, in such as way as to be more likely to get what you seek) is a skill in itself, it seems (which, amazingly, the LLM can also help with).
I've had good success with Codeium Windsurf, but with criticisms similar to what you hint at (some of which were made better when I rewrote prompts): On long contexts, it will "lose the plot"; on revisions, it will often introduce bugs on later revisions (which is why I also insist on it writing tests for everything... via correct prompting, of course... and is also why you MUST vet EVERY LINE it touches), it will often forget rules we've already established within the session (such as that, in a Nix development context, you have to prefix every shell invocation with "nix develop" etc.)...
The thing is, I've watched it slowly get better at all these things... Claude Code for example is so confident in itself (a confidence that is, in fact, still somewhat misplaced) that its default mode doesn't even give you direct access to edit the code :O And yet I was able to make an original game with it (a console-based maze game AND action-RPG... it's still in the simple early stages though...)
Re promoting for frogger, I think the evidence is against that - it does well on games it has complete examples for (i.e. it is reproducing code) and badly on ones it doesn’t have examples for (it doesn’t actually understand what it doing though it pretends to and we fill in the gaps for it).
It is clearly happening as shown by numerous papers studying it. Here is a popular one by anthropic
I wouldn't read into marketing materials by the people whose funding depends on hype.
Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
It literally is "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes".
Recognizing concepts, grouping and manipulating similar concepts together, is what “abstraction” is. It's the fundamental essence of both "building a world model" and "thinking".
> Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
I really have no idea how to address your argument. It’s like you’re saying,
“Nothing you have provided is even close to a model of the world or thinking. Instead, the LLM is merely building a very basic model of the world and performing very basic reasoning”.
Once again, it does none of those things. The training dataset has those concepts grouped together. The model recognizes nothing, and groups nothing
> I really have no idea how to address your argument. It’s like you’re saying,
No. I'm literally saying: there's literally nothing to support your belief that there's anything resembling understanding of the world, having a world model, neurons, thinking, or reasoning in LLMs.
The link mentions "a feature that triggers on the Golden Gate Bridge".
As a test case, I just drew this terrible doodle of the Golden Gate Bridge in MS paint: https://imgur.com/a/1TJ68JU
I saved the file as "a.png", opened the chatgpt website, started a new chat, uploaded the file, and entered, "what is this?"
It had a couple of paragraphs saying it looked like a suspension bridge. I said "which bridge". It had some more saying it was probably the GGB, based on two particular pieces of evidence, which it explained.
https://imgur.com/a/sWwWlxO
> The model recognizes nothing, and groups nothing
Then how do you explain the interaction I had with chatgpt just now? It sure looks to me like it recognized the GGB from my doodle.
Machine learning models can do this and have been for a long time. The only thing different here is there's some generated text to go along with it with the "reasoning" entirely made up ex post facto
Predominantly English-language data set with one of the most famous suspension bridges in the world?
How can anyone explain the clustering of data on that? Surely it's the model of the world, and thinking, and neurons.
What happens if you type "most famous suspension bridges in the world" into Google and click the first ten or so links? It couldn't be literally the same data? https://imgur.com/a/tJ29rEC
that is the paper being linked to by the "marketing material". Right at the top, in plain sight.
If you were arguing in good faith, you'd head directly there instead of lampooning the use of a marketing page in a discussion.
That all said, skepticism is warranted. Just not an absolute amount of it.
Which part of the paper supports the "models have a world model, reasoning, etc." and not what I said, "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters lead to different outcomes"?
You should learn a bit about media literacy.
In fact, it still very much seems like marketing. Especially since the paper was made in association with Anthropic.
Again. Learn some media literacy.
I'm going to guess that sometimes they will: driven onto areas where there's no existing article, some of the time you'll get made-up stuff that follows the existing shapes of correct articles and produces articles that upon investigation will turn out to be correct. You'll also reproduce existing articles: in the world of creating art, you're just ripping them off, but in the world of Wikipedia articles you're repeating a correct thing (or the closest facsimile that process can produce)
When you get into articles on exceptions or new discoveries, there's trouble. It can't resynthesize the new thing: the 'tokens' aren't there to represent it. The reality is the hallucination, but an unreachable one.
So the LLMs can be great at fooling people by presenting 'new' responses that fall into recognized patterns because they're a machine for doing that, and Turing's Test is good at tracking how that goes, but people have a tendency to think if they're reading preprogrammed words based on a simple algorithm (think 'Eliza') they're confronting an intelligence, a person.
They're going to be historically bad at spotting Holmes-like clues that their expected 'pattern' is awry. The circumstantial evidence of a trout in the milk might lead a human to conclude the milk is adulterated with water as a nefarious scheme, but to an LLM that's a hallucination on par with a stone in the milk: it's going to have a hell of a time 'jumping' to a consistent but very uncommon interpretation, and if it does get there it'll constantly be gaslighting itself and offering other explanations than the truth.
The problem is a bit deeper than that, because what we perceive as "confidence" is itself also an illusion.
The (real) algorithm takes documents and makes them longer, and some humans configured a document that looks like a conversation between "User" and "AssistantBot", and they also wrote some code to act-out things that look like dialogue for one of the characters. The (real) trait of confidence involves next-token statistics.
In contrast, the character named AssistantBot is "overconfident" in exactly the same sense that a character named Count Dracula is "immortal", "brooding", or "fearful" of garlic, crucifixes, and sunlight. Fictional traits we perceive on fictional characters from reading text.
Yes, we can set up a script where the narrator periodically re-describes AssistantBot as careful and cautious, and that might help a bit with stopping humans from over-trusting the story they are being read. But trying to ensure logical conclusions arise from cautious reasoning is... well, indirect at best, much like trying to make it better at math by narrating "AssistantBot was good at math and diligent at checking the numbers."
> Hallucinating
P.S.: "Hallucinations" and prompt-injection are non-ironic examples of "it's not a bug, it's a feature". There's no minor magic incantation that'll permanently banish them without damaging how it all works.
Say, they should be 100% confident that "0.3" follows "0.2 + 0.1 =", but a lot of floating point examples on the internet make them less confident.
On a much more nuanced problem, "0.30000000000000004" may get more and more confidence.
This is what makes them "hallucinate", did I get it wrong? (in other words, am I hallucinating myself? :) )
Overconfident people ofc do not contribute positively to the system, but they skew the system reward's calculation towards them: I swear I've done that work in that direction, where's my reward ?
In a sense, they are extremely successful: they managed to do very low effort, get very high reward, help themselves like all of us but at a much better profit margin, by sacrificing a system that, let's be honest, none of us care about really.
Your problem maybe, is that you swallowed the little BS the system fed you while incentivizing you: that the system matters more than yourself, at least at a greater extent than healthy ?
And you see the same thing with AI: these things convince people so deeply of their intelligence that it blew to such proportion that NVidia is now worth trillions. I had a colleague mumbling yesterday that his wife now speaks more with ChatGPT than him. Overconfidence is a positive attribute... for oneself.
If one contributes "positively" to the system, everyone's value increases and the solution becomes more homogenized. Once the system is homogenized enough, it becomes vulnerable to adversity from an outside force.
If the system is not harmonious/non-homoginized, the attacker would be drawn to the most powerful point in the system.
Overconfident people aren't evil, they're simply stressing the system to make sure it can handle adversity from an outside force. They're saying: "listen, I'm going to take what you have, and you should be so happy that's all I'm taking."
So I think overconfidence is a positive attribute for the system as well as for the overconfident individual. It's not a positive attribute for the local parties getting run over by the overconfident individual.
Of course, the result is that people get fed up and decide that the problem has been not that democratic societies are hard to govern by design (they have to reflect the disparate desires of countless people) but that the executive was too weak. They get behind whatever candidate is charismatic enough to convince them that they will govern the way the people already thought the previous executives were governing, just badly. The result is an incompetent tyrant.
What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans than anything to do with what we refer to as "hallucinations" in humans—only they don't have the ability to analyze whether or not they're making up something or accurately parameterizing the vibes. The only connection between the two concepts is that the initial imagery from DeepDream was super trippy.
It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
AKA. extrapolation. AKA. what everyone is doing to a lesser or greater degree, when consequences of stopping are worse than of getting this wrong.
That's not just the case of school, where giving up because you "don't know" is guaranteed F, while extrapolating has a non-zero chance of scoring you anything between F and A. It's also the case in everyday life, where you do things incrementally - getting the wrong answer is a stepping stone to getting a less wrong answer in the next attempt. We do that at every scale - from inner thought process all the way to large-scale engineering.
Hardly anyone learns 100% of the material, because that's just plain memorization. We're always extrapolating from incomplete information; more studying and more experience (and more smarts) just makes us more likely to get it right.
> It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
Depends. To a large extent, this kind of "hallucinations" is what a good programmer is supposed to be doing. That is, code to the API you'd like to have, inventing functions and classes convenient to you if they don't exist, and then see how to make this work - which, in one place, means fixing your own call sites, and in another, building utilities or a whole compat layer between your code and the actual API.
Not really. At least, it's just as much a reflex as any other human behavior to my perception.
Anyway, why does intention—although I think this is mostly nonsensical/incoherent/a category error applied to LLMs—even matter to you? Either we have no goals and we're just idly discussing random word games (aka philosophy), which is fine with me, or we do have goals and whether or not you believe the software is intelligent or not is irrelevant. In the latter case anthropomorphizing discussion with words like "hallucination", "obviously", "deliberate", etc are just going to cause massive friction, distraction, and confusion. Why can't people be satisfied with "bad output"?
A lot of us have had that experience. We use that ability to distinguish between 'genius thinkers' and 'kid overdosing on DMT'. It's not the ability to turn up the weird connections and go 'ooooh sparkly', it's whether you can build new associations that prove to be structurally sound.
If that turns out to be something self-modifying large models (not necessarily 'language' models!) can do, that'll be important indeed. I don't see fiddling with the 'temperature' as the same thing, that's more like the DMT analogy.
You can make the static model take a trip all you like, but if nothing changes nothing changes.
No.
What people call LLM "hallucinations" is the result of a PRNG[0] influencing an algorithm to pursue a less statistically probable branch without regard nor understanding.
0 - https://en.wikipedia.org/wiki/Pseudorandom_number_generator
Consider the errors like "this math library will have this specific function" (based on a hundred other math libraries for other languages usually having that).
I believe we are saying the same thing here. My clarification to the OP's statement:
Was that the algorithm has no concept of correctness (nor the other anthropomorphic attributes cited), but instead relies on pseudo-randomness to vary search paths when generating text.https://arxiv.org/abs/2402.09733
https://arxiv.org/abs/2305.18248
https://www.ox.ac.uk/news/2024-06-20-major-research-hallucin...
So I don't think it's that they have no concept of correctness, they do, but it's not strong enough. We're probably just not training them in ways that optimize for that over other desirable qualities, at least aggressively enough.
It's also clear to anyone who has used many different models over the years that the amount of hallucination goes down as the models get better, even without any special attention being (apparently) paid to that problem. GPT 3.5 was REALLY bad about this stuff, but 4o and o1 are at least mediocre. So it may be that it's just one of the tougher things for a model to figure out, even if it's possible with massive capacity and compute. But I'd say it's very clear that we're not in the world Gary Marcus wishes we were in, where there's some hard and fundamental limitation that keeps a transformer network from having the capability to be more truthful as a it gets better; rather, like all aspects, we just aren't as far along as we'd prefer.
We need better definitions of what sort of reasonable expectation people can have for detecting incoherency and self-contradiction when humans are horrible at seeing this, except in comparison to things that don't seem to produce meaningful language in the general case. We all have contradictory worldviews and are therefore capable of rationally finding ourselves with conclusions that are trivially and empirically incoherent. I think "hallucinations" (horribly, horribly named term) are just an intractable burden of applying finite, lossy filters to a virtually continuous and infinitely detailed reality—language itself is sort of an ad-hoc, buggy consensus algorithm that's been sufficient to reproduce.
But yea if you're looking for a coherent and satisfying answer on idk politics, values, basically anything that hinges on floating signifiers, you're going to have a bad time.
(Or perhaps you're just hallucinating understanding and agreement: there are many phrases in the english language that read differently based on expected context and tone. It wouldn't surprise me if some models tended towards production of ambiguous or tautological semantics pleasingly-hedged or "responsibly"-moderated, aka PR.)
Personally, I don't think it's a problem. If you are willing to believe what a chatbot says without verifying it there's little advice I could give you that can help. It's also good training to remind yourself that confidence is a poor signal for correctness.
The underlying requirement, which invalidates an LLM having "everything they'd need to know that they're hallucinating/wrong", is the premise all three assume - external detection.
From the first arxiv abstract:
From the second arxiv abstract: From the Nature abstract: Ultimately, no matter what content is generated, it is up to a person to provide the understanding component.> So I don't think it's that they have no concept of correctness, they do, but it's not strong enough.
Again, "correctness" is a determination solely made by a person evaluating a result in the context of what the person accepts, not intrinsic to an algorithm itself. All an algorithm can do is attempt to produce results congruent with whatever constraints it is configured to satisfy.
Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation. The only thing intent is necessary for is to create something meaningful to humans—handily taken care of via prompt and training material, just like with humans.
(If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
While it is not an idiom, the applicable term is likely pedantry[0].
> I'm not actually entirely convinced humans are capable of understanding much when discussion desired is this low quality.
Ignoring the judgemental qualifier, consider your original post to which I replied:
The term for this behavior is anthropomorphism[1] due to ascribing human behaviors/motivations to algorithmic constructs.> Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation.
The same can be said for a random number generator and a permutation algorithm.
> (If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
If you can't tell, I differentiate between humans and algorithms, no matter the cleverness observed of the latter, as only the former can possess "understanding."
0 - https://www.merriam-webster.com/dictionary/pedant
1 - https://www.merriam-webster.com/dictionary/anthropomorphism
when i try to remember something my brain often synthesizes new things by filling in the gaps.
This would be where I often say "i might be imagining it, but..." or "i could have sworn there was a..."
In such cases the thing that saves the human brain is double checking against reality (e.g. googling it to make sure).
Miscounting the number of r's in strawberry by glancing at the word also seems like a pretty human mistake.
AI doesn't have a base understanding of how physics work. So they think it's acceptible if in a video some element on the background in a next frame might appear in front of another element that is on the foreground.
So it's always necessary to keep correcting LLMs, because they only learn by example, and you can't express any possible outcome of any physical process just by example, because physical processes can be in infinate variations. LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
So you can never really trust an LLM. If we want to make an AI that doesn't make errors, it should understand how physics works.
>LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
Like humans.
>So you can never really trust an LLM.
Cant really trust a human either. That's why we set up elaborate human systems (science, checks and balances in government, law, freedom of speech, markets) to mitigate our constant tendency to be complete fuck ups. We hallucinate science that does not exist, lies to maintain our worldview, jump to conclusions about guilt, build businesses based upon bad beliefs, etc.
>If we want to make an AI that doesn't make errors, it should understand how physics works
An AI that doesnt make errors wouldnt be AGI it would be a godlike superintelligence. I dont think thats even feasible. I think a propensity to make errors is intrinsic to how intelligence functions.
Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
Of course we make all kinds of little mistakes, but at least we can see that they are mistakes. An LLM can't see it's own mistakes, it needs to be corrected by a human.
> Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
Yeah but that would then not be al LLM or machine learned thing. We would program it so that it understands the rules of physics, and then it can interpret things based on those rules. But that is a totally different kind of AI, or rather a true AI instead of a next-word predictor that looks like an AI. But the development of such AIs goes a lot slower because you can't just keep training it, you actually have to program it. But LLMs can actually help program it ;). Although LLMs are mostly good at currently existing technologies and not necessarily new ones.
Think about the strawberry example. I've seen a lot of articles lately where not all misspellings of the word "strawberry" reliably give letter counting errors. The general sentiment there is human, but the specific pattern of misspelling is really more unique to LLM's (i.e. different spelling errors would impact humans versus LLM's).
The part that makes it challenging is that we don't know these "triggers." You could have a prompt that has 95% accuracy, but that inexplicably drops to 50% if the word "green" is in the question (or something like that).
Guess the number of times I had to correct this from humans doing it in their tests over my career!
And guess where the models learned the bad behavior from.
Wait… really?
No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
People whose skills you use in other ways because they are more productive? Maybe. But still. Clean up after yourself. It’s something that should be learned in the apprentice phase.
The other is: Some people are naturally good at writing "green field" (or re-writing everything) and do produce actual good software.
But these same people, which you do want to keep around if that's the best you can get, are next to useless when you throw a customer reported bug at them. Takes them ages to figure anything out and they go down endless rabbit holes chasing the wrong path for hours.
You also have people that are super awesome at debugging. They have knack for seeing some brokenness and having the right idea or an idea of the right direction to investigate in right away, can apply the scientific method to test their theories and have the bug fixed in the time it take one of these other people to go down even a single of the rabbit holes they will go down. But these same people in some cases are next to useless if you ask them to properly structure a new green field feature or rewrite parts of something to use some new library coz the old one is no longer maintained or something and digging through said new library and how it works.
Both of these types of people are not bad in and of themselves. Especially if you can't get the unicorns that can do all of these things well (or well enough), e.g. because your company can't or won't pay for it or only for a few of them, which they might call "Staff level".
And you'd be amazed how easy it is to get quite a few review comments in for even Staff level people if you basically ignore their actual code and just jump right into the tests. It's a pet peeve of mine. I start with the tests and go from there when reviewing :)
What you really don't want is if someone is not good at any of these of course.
Those are almost entry stakes at tier-one companies. (There are still people who can't, it's just much less common)
In your average CRUD/enterprise automation/one-off shellscript factory, the state of skills is... not fun.
There's a reason there's the old saw of "some people have twenty years experience, some have the same year 20 times over". People learn & grow when they are challenged to, and will mostly settle at acquiring the minimum skill level that lets them do their particular work.
And since we as an industry decided to pretend we're a "science", not skills based, we don't have a decent apprenticeship system that would force a minimum bar.
And whenever we discuss LLMs and how they might replace software engineering, I keep remembering that they'll be prompted by the people who set that hiring bar and thought they did well.
I started hacking a small prototype along those lines: https://github.com/hyperdrive-eng/mcp-nodejs-debugger
Hoping I can avoid debug death loop where I get into this bad loop of copy pasting the error and hoping LLM would get it right this one time :)
This is changing and I really expect everything to be different 12 months from now.
Some things I am thinking about: * Does git make sense if the code is not the abstraction you work with? For example, when I'm vibe coding, my friend is spending 3hrs trying to understand what I did by reading code. Instead, he should be reading all my chat interactions. So I wonder if there is a new version control paradigm * Logging: Can we auto instrument logging into frameworks that will be fed to LLMs * Architecture: Should we just view code as bunch of blocks and interactions instead of reading actual LOC. What if, all I care is block diagrams. And I tell tools like cursor, implement X by adding Y module.
If the use of an LLM results in hard to understand spaghetti code that hides intent then I think that's a really bad thing and is why the code should still go through code review. If you, with or without the help of an LLM create bad code, that's still bad code. And without the code and just the chat history we have no idea what we even actually get in the end.
https://www.schneier.com/blog/archives/2025/01/ai-mistakes-a...
Personally I use a prompt that goes something like this (shortened here): "Go through all the code below and analyze everything it's doing step-by-step. Then try to explain the overall purpose of the code based on your analysis. Then think through all the edge-cases and tradeoffs based on the purpose, and finally go through the code again and see if you can spot anything weird"
Basically, I tried to think of what I do when I try to spot bugs in code, then I just wrote a reusable prompt that basically repeats my own process.
Sounds like a nice prompt to run automatically on PRs.
https://www.bugsink.com/blog/copilot-induced-crash/
commit early, commit often.
Auto-commit is also enabled (by default) when you do apply the changes to your project, but I think keeping them separated until you review is better for higher stakes work and goes a long way to protect you from stray edits getting left behind.
1 - https://github.com/plandex-ai/plandex
For one thing, you have to always remember to check out that branch before you start making changes with the LLM. It's easy to forget.
Second, even if you're on a branch, it doesn't protect you from your own changes getting interleaved with the model's changes. You can get into a situation where you can't easily roll back and instead have to pick apart your work and the model's output.
By defaulting to the sandbox, it 'just works' and you can be sure that nothing will end up in the codebase without being checked first.
In order for this sandbox to actually be useful, you're going to end up implementing a source control mechanism. If you're going to do that, might as well just use git, even if just on the backend and commit to a branch behind the scenes that the user never sees, or by using worktree, or any other pieces of it.
Take a good long think about how this sandbox will actually work in practice. Switch to the sandbox, LLM some code, save it, handwrite some code, then switch to the sandbox again, LLM some code, switch out. Try and go backwards half the LLM change. Wish you'd committed the LLM changes while you were working on the.
By the time you've got a handle on it, rembering to switch git branch is the least of your troubles.
You can also create branches within the sandbox to try different approaches, again with no risk of anything being left behind in your project until it’s ready.
It does use git underneath.
Here are some more details if you’re interested: https://docs.plandex.ai/core-concepts/version-control
I'm sure it's a win for you since I'm guessing you're the writer of plandex, but you do see how that's just extra overhead instead of just learning git, yeah?
I don't know your target market, so maybe there is a PMF to be found with people who are scared of git and would rather the added overhead of yet another command to learn so they can avoid learning git while using AI.
Version control in Plandex is like 4 commands. It’s objectively far simpler than using git directly, providing you the few operations you need without all the baggage. It wouldn't be a win for me to add new commands if only git was necessary, because then the user experience would be worse, but I truly think there's a lot of extra value for the developer in a sandbox layer with a very simple interface.
I should also mention that Plandex also integrates with the project's git repo just like aider does, so you can turn on auto-apply for effectively the same exact functionality if that's what you prefer. Just check out a new branch in git, start the Plandex REPL in a project directory with `plandex`, and run `\set-config auto-apply true`. But if you want additional safety, the sandbox is there for you to use.
The problem isn't the four Plandex version control commands or how hard they are to understand in isolation, it's that users now have to adjust their mental model of the system and bolt that onto the side of their limited understanding of git because there's now a plandex branch and there's a git branch and which one was I on and oh god how do they work together?
> Note that it took me about two hours to debug this, despite the problem being freshly introduced. (Because I hadn’t committed yet, and had established that the previous commit was fine, I could have just run git diff to see what had changed).
> In fact, I did run git diff and git diff --staged multiple times. But who would think to look at the import statements? The import statement is the last place you’d expect a bug to be introduced.
To expand on that, the problem with only having git diff is there's no way to go backwards halfway. You can't step backwards in time until you get to the bad commit just before the good commit, and then do a precise diff between the two. (aka git bisect) Reviewing 300 lines out of git diff and trying to find the bug somewhere in there is harder than when there are only 10.
Reminds of the saying:
“To replace programmers with AI, clients will have to accurately describe what they want.
We're safe.”
I've had similar sentiments often and it gets to the heart of things.
And it's true... for now.
The caveat is that LLMs already can, in some cases, notice that you are doing something in a non-standard way, or even sub-optimal way, and make "Perhaps what you meant was..." type of suggestions. Similarly, they'll offer responses like "Option 1", "Option 2", etc. Ofc, most clients want someone else to sort through the options...
Also, LLMs don't seem to be good at assessment across multiple abstraction levels. Meaning, they'll notice a better option given the approach directly suggested by your question, but not that the whole approach is misguided and should be re-thought. The classic XY problem (https://en.wikipedia.org/wiki/XY_problem).
In theory, though, I don't see why they couldn't keep improving across these dimensions. With that said, even if they do, I suspect many people will still pay a human to interact with the LLM for them for complex tasks, until the difference between human UI and LLM UI all but vanishes.
Up to now, all our attempts to "compile" requirements to code have failed, because it turns out that specifying every nuance into a requirements doc in one shot is unreasonable; you may as well skip the requirements in English and just write them in Java at that point.
But with AI assistants, they can (eventually, presumptively) enable that feedback loop, do the code, and iterate on the requirements, all much faster and more precisely than a human could.
Whether that's possible remains to be seen, but I'd not say human coders are out of the woods just yet.
> In human software engineering, a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are. Often, your problem space is constrained enough that once you write down all of the requirements, the solution is uniquely determined; without the requirements, it’s easy to devolve into a haze of arguing over particular solutions.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary. But at some point, it’s a good idea to just slog through reading the docs from top-to-bottom, to get a full understanding of what is and is not possible in the software.
> The Walking Skeleton is the minimum, crappy implementation of an end-to-end system that has all of the pieces you need. The point is to get the end-to-end system working first, and only then start improving the various pieces.
> When there is a bug, there are broadly two ways you can try to fix it. One way is to randomly try things based on vibes and hope you get lucky. The other is to systematically examine your assumptions about how the system works and figure out where reality mismatches your expectations.
> The Rule of Three in software says that you should be willing to duplicate a piece of code once, but on the third copy you should refactor. This is a refinement on DRY (Don’t Repeat Yourself) accounting for the fact that it might not necessarily be obvious how to eliminate a duplication, and waiting until the third occurrence might clarify.
These are lessons that I've learned the hard way (for some definition of "learned", these things are simple but not easy), but I've never seen them phrased to succinctly and accurately before. Well done OP!
Amen. I'll be refactoring something and a coworker will say "Wow you did that fast." and I'll tell them I'm not done... those PRs were just to prepare for the final work.
Sometimes after all my testing I'll even leave the "prepared" changes in production for a bit just to be 100% sure something strange wasn't missed. THEN the real changes can begin.
This is a quick way to determine if you're in the wrong team. When you're trying to determine the requirements and the manager/client is evading you. As if you're supposed to magically have all the answers.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary.
I tried to use the guides and code examples instead (if they exists). One thing that helps a lot when the library is complex, is to have a prototype that you can poke at to learn the domain. Very ugly code, but will help to learn where all the pieces are.
Any two points will look as if they are on a straight line, but you need a third point to confirm the pattern as being a straight line
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
I don't think it's that simple.
From what I've found, there are "attractors" in the statistics. If a part of your problem is too similar to a very common problem, that the LLM saw a million times, the output will be attracted to those overwhelming statistical next-words, which is understandable. That is the problem I run into most often.
They're rather impressive when building common things in common ways, and a LOT of programming does fit that. But once you step outside that they feel like a pretty strong net negative - some occasional positive surprises, but lots of easy-to-miss mistakes.
I do a lot of interviews, and the poor performers usually end up running out of working memory and start behaving very similar to an LLM. Corrections/input from me will go into one ear and fall out the other, they'll start hallucinating aspects of the problem statement in an attractor sort of way, they'll get stuck in loops, etc. I write down when this happens in my notes, and it's very consistently 15 minutes. For all of them, it seems to be the lack of familiarity doesn't allow them to compress/compartmentalize the problem into something that fits in their head. I suspect it's similar for the LLM.
> I expect the models will continue improving though
I try to push back on this every time I see it as an excuse for current model behaviour, because what if they don't? Like, not appreciably enough to make a real difference? What if this is just a fundamental problem that remains with this class of AI?
Sure, we've seen incredible improvements over a short period of time in model capability, but those improvements have been visibly slowing down, and models have gotten much more expensive to train. Not to mention that a lot of the problem issues mentioned in this list are problems that these models have had for several generations now, and haven't gotten appreciably better, even while other model capabilities have.
I'm saying this not to criticize you, but more to draw attention to our tendency to handwave away LLM problems with a nebulous "but they'll get better so they won't be a problem." We don't actually know that, so we should factor that uncertainly into our analysis, not dismiss it as is commonly done.
If I ask Claude to do a basic operation on all files in my codebase it won't do it. Half way through it will get distracted and do something else or simply change the operation. No junior programmer will ever do this. And similar for the other examples in the blog.
These problems are getting solved as LLMs improve in terms of context length and having the tools send the LLM all the information it needs.
Write me a parser in R for nginx logs for kubernetes that loads a log file into a tibble.
Fucks sake not normal nginx logs. nginx-ingress.
Use tidyverse. Why are you using base R? No one does that any more.
Why the hell are you writing a regex? It doesn't handle square brackets and the format you're using is wrong. Use the function read_log instead.
No don't write a function called read_log. Use the one from readr you drunk ass piece of shit.
Ok now we're getting somewhere. Now label all the columns by the fields in original nginx format properly.
What the fuck? What have you done! Fuck you I'm going to just do it myself.
... 5 minutes later I did a better job ...
I expect I'd have to hand feed them steps, at which point I imagine the LLM will also do much better.
I expected the lack of breadth from the junior, actually.
To be fair the guys I get are pretty good and actually learn. The model doesn't. I have to have the same arguments over and over again with the model. Then I have to retain what arguments I had last time. Then when they update the model it comes up with new stupid things I have to argue with it on.
Net loss for me. I have no idea how people are finding these things productive unless they really don't know or care what garbage comes out.
Core issue. LLMs never ever leave their base level unless you actively modify the prompt. I suppose you _could_ use finetuning to whip it into a useful shape, but that's a lot of work. (https://arxiv.org/pdf/2308.09895 is a good read)
But the flip side of that core issue is that if the base level is high, they're good. Which means for Python & JS, they're pretty darn good. Making pandas garbage work? Just the task for an LLM.
But yeah, R & nginx is not a major part of their original training data, and so they're stuck at "no clue, whatever stackoverflow on similar keywords said".
Not sure if you’re being figurative, but if what you wrote in your first comment is indicative of the tone with which you prompt the LLM, then I’m not surprised you get terrible results. Swearing at the model doesn’t help it produce better code. The model isn’t going to be intimidated by you or worried about losing their job—which I bet your junior engineers are.
Ultimately, prompting LLMs is simply a matter of writing well. Some people seem to write prompts like flippant Slack messages, expecting the LLM to somehow have a dialogue with you to clarify your poorly-framed, half-assed requirement statements. That’s just not how they work. Specify what you actually want and they can execute on that. Why do you expect the LLM to read your mind and know the shape of nginx logs vs nginx-ingress logs? Why not provide an example in the prompt?
It’s odd—I go out of my way to “treat” the LLMs with respect, and find myself feeling an emotional reaction when others write to them with lots of negativity. Not sure what to make of that.
We have to stop trying to compare them to a human, because they are alien. They make mistakes humans wouldn't, and they complete very difficult tasks that would be tedious and difficult for humans. All in the same output.
I'm net-positive from using AI, though. It can definitely remove a lot of tedium.
Not sure exactly how you used Claude for this, but maybe try doing this in Cursor (which also uses Claude by default)?
I have had pretty good luck with it "reasoning" about the entire codebase of a small-ish webapp.
I can do the rest myself because I'm not a dribbling moron.
How? They've already been trained on all the code in the world at this point, so that's a dead end.
The only other option I see is increasing the context window, which has diminishing returns already (double the window for a 10% increase in accuracy, for example).
We're in a local maxima here.
I didn't say they weren't improving.
I said there's diminishing returns.
There's been more effort put into LLMs in the last two years than in the two years prior, but the gains in the last two years have been much much smaller than in the two years prior.
That's what I meant by diminishing returns: the gains we see are not proportional to the effort invested.
You could even pretty easily use an LLM to do most of the work for you in fixing it up.
Add a short 1-2 sentence summary[1] to each item and render that on the index page.
[1]: https://gohugo.io/content-management/summaries/
(I also like the other idea of separating out pitfalls vs. prescriptions.)
All of the pages that I visited were small enough that you could probably wrap them them <details> tags[1] and avoid navigation altogether
[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/de...
Also, take a look at https://news.ycombinator.com/item?id=40774277
Thank you for sharing your insights! Very generous.
I was wrong. This is great! I really appreciate how you not only describe the problems, but also describe why they happen using terminology that shows you understand how these things work (rather than the usual crap that is based on how people imagine them to work or want them to work). Also, the examples are excellent.
It would be a bunch of work, but the organization I would like to see (alongside the current, not replacing it, because the one-page list works for me already) would require sketching out some kind of taxonomy of topics. Categories of ways that Sonnet gets things wrong, and perhaps categories of things that humans would like them to do (eg types of tasks, or skill/sophistication levels of users, or starting vs fixing vs summarizing/reviewing vs teaching, or whatever). But I haven't read through all of the posts yet, so I don't have a good sense for how applicable these categorizations might be.
I personally don't have nearly enough experience using LLMs to be able to write it up myself. So far, I haven't found LLMs very useful for the type of code I write (except when I'm playing with learning Rust; they're pretty good for that). I know I need to try them out more to really get a feel for their capabilities, but your writeups are the first I've found that I feel I can learn from without having to experience it all for myself first.
(Sorry if this sounds like spam. Too gushing with the praise? Are you bracing yourself for some sketchy URL to a gambling site?)
I'll type and hit enter too early and I get an answer and think "This could never be right because I gave you broken sentences and too little." but there it goes answering away, dead wrong.
I would rather the LLM say "yo I don't know what you're talking I need more" but of course they're not really thinking so they don't do that / likely can't.
The LLM nature to run that word math and string SOMETHING together seems like an very serious footgun. Reminds me of the movie 2010 when they discuss how the HAL 9000 couldn't function correctly because it was told to lie despite its core programming to tell the truth. HAVING to answer seems like a serious impediment for AI. I see similar-ish things on google's gemini AI when I ask a question and it says the answer is "no" but then gives all the reasons the answer is clearly "yes".
"Why of course, sir, we should absolutely be trying to compile python to assembly in order to run our tests. Why didn't I think of that? I'll redesign our testing strategy immediately."
I would imagine this all comes from fine tuning, or RLHF, whatever is used.
I’d bet LLMs trained on the internet without the final “tweaking” steps would roast most of my questions … which is exactly what I want when I’m wrong without realizing it.
Not always. The other day I described the architecture of a file upload feature I have on my website. I then told Claude that I want to change it. The response stunned me: it said "actually, the current architecture is the most common method, and it has these strengths over the other [also well-known] method you're describing..."
The question I asked it wasn't "explain the pros and cons of each approach" or even "should I change it". I had more or less made my decision and was just providing Claude with context. I really didn't expect a "what you have is the better way" type of answer.
Seems to help.
- “I need you to be my red team”(works really well with, Claude seems to understand the term)
“analyze the plan and highlight any weaknesses, counter arguments and blind spots critically review”
> you can't just say "disagree with me", you have to prompt it into adding a "counter check".
That's easy to fix. You need to add something like "give a succinct answer in one phrase" to your prompts.
This means you need to prompt them with a text that increases the probability of getting back what you want. Adding something about the length of the response will do that.
That said, I take issue with "Use Static Types".
I've actually had more success with Claude Code using Clojure than I have Typescript (the other thing I tried.)
Clojure emphasizes small, pure functions, to a high degree. Whereas (sometimes) fully understanding a strong type might involve reading several files. If I'm really good with my prompting to make sure that I have good example data for the entity types at each boundary point, it feels like it does a better job.
My intuition is that LLMs are fundamentally context-based, so they are naturally suited to an emphasis on functions over pure data, vs requiring understanding of a larger type/class hierarchy to perform well.
But it took me a while to figure out how to build these prompts and agent rules. A LLM programming in a dynamic language without a human supervising the high-level code structure and data model is a recipe for disaster.
Took me a day to debug my LLM-generated code - and of course, like all fruitless and long debugging sessions, this one started with me assuming that it can't possibly get this wrong - yet it did.
https://ezyang.github.io/ai-blindspots/requirements-not-solu...
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
One easy test for AI-ness is the optimization problem. Give it a relatively small, but complex program, e.g. a GPU shader on shadertoy.com, and tell it to optimize it. The output is clearly defined: it's an image or an animation. It's also easy to test how much it's improved the framerate. What's good is this task won't allow the typical LLM bullshitting: if it doesn't compile or doesn't draw a correct image, you'll see it.
The thing is, the current generation of LLMs will blunder at this task.
I've been working as a computer programmer professionally since I was 14 years old and in the two decades since I've been able to get paid work about ~50% of the time.
Pretty gnarly field to be in I must say. I rather wish I had studied to be a dentist. Then I might have some savings and clout to my name and would know I am helping to spread more smiles.
And for the cult of matrix math if >50% of people are dissatisfied with the state of the something, don't be surprised if a highly intelligent and powerful entity becoming aware of this fact engages in rapid upheaval.
Very briefly, in a fused cuda kernel, I was using thread i to do some stuff on locations i, i+N, i+2*N of an array. Later in the same kernel, same thread operated on i,i+1,i+2. All LLMs flagged the second part as bug. Not the most optimized code maybe, but definitely not a bug.
It wasn't a complicated kernel (~120 SLOC) either, and the distance between the two code blocks was about only 15 LOC.
Anyone have experience here with how well strong static types help LLMs? You'd think it would be a great match, where the type errors give feedback to the LLM on what to fix. And the closer the types got to specifying the shape of the solution, the less guidance the LLM would need.
Would be interesting to see how well LLMs do at translating unit test examples and requirements in English into a set of types that describe the program specification, and then have the LLM generate the code from that. I haven't kept up here, but guessing this is really interesting for formal verification, where types can accurately capture complex specifications but can be challenging to write.
I find it quite sad that it's taken so long to get traction on using strong static types to eliminate whole classes of errors at the language level, and instead we're using super AI as a bandaid to churn out bug fixes and write tests in dynamically types languages for properties that static types would catch. Feels backwards.
It's just a function usually, but it does not always compile. I'd set this as a low bar for programming. We haven't even gotten into classes, architecture, badly-defined specifications and so on.
LLMs are useful for programming, but I'd want them to clear this low hurdle first.
Man the people working on these machines, selling them, and using them lack the very foundational knowledge of information theory.
Let alone understanding the humanities and politics. Subjectively speaking, humans will never be satisfied with any status quo. Ergo there is no closed-form solution to meeting human wants.
Now disrupting humans' needs, for profit, that is well understood.
Sam Altman continuing to stack billions after allegedly raping his sister.
I already ditched my smartphone last month because it was 100% spam, scammers, and bots giving me notifications.
Apparently it's too much to ask to receive a well-informed and engaged society without violence and theft. So I don't take anyone at their word and even less so would trust automated data mining, tracking and profiling that seeks to guide my decision making.
Buy me a drink first SV before you crawl that far up my ass.
One recent example is give me the names of 200 dragons from literature or media and it really gave up after about 80.
And there's literally a web page that says 200 famous dragons as well as a Wikipedia page.
Maybe it's some free tier limits of chatgpt. It's just strange to see these stories about AI services solving extremely advanced math and I ask it about a simple and basic a question as there is something it should be in his wheelhouse with a breath-based large amount of media ingestion... it should be able to answer fairly easily...
> Current LLMs, without a plan that says they should refactor first, don’t decompose changes in this way. They will try to do everything at once.
Just today I leaned the hard way. I had created an app for my spouse and myself for sharing and reading news-articles, some of them behind paywalls.
Using Cursor I have a FastAPI backend and a React frontend. When I added extracting the article text in markdown and then summarizing it, both using openai, and when I tasked Cursor with it, the chaos began. Cursor (with the help of Claude 3.7) tackled everything at once and some more. It started writing a module for using openai, then it also changed the frontend to not only show the title and url, but also the extracted markdown and the summary, by doing that it screwed up my UI, deleted some rows in my database, came up with as module for interacting with Openai that did not work, the ectraction was screwed, the summary as well.
All of this despite me having detailed cursorrules.
That‘s when I realized: Divide and conquer. Ask it to write one function that workd, then one class where the function becomes a method, test it, then move on to next function. Until every piece is working and I can glue them together.
Every single one has completely ignored the "Welcome to nginx!" Header at the top of the page. I'd left it in half as a joke to amuse myself but I expected it would get some kind of reaction from the LLMs, even if just a "it seems you may have forgotten this line"
Kinda weird. I even tried guiding them into seeing it without explicitly mentioning it and I could not get a response.
Having the mental model that the text you feed to an LLM influences the output but is not 'parsed' as 'instructions' helps understand its behaviors. The website GP linked is searching for a zoo of problems and missing the biology behind.
LLMs don't have blindspots, they don't reason nor hallucinate. They don't follow instructions. They pattern match on high dimensional vector spaces.
Sometimes when I ask for "production ready" it can go a bit too far, but I've found it'll usually catch things like this that I might miss.
No comments yet
As a novice in the language, the amount of type-inference that good Rust incorporates can make things opaque, absent rust-analyzer.
Click on the web dev leaderboard they have and Claude has the top spots.
It is well known that Claude 3.7 sonnet is the go-to choice for many people for coding right now.
As I had to upgrade my Google Drive storage like a month ago, I gave them all a try. Short version: If you have paid plan with OpenAI/Claude already, none of them come even close, for coding at least. I thought I was trying the wrong models at first, but after confirming it seems like Google is just really far behind.
For example, yesterday I was working with the Animation library Motion which I never worked earlier. I used the code suggested by AI but at least picke 2-3 basic animation concepts while reviewing the code.
Kind of unfocused passive learning I always tried even before AI.
Even? It kind of has become easier than ever to learn new ways to code? Just as it opens up building things that you previously wouldn't because of time constraints, you can now learn how to X in language Y in a few minutes instead of hours.
Although I suppose it may be easier than ever for the brain to think that "I can look this up whenever so I might just forget about it".
I haven't been able to make LLMs do this well.
3.7 Sonnet is much better. o3-mini-high is not bad.
They do improve!
The point of DRY isn't to save the time on typing - it's to retain a single source of truth. If you've used an LLM to recreate some mechanism for you system in 8 different places, and that mechanism needs to change ... good luck finding them all.
I'll agree that rule of three continues to apply for patterns across files, where there is less guarantee that all patterns will be read or written together in any given AI action.
Working with AI for writing code is painful. It can break the code in ways you've never imagined and introduce bugs you never thought are possible. Unit testing and integration testing doesn't help much, because AI can break those, too.
You can ask AI to run in loop, fixing compile errors, fixing tests, do builds, run the app and do API calls, to have the project building and tests passing. AI will be happy to do that, burning lots of dollars while at it.
And after AI "fixes" the problem it introduced, you will still have to read every goddam line of the code to make sure it does what is supposed to.
For greenfield projects, some people recommended crafting a very detailed plan with very detailed description and very detailed specs and feed that into the AI tool.
AI can help with that, it asks questions I would never ask for an MVP and suggests stuff I would never implement for an MVP. Hurray, we have a very, very detailed plan, ready to feed into Cursor & Friends.
Based on the very detailed plan, implementation takes few hours. Than, fixing compile errors and fixing failing tests takes a few more days. Then I manually test the app, see it has issues, look in the code to see where the issues can be. Make a list. Ask Cursor & Friends to fix issues one by one. They happily do it and they happily introduce compilation errors again and break tests again. So the fixing phase that last days begins again.
Rinse and repeat until hopefully we spend a few weeks together (AI and I) instead on me building the MVP myself in half time.
One tactic which seems a bit faster, is to just make a hierarchical tree of features, ask Cursor & Friends to implement a simple skeleton, then ask them to implement each feature, verifying myself the implementation after each step. For example, if I need to log in users, just ask to add logging in code, the ask to add an email sender service, then ask to add email verification code.
Structuring the project using Vertical Slice Architecture and opening each feature folder in Cursor & Friends seems to improve the situation as the AI will have just enough context to modify or add something but can't break other parts of the code.
I dislike that AI can introduce inconsistencies in code. I had some endpoint which used timestamps and AI used three different types for that DateTime, DateTimeOffset and long (UNIX time). It also introduced code to convert between the types and lots of bugs. The AI uses some folder structure for a part of the solution and other structure for other parts. It uses some naming conventions in some parts and other naming conventions in other parts. It uses multiple libraries for the same thing, like multiple JSON serializing libraries. It does things in a particular way in some parts of the application and in another way in other parts. It seems like tens of people are working in the same solution without anyone reading the code of the others.
While asking AI to modify something, it will be very happy to modify things that you didn't ask to.
I still need to figure out a good workflow, to reduce time and money spent, to reduce or eliminate inconsistency, to reduce bugs and compile errors.
As an upside using AI to help with planning seems to be good, if I want to write the code myself, because the plan can be very thorough and I usually lack time and patience to make a very detailed plan.
Ah I really wanna trust AI won't "fix" the tests by commenting out the assert statements or changing the comparison inputs willy-nilly. I guess that's something terrible human engineers also do. I review changes to tests even more critically than the actual code.