> We don't just keep adding more words to our context window, because it would drive us mad.
That, and we also don't focus only on the textual description of a problem when we encounter one. We don't see the debugger output and go "how do I make this bad output go away?!?". Oh, I am getting an authentication error. Well, maybe I should just delete the token check for that code path...problem solved?!
No. Problem very much not-solved. In fact, problem very much very bigger big problem now, and [Grug][1] find himself reaching for club again.
Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem. I am getting an auth error...ok, what happens when the token is verified...oh, look, the problem is not the authentication at all...in fact there is no error! The test was simply bad and tried to call a higher-privilege function as a lower-privilege user. So, the test needs to be fixed. And also, even though it isn't per se an error, the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".
[1]: https://grugbrain.dev
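To make that concrete, here's a minimal sketch of the distinction (plain Python, hypothetical names; using 403 for the privilege case, as a sibling comment suggests):

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    authenticated: bool
    role: str  # e.g. "viewer" or "admin"

def check_access(user: User | None, required_role: str) -> int:
    """Return the HTTP status an endpoint should use for this caller."""
    if user is None or not user.authenticated:
        # No valid token at all: the caller never proved who they are.
        return 401  # Unauthorized (really "unauthenticated")
    if user.role != required_role:
        # Authenticated, but not allowed to call this higher-privilege function.
        return 403  # Forbidden: the credentials are fine, the privileges are not
    return 200

# The bad test called an admin-only function as a viewer, then blamed
# "authentication" when it saw a 4xx response.
assert check_access(None, "admin") == 401
assert check_access(User("grug", True, "viewer"), "admin") == 403
assert check_access(User("grug", True, "admin"), "admin") == 200
```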
Programmers are mostly translating business rules to the very formal process execution of the computer world. And you need to know both what the rules mean and how the computer works (or at least how the abstracted version you're working with works). The translation is messy at first, which is why you need to revise it again and again. Especially when later rules come along, challenging all the assumptions you've made or even contradicting each other.
Even translations between human languages (which allow for ambiguity) can be messy. Imagine if the target language is for a system that will do exactly as told unless someone has qualified those actions as bad.
livid-neuro · 52m ago
The first cars broke down all the time. They had a limited range. There wasn't a vast supply of parts for them. There wasn't a vast industry of experts who could work on them. There wasn't a vast network of fuel stations to provide energy for them. The horse was a proven method.
What an LLM cannot do today is almost irrelevant in the tide of change upon the industry. The fact is, with continued improvement, what an LLM cannot do today it may well do tomorrow.
jerf · 28m ago
AI != LLM.
We can reasonably speak about certain fundamental limitations of LLMs without those being claims about what AI may ever do.
I would agree they fundamentally lack models of the current task and that it is not very likely that continually growing the context will solve that problem, since it hasn't already. That doesn't mean there won't someday be an AI that has a model much as we humans do. But I'm fairly confident it won't be an LLM. It may have an LLM as a component but the AI component won't be primarily an LLM. It'll be something else.
lbrandy · 11m ago
> has a model much as we humans do
The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination. This presumption seems to be omnipresent in these conversations and I find it so strange. Alpha Zero doesn't model chess "the way we do".
byteknight · 18m ago
I have to disagree. People who say LLMs do not qualify as AI are the same people who will continue to move the goalposts for AGI. "Well, it doesn't do this!". No one here is trying to replicate a human brain or condition in its entirety. They just want to replicate the thinking ability of one. LLMs represent the closest parallel we have experienced thus far to that goal. Saying that LLMs are not AI feels disingenuous at best and purposely dishonest at worst (perhaps perceived as staving off the impending demise of a profession).
The sooner people stop worrying about which label fits LLMs best, the sooner they can find the things they (LLMs) absolutely excel at and improve their (the user's) workflows.
Stop fighting the future. It's not replacing anyone right now. Later? Maybe. But right now the developers and users fully embracing it are experiencing productivity boosts unseen previously.
Language is what people use it as.
Skepticism isn't the same thing as fighting the future.
I will call something AGI when it can reliably solve novel problems it hasn't been pre-trained on. That's my goal post and I haven't moved it.
sarchertech · 1m ago
> the developers and users fully embracing it are experiencing productivity boosts unseen previously
This is the kind of thing that I disagree with.
You think that LLMs are a bigger productivity boost than moving from physically rewiring computers to using punch cards, from running programs as batch processes with printed output to getting immediate output, from programming in assembly to higher-level languages, or even just moving from enterprise Java to Rails?
Night_Thastus · 39m ago
The difference is that the weaknesses of cars were problems of engineering, and some of infrastructure. Neither is very hard to solve, though both take time. The fundamental way cars operated worked; it just needed revision, sanding off rough edges.
LLMs are not like this. The fundamental way they operate, the core of their design is faulty. They don't understand rules or knowledge. They can't, despite marketing, really reason. They can't learn with each interaction. They don't understand what they write.
All they do is spit out the most likely text to follow some other text based on probability. For casual discussion about well-written topics, that's more than good enough. But for unique problems in a non-English language, they struggle. They always will. It doesn't matter how big you make the model.
They're great for writing boilerplate that has been written a million times with different variations - which can save programmers a LOT of time. The moment you hand them anything more complex, you're asking for disaster.
skydhash · 47m ago
When the first cars broke down, people were not saying: One day, we’ll go to the moon with one of these.
LLMs may get better, but they will not be what people are clamoring for them to be.
tobr · 32m ago
The article has a very nuanced point about why it’s not just a matter of today’s vs tomorrow’s LLMs. What’s lacking is a fundamental capacity to build mental models and learn new things specific to the problem at hand. Maybe this can be fixed in theory with some kind of on-the-fly finetuning, but it’s not just about more context.
brandon272 · 16m ago
The question is, when is “tomorrow”?
Dismissing a concern with “LLMs/AI can’t do it today but they will probably be able to do it tomorrow” isn’t all that useful or helpful when “tomorrow” in this context could just as easily be “two months from now” or “50 years from now”.
jedimastert · 46m ago
> The first cars broke down all the time. They had a limited range. There wasn't a vast supply of parts for them. There wasn't a vast industry of experts who could work on them.
I mean, there was and then there wasn't. All of those things are shrinking fast because we handed over control to people who care more about profits than customers because we got too comfy and too cheap, and now right to repair is screwed.
Honestly, I see llm-driven development as a threat to open source and right to repair, among the litany of other things
apwell23 · 5m ago
ugh.. no analogies pls
aaroninsf · 27m ago
My preferred formulation is Ximm's Law,
"Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Lemma: any statement about AI which uses the word "never" to preclude some feature from future realization is false.
Lemma: contemporary implementations have almost always already been improved upon, but are unevenly distributed."
moregrist · 18m ago
Replace “AI” with “fusion” and you immediately see the problem: there’s no concept of timescale or cost.
And with fusion, we already have a working prototype (the Sun). And if we could just scale our tech up enough, maybe we’d have usable fusion.
latexr · 5m ago
> Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
That is too reductive and simply not true. Contemporary critiques of AI include that they waste precious resources (such as water and energy) and accelerate bad environmental and societal outcomes (such as climate change, the spread of misinformation, loss of expertise), among others. Critiques go far beyond “hur dur, LLM can’t code good”, and those problems are both serious and urgent. Keep sweeping critiques under the rug because “they’ll be solved in the next five years” (eternally away) and it may be too late. Critiques have to take into account the now and the very real repercussions already happening.
ai-christianson · 15m ago
I take a more pragmatic approach -- everything is human in the loop. It helps me get the job done faster and with higher quality, so I use it.
reactordev · 19m ago
While I agree with you - the whole grug brain thing is offensive, because we have all been grug at some point.
chuckadams · 39m ago
An AI might tell you to use a 403 for insufficient privileges instead of 401.
trod1234 · 52m ago
Isn't the 401 for LLMs the same single undecidable token?
Doesn't this basically go to the undecidable nature of math in CS?
Put another way: you have an Excel roster corresponding to people with accounts, where some need to have their accounts shut down, but you only have their first and last names as identifiers, and the pool is sufficiently large that there is more than one person per given set of names.
You can't shut down all accounts with a given name, and there is no unique identifier. How do you solve this?
You have to ask and be given that unique identifier that differentiates between the undecidable. Without that, even the person can't do the task.
The person can make guesses, but those guesses are just hallucinations with a significant probability of a bad outcome on repeat.
At a core level I don't think these types of issues are going to be solved. Quite a lot of people would be unable to solve this and struggle with this example (when not given the answer, or hinted at the solution in the framing of the task; i.e. when they just have a list of names and are told to do an impossible task).
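A small sketch of that roster situation (made-up data, plain Python): grouping by name shows which shutdown requests are decidable and which need a unique identifier before anyone, human or LLM, can act.

```python
from collections import defaultdict

# Accounts in the system: (account_id, first, last)
accounts = [
    (101, "Ana", "Silva"),
    (102, "Ana", "Silva"),   # same name, different person
    (103, "Bo", "Chen"),
]

# Shutdown requests from the spreadsheet: names only, no unique identifier
requests = [("Ana", "Silva"), ("Bo", "Chen")]

by_name = defaultdict(list)
for account_id, first, last in accounts:
    by_name[(first, last)].append(account_id)

for name in requests:
    matches = by_name.get(name, [])
    if len(matches) == 1:
        print(f"{name}: shut down account {matches[0]}")
    else:
        # More than one match: guessing here is just a hallucination.
        # The only correct move is to go back and ask for a unique ID.
        print(f"{name}: ambiguous ({len(matches)} matches), need a unique identifier")
```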
throwaway1004 · 37m ago
That reference link is a wild ride of unqualified, cartoonish passive-aggression; the cute link to the author's "swag" is the icing on the cake.
Coincidentally, I encountered the author's work for the first time only a couple of days ago as a podcast guest; he vouches for the "Dirty Code" approach while straw-manning Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
I guess this stuff sells t-shirts and mugs /rant
Arainach · 27m ago
>Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
Have you read Uncle Bob? There's no need to strawman: Bob's examples in Clean Code are absolutely nuts.
Here's a nice writeup that includes one of Bob's examples verbatim in case you've forgotten: https://qntm.org/clean
Here's another: https://gerlacdt.github.io/blog/posts/clean_code/
> big brained developers are many, and some not expected to like this, make sour face
chollida1 · 43m ago
Most of this might be true for LLMs, but years of investing experience has created a mental model of looking for the tech or company that sucks and yet keeps growing.
People complained endlessly about the internet in the early to mid 90s: it's slow, static, most sites had under-construction signs on them, your phone modem would just randomly disconnect. The internet did suck in a lot of ways, and yet people kept using it.
Twitter sucked in the mid 2000s, we saw the fail whale weekly, and yet people continued to use it for breaking news.
Electric cars sucked: no charging, low range, expensive. And yet no matter how much people complain about them, they keep getting better.
Phones sucked: pre-3G was slow, there wasn't much you could use them for before app stores, and the cameras were potato quality. And yet people kept using them while they improved.
Always look for the technology that sucks and yet people keep using it because it provides value. LLMs aren't great at a lot of tasks, and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.
LLMs may not be able to build software today, but they are 10x better than where they were in 2022 when we first started using ChatGPT. It's pretty reasonable to assume in 5 years they will be able to do these types of development tasks.
freehorse · 17m ago
At the same time, there have been expectations about many of these that did not meet reality at any point. Much of this is due to physical limitations that are not trivial to overcome. The internet gets faster and more stable, but the metaverse taking over did not happen, partially because many people still get nausea after a bit and no 10x scaling fixed that.
A lot of what you described as "sucked" was not seen as "sucking" at the time. Nobody complained about phones being slow because nobody expected to use phones the way we do today. The internet was slow and less stable, but nobody complained that they couldn't stream 4K movies, because nobody expected to. This is anachronistic.
The fact that we can see how some things improved in X or Y manner does not mean that LLMs will improve the way you think they will. Maybe we invent a different technology that does a better job. After all, it was not dial-up itself that became faster, and I don't think there were fanatics saying that dial-up technology would give us 1 Gbps speeds. The problem with AI is that because scaling up compute has provided breakthroughs, some think that somehow, with scaling up compute and some technical tricks, we can solve all the current problems. I don't think anybody can say that we cannot invent a technology that overcomes these, but whether LLMs are the technology that can just keep scaling has been in doubt. In the last year or so there has been a lot of refinement and broadening of applications, but nothing like a breakthrough.
runako · 17m ago
> Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality
This is a big rewrite of history. Phones took off because before mobile phones the only way to reach a person was to call when they were at home or their office. People were unreachable for timespans that now seem quaint. Texting brought this into async. The "potato" cameras were the advent of people always having a camera with them.
People using the Nokia 3210 were very much not anticipating when their phones would get good, they were already a killer app. That they improved was icing on the cake.
bunderbunder · 20m ago
This is such selective hindsight, though. We remember the small minority of products that persisted and got better. We don't remember the majority of ones that fizzled out after the novelty wore off, or that ultimately plateaued.
Me, I agree with the author of the article. It's possible that the technology will eventually get there, but it doesn't seem to be there now. And I prefer to make decisions based on present-day reality instead of just assuming that the future I want is the future I'll get.
chollida1 · 12m ago
> This is such selective hindsight, though.
Ha;) Yes, when you provide examples to prove your point they are, by definition, selective:)
You are free to develop your own mental models of what technology and companies to invest in. I was only trying to share my 20 years of experience with investing to show why you shouldn't discard current technology because of its current limits.
ausbah · 37m ago
those are really good points, but LLMs have really started to plateau in their capabilities, haven't they? the improvements from gpt2-class models to 3 were much bigger than 3 to 4, which was only somewhat bigger than 4 to 5
most of the vibe shift I think I've seen in the past few months to using LLMs in the context of coding has been improvements in dataset curation and UX, not fundamentally better tech
worldsayshi · 28m ago
> LLMs have really started to plateau
That doesn't seem unexpected. Any technological leap seems to happen in sigmoid-like steps. When a fruitful approach is discovered, we run with it until diminishing returns set in. Often enough, a new approach opens doors to other approaches that build on it. It takes time to discover the next step in the chain, but when we do we get a new sigmoid-like leap. Etc...
worldsayshi · 17m ago
Personally my bet for the next fruitful step is something in line with what Victor Taelin [1] is trying to achieve.
I.e. combining new approaches around old-school "AI" with GenAI. That's probably not exactly what he's trying to do, but maybe somewhere in the ballpark.
1 - https://x.com/victortaelin
All the other things he mentioned didn't rely on breakthroughs; LLMs really do seem to have reached a plateau and need a breakthrough to push along to the next step.
Thing is breakthroughs are always X years away (50 for fusion power for example).
The only example he gave that actually was kind of a big deal was mobile phones, where capacitive touchscreens really did catapult the technology forward. But it is not like cellphones weren't already super useful, profitable, and getting better over time before capacitive touchscreens were introduced.
Maybe broadband to the internet also qualifies.
NitpickLawyer · 11m ago
> but LLMs have really started to plateau off on their capabilities haven’t they?
Uhhh, no?
In the past month we've had:
- LLMs (3 different models) getting gold at IMO
- gold at IOI
- beat 9/10 human developers at AtCoder heuristics (optimisation problems), with the single human who actually beat the machine saying he was exhausted and next year it'll probably be over.
- agentic coding that actually works, and works for 30-90 minute sessions while staying coherent and actually finishing tasks.
- 4-6x reduction in price for top-tier (SotA?) models. oAI's "best" model now costs $10/MTok, while retaining 90+% of the performance of their previous SotA models that were $40-60/MTok.
- several "harnesses" being released by every model provider. Claude code seems to remain the best, but alternatives are popping off everywhere - geminicli, opencoder, qwencli (forked, but still), etc.
- open-source models that are getting close to SotA, again. Being 6-12 months behind (depending on who you ask), open source and cheap to run (~$2/MTok on some providers).
I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where the number can only go up so far before the benchmark becomes useless. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic whatever is what matters. And those capabilities have only improved. And will continue to improve, with better data, better signals, better training recipes.
Why do you think every model provider is heavily subsidising coding right now? They all want that sweet sweet data & signals, so they can improve their models.
cameronh90 · 13m ago
I'm not sure I'd describe it as a plateau. It might be, but I'm not convinced. Improvements are definitely not as immediately obvious now, but how much of that is due to it being more difficult to accurately gauge intelligence above a certain point? Or even that the marginal real-life utility of intelligence _itself_ starts to plateau?
A (bad) analogy would be that I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious - but the improvement when going from IQ 70 to Einstein is much harder to assess and arguably not that useful for most tasks.
I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try to use the older model again and be quite surprised at how much worse it is.
overgard · 17m ago
I'm not a fan of the argument that LLMs have gotten X times better in the past few years, so they will continue to get X times better in the next few years. From what I can see, all the growth has mostly come from optimizing a few techniques, but I'm not convinced that we aren't going to get stuck in a local maximum (actually, I think that's the most likely outcome).
Specifically, to me the limitation of LLMs is discovering new knowledge and being able to reason about information they haven't seen before. LLMs still fail at things like counting the number of b's in the word blueberry, or staying on track when random cat facts are inserted into word problems (both issues I've seen appear in the last month).
I don't mean that to say they're a useless tool, I'm just not into the breathless hype.
JimDabell · 51m ago
LLMs can’t build software because we are expecting them to hear a few sentences, then immediately start coding until there’s a prototype. When they get something wrong, they have a huge amount of spaghetti to wade through. There’s little to no opportunity to iterate at a higher level before writing code.
If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
yup. I started a fully autonomous, 100% vibe-coded side project called steadytext, mostly expecting it to hit a wall, with LLMs eventually struggling to maintain or fix any non-trivial bug in it. turns out I was wrong: not only has claude opus been able to write up a pretty complex 7k LoC project with a python library, a CLI, _and_ a postgres extension, it actively maintains it and is able to fix filed issues and feature requests entirely on its own. It is completely vibe coded, I have never even looked at 90% of the code in that repo. it has full test coverage, passes CI, and we use it in production!
granted, it needs careful planning for CLAUDE.md, and all issues and feature requests need a lot of in-depth specifics, but it all works. so I am not 100% convinced by this piece. I'd say it's def not easy to get coding agents to manage and write software effectively, and especially hard to do so in existing projects, but my experience has been across that entire spectrum. I have been sorely disappointed in coding agents and even abandoned a bunch of projects and dozens of pull requests, but I have also seen them work.
you can check out that project here: https://github.com/julep-ai/steadytext/
Saying LLMs are not good at x or y is akin to saying a brain is useless without a body. Which is obvious. The success of agentic coding solutions depends on not just the model but also the system that the developers built around the model. And the companies that will succeed in this area are going to be the companies that focus on building sophisticated and capable systems that utilize said models. We are still in very early days where most organizations are only coming to terms with this realization... Only a few of them utilize this concept to the fullest, Claude Code being the best example. The Claude models are specifically trained for tool calling and other capabilities, and the Claude Code CLI complements and takes advantage of those capabilities to the fullest; things like context management, among other capabilities, are extremely important ...
9cb14c1ec0 · 1h ago
> what they cannot do is maintain clear mental models
The more I use claude code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.
dlivingston · 52m ago
Reminds me of how Google's Genie 3 can only run for a ~minute before losing its internal state [0].
My gut feeling is that this problem won't be solved until some new architecture is invented, on the scale of the transformer, which allows for short-term context, long-term context, and self-modulation of model weights (to mimic "learning"). (Disclaimer: hobbyist with no formal training in machine learning.)
[0]: https://news.ycombinator.com/item?id=44798166
It’s the nature of formal systems. Someone needs to actually do the work of defining those rules, or have a smaller set of rules that can generate the larger set. But anytime you invent a rule, that means a few things that are possible can’t be represented in the system. You’re mostly hoping that those things aren’t meaningful.
LLM techniques allow us to extract rules from text and other data. But those data are not representative of a coherent system. The result itself is incoherent and lacks anything that wasn’t part of the data. And that’s normal.
It’s the same as having a mathematical function. Every point that it maps to is meaningful; everything else may as well not exist.
elephanlemon · 41m ago
I’ve been thinking about this recently… maybe a more workable solution at the moment is to run a hierarchy of agents, with the top level one maintaining the general mental model (and not filling its context with anything much more than “next agent down said this task was complete”). Definitely seems like anytime you try to have one Code agent run everything it just goes off the rails sooner or later, ignoring important details from your original instructions, failing to make sure it’s adhering to CLAUDE.md, etc. I think you can do this now with Code’s agent feature? Anyone have strategies to share?
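Roughly what I have in mind, as a sketch (all names hypothetical; `run_agent` stands in for however you spawn a sub-agent): the supervisor keeps only one-line summaries in its context instead of the sub-agents' full transcripts.

```python
from dataclasses import dataclass, field

def run_agent(task: str) -> tuple[bool, str]:
    """Placeholder for spawning a sub-agent on a single task.
    In a real setup this would call the coding agent and return
    (success, one_line_summary) instead of this stub."""
    return True, f"done: {task}"

@dataclass
class Supervisor:
    goal: str
    notes: list[str] = field(default_factory=list)  # the only context we keep

    def delegate(self, task: str) -> None:
        ok, summary = run_agent(task)
        # Keep a one-line summary, not the sub-agent's full transcript,
        # so the top-level mental model doesn't drown in details.
        self.notes.append(summary if ok else f"FAILED: {task}")

    def context(self) -> str:
        return "\n".join([f"Goal: {self.goal}", *self.notes])

supervisor = Supervisor(goal="add rate limiting to the API")
for task in ["write failing tests", "implement middleware", "update docs"]:
    supervisor.delegate(task)
print(supervisor.context())
```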
skydhash · 33m ago
The telephone game doesn’t work that well. That’s how an emperor can be isolated in his palace and every edict becomes harmful. It’s why the architect/developer split didn’t work. You need to be aware of all the context to make sure you’ve done a good job.
SoftTalker · 16m ago
Is this really that different from the "average" programmer, especially a more junior one?
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I see this constantly with mediocre developers. Flailing, trying different things, copy-pasting from StackOverflow without understanding, ultimately deciding the compiler must have a bug, or cosmic rays are flipping bits.
layer8 · 2m ago
The article explicitly calls out that that’s what they are looking for in a competent software engineer. That incompetent developers exist, and that junior developers tend to not be very competent yet, doesn’t change anything about that. The problem with LLMs is that they’re already the final product of training/learning, not the starting point.
That and other tricks have only made me slightly less frustrated, though.
cmrdporcupine · 56m ago
Honestly it forces you -- rightfully -- to step back and be the one doing the planning.
You can let it do the grunt coding, and a lot of the low level analysis and testing, but you absolutely need to be the one in charge on the design.
It frankly gives me more time to think about the bigger picture within the amount of time I have to work on a task, and I like that side of things.
There's definitely room for a massive amount of improvement in how the tool presents changes and suggestions to the user. It needs to be far more interactive.
mock-possum · 27m ago
That’s my experience as well - I’m the one with the mental model, my responsibility is using text to communicate that model to the LLM using language it will recognize from its training data to generate the code to follow suit.
My experience with prompting LLMs for codegen is really not much different from my experience with querying search engines - you have to understand how to ‘speak the language’ of the corpus being searched, in order to find the results you’re looking for.
micromacrofoot · 25m ago
Yes this is exactly it, you need to talk to Claude about code on a design/architecture level... just telling it what you want the code to output will get you stuck in failure loops.
I keep saying it and no one really listens: AI really is advanced autocomplete. It's not reasoning or thinking. You will use the tool better if you understand what it can't do.
It's a good tool when you use it within its limitations.
1zael · 18m ago
> "when test fail, they are left guessing as to whether to fix the code or the tests"
One thing I've found that helps is using the "Red-Green-Refactor" language. We're in the RED phase - the test should fail. We're in the GREEN phase - make this test pass with minimal code. We're in the REFACTOR phase - improve the code without breaking tests.
This helps the LLM understand the TDD mental model rather than just seeing "broken code" that needs fixing.
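Concretely, the phase framing I prepend to the agent's task looks something like this (the wording is just my own, nothing official):

```python
# Phase prompts prepended to the agent's task (wording is only an example).
TDD_PHASES = {
    "RED": (
        "We are in the RED phase. Write a test for the new behaviour. "
        "It MUST fail right now; do not touch the implementation."
    ),
    "GREEN": (
        "We are in the GREEN phase. Make the failing test pass with the "
        "smallest change possible. Do not modify the test."
    ),
    "REFACTOR": (
        "We are in the REFACTOR phase. Improve the code's structure. "
        "All tests must stay green; behaviour must not change."
    ),
}

def build_prompt(phase: str, task: str) -> str:
    # Combine the phase rules with the concrete task description.
    return f"{TDD_PHASES[phase]}\n\nTask: {task}"

print(build_prompt("GREEN", "return 403 for insufficient privileges"))
```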
emilecantin · 1h ago
Yeah, I think it's pretty clear to a lot of people that LLMs aren't at the "build me Facebook, but for dogs" stage yet. I've had relatively good success with more targeted tasks, like "Add a modal that does this, take this existing modal as an example for code style". I also break my problem down into smaller chunks, and give them one by one to the LLM. It seems to work much better that way.
generalizations · 55m ago
These LLM discussions really need everyone to mention what LLM they're actually using.
> AI is awesome for coding! [Opus 4]
> No AI sucks for coding and it messed everything up! [4o]
Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.
taormina · 34m ago
I've used a wide variety of the "best" models, and I've mostly settled on Opus 4 and Sonnet 4 with Claude Code, but they don't ever actually get better. Grok 3-4 and GPT4 were worse, but like, at a certain point you don't get brownie points for not tripping over how low the bar is set.
omnicognate · 50m ago
What the article says is as true of Opus 4 as any other LLM.
troupo · 45m ago
> These LLM discussions really need everyone to mention what LLM they're actually using.
Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No
Do we know the level of expertise the people have? No.
Is the expertise in the same domain, codebase, language that they apply LLMs to? We don't know.
How much additional work did they have reviewing, fixing, deploying, finishing etc.? We don't know.
--- end quote ---
And that's just the tip of the iceberg. And that is one iceberg before we hit another: we're trying to blindly reverse-engineer a non-deterministic black box inside a provider's black box.
Transfinity · 55m ago
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.
apples_oranges · 49m ago
It means something is not understood. Could be the product, the code in question, or computers in general. 90% of coders seem to be lacking foundational knowledge imho. Not trying to hate on anyone, but when you have the basics down, you can usually see quickly where the problem is, or at least where it must be.
aniviacat · 16m ago
Unfortunately, "usually" is a key word here.
pjmlp · 34m ago
Only because most AI startups are doing it wrong.
I don't want a chat window.
I want AI workflows as part of my IDE, like Visual Studio, IntelliJ, and Android Studio are finally going after.
I want voice-controlled actions in my native language.
Knowledge across everything on the project for doing code refactorings, static analysis with an AI feedback loop, generating UI based on handwritten sketches, programming on the go using handwriting, source control commit messages out of code changes,...
lordnacho · 45m ago
I think I agree with the idea that LLMs are good at the junior level stuff.
What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.
This is an age-old thing: most people don't think it really matters how fast you can type. I suppose the steelman is, most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there's all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter: the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't make you get to the goal faster, since finding the overall design is the limiting thing.
But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.
It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world, you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
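In Python terms the chore looks roughly like this (a sketch; here the "compiler" is a type checker like mypy or pyright flagging the non-exhaustive match):

```python
from enum import Enum, auto
from typing import assert_never  # Python 3.11+

class OrderState(Enum):
    NEW = auto()
    SHIPPED = auto()
    REFUNDED = auto()  # the newly added value

def describe(state: OrderState) -> str:
    match state:
        case OrderState.NEW:
            return "order received"
        case OrderState.SHIPPED:
            return "on its way"
        # Until a REFUNDED arm is added above, the type checker reports that
        # `state` can still be OrderState.REFUNDED at assert_never, pointing
        # at every match that needs a new section.
        case _:
            assert_never(state)
```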
But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.
LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.
The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.
It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.
ambicapter · 31m ago
> I had a whole laundry list of things I wanted to change about my codebase
I always have a whole bunch of things I want to change in the codebase I'm working on, and the bottleneck is review, not me changing that code.
lordnacho · 21m ago
Those are the same thing though? You change the code, but can't just edit it without testing it.
LLM also helps you test.
ontigola · 13m ago
Great, concise article. Nothing important to add, except that AI snake-oil salesmen will continue spreading their exaggerations far and wide. At least we who are truly in this business agree on the facts.
Onewildgamer · 29m ago
I wonder if some of this can be solved by removing some wrongly set-up context in the LLM. Or by getting a short summary, restructuring it, and feeding it again to a fresh LLM context.
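Something like this rough sketch, where `complete` is a stand-in for whichever chat-completion call you use (not a real API):

```python
def complete(prompt: str) -> str:
    """Stand-in for a chat-completion call to whichever provider you use."""
    raise NotImplementedError

def restart_with_summary(transcript: list[str], next_task: str) -> str:
    # Ask the model to compress the old, possibly polluted context...
    summary = complete(
        "Summarize the decisions and constraints below in 10 bullet points. "
        "Drop dead ends and anything contradicted later:\n\n" + "\n".join(transcript)
    )
    # ...then start a fresh context seeded only with that summary.
    return complete(f"Project summary:\n{summary}\n\nNext task: {next_task}")
```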
empath75 · 1h ago
It's good at micro, but not macro. I think that will eventually change with smarter engineering around it, larger context windows, etc. Never underestimate how much code engineers will write to avoid writing code.
pmdr · 55m ago
> It's good at micro, but not macro.
That's what I've found as well. Start describing or writing a function, include the whole file for context and it'll do its job. Give it a whole codebase and it will just wander in the woods burning tokens for ten minutes trying to solve dependencies.
saghm · 57m ago
> Context omission: Models are bad at finding omitted context.
> Recency bias: They suffer a strong recency bias in the context window.
> Hallucination: They commonly hallucinate details that should not be there.
To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgot a crucial detail I learned months ago that would have helped me debug something much faster than if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.
skydhash · 25m ago
That’s a communication issue. You should learn how to ask the right questions and document the answers given. What I’ve seen is developers assuming stuff when they should just reach out to team members. Or trying stuff instead of reading documentation. Or trying to remember info instead of noting it down somewhere.
sneak · 34m ago
Am I the only one continuously astounded at how well Opus 4 actually does build mental models when prompted correctly?
I find Sonnet frequently loses the plot, but Opus can usually handle it (with sufficient clarity in prompting).
trod1234 · 1h ago
I think most people trying to touch on this topic don't consider this byline alongside other similar bylines like "Why LLMs can't recognize themselves looping", "Why LLMs can't express intent", or "Why LLMs can't recognize truth/falsity, or confidence levels of what they know vs don't know"; with a little thought, these other bylines basically equate to Computer Science halting problems, or the undecidable nature of mathematics.
Taken to a next step, recognizing this makes the investment in such a moonshot pipedream (overcoming these inherent problems in a deterministic way) recklessly negligent.
antihipocrat · 41m ago
..."(at least for now) you are in the drivers seat, and the LLM is just another tool to reach for."
Improvements in model performance seem to be approaching the peak rather than demonstrating exponential gains. Is the quote above where we land in the end?
revskill · 45m ago
They can read the error and figure out the best way to resolve it. That is the best part about LLMs; no human can do it better than an LLM. But they are not your mind reader. That is where things fall apart.
Nickersf · 55m ago
I think they're another tool in the toolbox not a new workshop. You have to build a good strategy around LLM usage when developing software. I think people are naturally noticing that and adapting.
jmclnx · 1h ago
I am not a fan of today's concept of "AI", but to be fair, building today's software is not for the faint of heart; very few people get it right on try 1.
Years ago I gave up compiling these large applications all together. I compiled Firefox via FreeBSD's (v8.x) ports system, that alone was a nightmare.
I cannot imagine what it would be like to compile GNOME3 or KDE or Libreoffice. Emacs is the largest thing I compile now.
anotherhue · 1h ago
I suggest trying Nix: by being reproducible, those nasty compilation demons get solved once and for all. (And usually by someone else.)
trod1234 · 43m ago
The problem with Nix is that it's often claimed to be reproducible, but the proof isn't really there because of the existence of collisions. The definition of reproducible is taken in such an isolated context as to be almost absurd.
While a collision hasn't yet been found for a SHA256 package on Nix, by the pigeonhole principle they exist, and the computer will not be able to decide between the two packages in such a collision, leading to system-level failure with errors that have no link to cause (due to the properties involved, and longstanding CS problems in computation).
These things generally speaking contain properties of mathematical chaos which is a state that is inherently unknowable/unpredictable that no admin would ever approach or touch because its unmaintainable. The normally tightly coupled error handling code is no longer tightly coupled because it requires matching a determinable state (CS computation problems, halting/decidability).
Non-deterministic failure domains are the most costly problems to solve because troubleshooting, which leverages properties of determinism, won't work.
This leaves you only a strategy of guess-and-check, which requires intimate knowledge of the entire system stack without abstractions present.
codr7 · 29m ago
Well, welcome to the club of awareness :)
mccoyb · 14m ago
This is a low-information-density blog post. I've really liked Zed's blog posts in the past (especially about the editor internals!) so I hope this doesn't come across the wrong way, but this seems to be a loose restatement of what many people are empirically finding out by using LLM agents.
Perhaps good for someone just getting their feet wet with these computational objects, but not resolving or explaining things in a clear way, or highlighting trends in research and engineering that might point towards ways forward.
You also have a technical writing no-no where you cite a rather precise and specific study with a paraphrase to support your claims … analogous to saying "Gödel's incompleteness theorem means _something something_ about the nature of consciousness".
A phrase like: “Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on” referencing a precise study … is ambiguous and shoddy technical writing — what exactly does the author mean here? It’s vague.
I think it is even worse here because _the original study_ provides task-specific notions of complexity (a critique of the original study! Won’t different representations lead to different complexity scaling behavior? Of course! That’s what software engineering is all about: I need to think at different levels to control my exposure to complexity)
That, and we also don't only focus on the textual description of a problem when we encounter a problem. We don't see the debugger output and go "how do I make this bad output go away?!?". Oh, I am getting an authentication error. Well, meaybe I should just delete the token check for that code path...problem solved?!
No. Problem very much not-solved. In fact, problem very much very bigger big problem now, and [Grug][1] find himself reaching for club again.
Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem. I am getting an auth error...ok, what happens when the token is verified...oh, look, the problem is not the authentication at all...in fact there is no error! The test was simply bad and tried to call a higher privilege function as a lower privilege user. So, test needs to be fixed. And also, even though it isn't per-se an error, the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".
[1]: https://grugbrain.dev
Even translations between human languages (which allows for ambiguity) can be messy. Imagine if the target language is for a system that will exactly do as told unless someone has qualified those actions as bad.
What an LLM cannot do today is almost irrelevant in the tide of change upon the industry. The fact is, with improvements, it doesn't mean an LLM cannot do it tomorrow.
We can reasonably speak about certain fundamental limitations of LLMs without those being claims about what AI may ever do.
I would agree they fundamentally lack models of the current task and that it is not very likely that continually growing the context will solve that problem, since it hasn't already. That doesn't mean there won't someday be an AI that has a model much as we humans do. But I'm fairly confident it won't be an LLM. It may have an LLM as a component but the AI component won't be primarily an LLM. It'll be something else.
The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination. This presumption seems to be omnipresent in these conversations and I find it so strange. Alpha Zero doesn't model chess "the way we do".
The sooner people stop worrying about a label for what you feel fits LLMs best, the sooner they can find the things they (LLMs) absolutely excel at and improve their (the user's) workflows.
Stop fighting the future. Its not replacing right now. Later? Maybe. But right now the developers and users fully embracing it are experiencing productivity boosts unseen previously.
Language is what people use it as.
Skepticism isn't the same thing as fighting the future.
I will call something AGI when it can reliably solve novel problems it hasn't been pre-trained on. That's my goal post and I haven't moved it.
This is the kind of thing that I disagree with.
You think that LLMs are a bigger productivity boost than moving from moving from physically rewiring computers to using punch cards, from running programs as batch processes with printed output to getting immediate output, from programming in assembly to higher level languages, or even just moving from enterprise Java to Rails?
LLMs are not like this. The fundamental way they operate, the core of their design is faulty. They don't understand rules or knowledge. They can't, despite marketing, really reason. They can't learn with each interaction. They don't understand what they write.
All they do is spit out the most likely text to follow some other text based on probability. For casual discussion about well-written topics, that's more than good enough. But for unique problems in a non-English language, it struggles. It always will. It doesn't matter how big you make the model.
They're great for writing boilerplate that has been written a million times with different variations - which can save programmers a LOT of time. The moment you hand them anything more complex it's asking for disaster.
LLMs may get better, but it will not be what people are clamoring them to be.
No comments yet
Dismissing a concern with “LLMs/AI can’t do it today but they will probably be able to do it tomorrow” isn’t all that useful or helpful when “tomorrow” in this context could just as easily be “two months from now” or “50 years from now”.
I mean, there was and then there wasn't. All of those things are shrinking fast because we handed over control to people who care more about profits than customers because we got too comfy and too cheap, and now right to repair is screwed.
Honestly, I see llm-driven development as a threat to open source and right to repair, among the litany of other things
"Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Lemma: any statement about AI which uses the word "never" to preclude some feature from future realization is false.
Lemma: contemporary implementations have almost always already been improved upon, but are unevenly distributed."
And with fusion, we already have a working prototype (the Sun). And if we could just scale our tech up enough, maybe we’d have usable fusion.
That is too reductive and simply not true. Contemporary critiques of AI include that they waste precious resources (such as water and energy) and accelerate bad environmental and societal outcomes (such as climate change, the spread of misinformation, loss of expertise), among others. Critiques go far beyond “hur dur, LLM can’t code good”, and those problems are both serious and urgent. Keep sweeping critiques under the rug because “they’ll be solved in the next five years” (eternally away) and it may be too late. Critiques have to take into account the now and the very real repercussions already happening.
Put another way, you have an excel roster corresponding to people with accounts where some need to have their account shutdown but you only have their first and last names as identifiers, and the pool is sufficiently large that there are more than one person per a given set of names.
You can't shut down all accounts with a given name, and there is no unique identifier. How do you solve this?
You have to ask and be given that unique identifier that differentiates between the undecidable. Without that, even the person can't do the task.
The person can make guesses, but those guesses are just hallucinations with a significant n probability towards a bad repeat outcome.
At a core level I don't think these type of issues are going to be solved. Quite a lot of people would be unable to solve this and struggle with this example (when not given the answer, or hinted at the solution in the framing of the task; ie when they just have a list of names and are told to do an impossible task).
Concidentally, I encountered the author's work for the first time only a couple of days ago as a podcast guest, he vouches for the "Dirty Code" approach while straw-manning Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
I guess this stuff sells t-shirts and mugs /rant
Have you read Uncle Bob? There's no need to strawman: Bob's examples in Clean Code are absolutely nuts.
Here's a nice writeup that includes one of Bob's examples verbatim in case you've forgotten: https://qntm.org/clean
Here's another: https://gerlacdt.github.io/blog/posts/clean_code/
People complained endlessly about the internet in the early to mid 90s, its slow, static, most sites had under construction signs on them, your phone modem would just randomly disconnect. The internet did suck in alot of ways and yet people kept using it.
Twitter sucked in the mid 2000s, we saw the fail whale weekly and yet people continued to use it for breaking news.
Electric cars sucked, no charging, low distance, expensive and yet no matter how much people complain about them they kept getting better.
Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality and yet people kept using them while they improved.
Always look for the technology that sucks and yet people keep using it because it provides value. LLM's aren't great at alot of tasks and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.
LLM"s amy not be able to build software today, but they are 10x better than where they were in 2022 when we first started using chatgpt. Its pretty reasonable to assume in 5 years they will be able to do these types of development tasks.
A lot of what you described as "sucked" were not seen as "sucking" at the time. Nobody complained about the phones being slow because nobody expected to use phones the way we do today. The internet was slow and less stable but nobody complained because they expected to stream 4k movies and they could not. This is anachronistic.
The fact that we can see how some things improved in X Y manner does not mean that LLMs will improve the way you think they will. Maybe we invent a different technology that does a better job. After it was not that dial up itself became faster and I don't think there were fanatics saying that dialup technology would give us 1Gbp speeds. The problem with AI is that because scaling up compute has provided breakthroughs, some think that somehow with scaling up compute and some technical tricks we can solve all the current problems. I don't think that anybody can say that we cannot invent a technology that can overcome these, but if LLMs is this technology that can just keep scaling has been under doubt. Last year or so there has been a lot of refinement and broadening of applications, but nothing like a breakthrough.
This is a big rewrite of history. Phones took off because before mobile phones the only way to reach a person was to call when they were at home or their office. People were unreachable for timespans that now seem quaint. Texting brought this into async. The "potato" cameras were the advent of people always having a camera with them.
People using the Nokia 3210 were very much not anticipating when their phones would get good, they were already a killer app. That they improved was icing on the cake.
Me, I agree with the author of the article. It's possible that the technology will eventually get there, but it doesn't seem to be there now. And I prefer to make decisions based on present-day reality instead of just assuming that the future I want is the future I'll get.
Ha;) Yes, when you provide examples to prove your point they are, by definition, selective:)
You are free to develop your own mental models of what technology and companies to invest in. I was only trying to share my 20 years of experience with investing to show why you shouldn't discard current technology because of its current limits.
most of the vibe shift I think I’ve seen in the past few months to using LLMs in the context of coding has been improvements in dataset curation and ux, not fundamentally better tech
That doesn't seem unexpected. Any technological leap seem to happen in sigmoid-like steps. When a fruitful approach is discovered we run to it until diminishing returns sets in. Often enough a new approach opens doors to other approaches that builds on it. It takes time to discover the next step in the chain but when we do we get a new sigmoid-like leap. Etc...
I.e. combining new approaches around old school "AI" with GenAI. That's probably not exactly what he's trying to do but maybe somewhere in the ball park.
1 - https://x.com/victortaelin
Thing is breakthroughs are always X years away (50 for fusion power for example).
The only example he gave that actually was kind of a big deal was mobile phones where capacitive touchscreens really did catapult the technology forward. But it is not like celphones weren't already super useful, profitable and getting better over time before capacitive touchscreens were introduced.
Maybe broadband to the internet also qualifies.
Uhhh, no?
In the past month we've had:
- LLMs (3 different models) getting gold at IMO
- gold at IoI
- beat 9/10 human developers at atcode heuristics (optimisations problems) with the single human that actually beat the machine saying he was exhausted and next year it'll probably be over.
- agentic that actually works. And works for 30-90 minute sessions while staying coherent and actually finishing tasks.
- 4-6x reduction in price for top tier (SotA?) models. oAI's "best" model now costs 10$/MTok, while retaining 90+% of their previous SotA models that were 40-60$/MTok.
- several "harnesses" being released by every model provider. Claude code seems to remain the best, but alternatives are popping off everywhere - geminicli, opencoder, qwencli (forked, but still), etc.
- opensource models that are getting close to SotA, again. Being 6-12months behind (depending on who you ask), opensource and cheap to run (~2$/MTok on some providers).
I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where number goes up can only go up so far until it becomes useless. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic whatever is what matters. And those capabilities have only improved. And will continue to improve, with better data, better signals, better training recipes.
Why do you think eveyr model provider is heavily subsidising coding right now? They all want that sweet sweet data & signals, so they can improve their models.
A (bad) analogy would be that I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious - but the improvement when going from IQ 70 to Einstein are much harder to assess and arguably not that useful for most tasks.
I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try to use the older model again and be quite surprised at how much worse it is.
Specifically, to me the limitation of LLMs is discovering new knowledge and being able to reason about information they haven't seen before. LLMs still fail at things like counting the number of b's in the word blueberry or not getting distracted by inserting random cat facts in word problems (both issues I've seen appear in the last month)
I don't mean that to say they're a useless tool, I'm just not into the breathless hype.
If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
Granted, it needs careful planning for CLAUDE.md, and all issues and feature requests need a lot of in-depth specifics, but it all works. So I am not 100% convinced by this piece. I'd say it's definitely not easy to get coding agents to manage and write software effectively, and especially hard to do so in existing projects, but my experience has been across that entire spectrum. I have been sorely disappointed in coding agents and even abandoned a bunch of projects and dozens of pull requests, but I have also seen them work.
you can check out that project here: https://github.com/julep-ai/steadytext/
The more I use Claude Code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.
My gut feeling is that this problem won't be solved until some new architecture is invented, on the scale of the transformer, which allows for short-term context, long-term context, and self-modulation of model weights (to mimic "learning"). (Disclaimer: hobbyist with no formal training in machine learning.)
[0]: https://news.ycombinator.com/item?id=44798166
LLM techniques allow us to extract rules from text and other data. But those data are not representative of a coherent system. The result itself is incoherent and lacks anything that wasn't part of the data. And that's normal.
It's the same as having a mathematical function. Every point that it maps to is meaningful; everything else may as well not exist.
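A rough way to picture that analogy (my own sketch, with Rust assumed as the example language): a lookup table built from data behaves like a function that is only defined at the points it was built from; ask it about anything else and there is simply nothing there.

```rust
use std::collections::HashMap;

fn main() {
    // A "function" defined only at the points it was built from.
    let f: HashMap<i32, i32> = HashMap::from([(1, 10), (2, 20), (3, 30)]);

    // Points covered by the data are meaningful...
    println!("{:?}", f.get(&2)); // Some(20)

    // ...everything outside it may as well not exist.
    println!("{:?}", f.get(&7)); // None
}
```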
> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I see this constantly with mediocre developers. Flailing, trying different things, copy-pasting from StackOverflow without understanding, ultimately deciding the compiler must have a bug, or cosmic rays are flipping bits.
That and other tricks have only made me slightly less frustrated, though.
You can let it do the grunt coding, and a lot of the low-level analysis and testing, but you absolutely need to be the one in charge of the design.
It frankly gives me more time to think about the bigger picture within the amount of time I have to work on a task, and I like that side of things.
There's definitely room for a massive amount of improvement in how the tool presents changes and suggestions to the user. It needs to be far more interactive.
My experience with prompting LLMs for codegen is really not much different from my experience with querying search engines - you have to understand how to ‘speak the language’ of the corpus being searched, in order to find the results you’re looking for.
I keep saying it and no one really listens: AI really is advanced autocomplete. It's not reasoning or thinking. You will use the tool better if you understand what it can't do.
It's a good tool when you use it within its limitations.
One thing I've found that helps is using the "Red-Green-Refactor" language. We're in RED phase - the test should fail. We're in GREEN phase - make this test pass with minimal code. We're in REFACTOR phase - improve the code without breaking tests.
This helps the LLM understand the TDD mental model rather than just seeing "broken code" that needs fixing.
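To make that concrete, here's a minimal sketch of what each phase might produce (Rust used as the example language; the `slugify` function and its test are made up for illustration):

```rust
// RED phase: a test that should fail, because `slugify` doesn't exist yet
// (or returns the wrong thing).
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn slugify_lowercases_and_replaces_spaces() {
        assert_eq!(slugify("Hello World"), "hello-world");
    }
}

// GREEN phase: the minimal code that made the test pass was
//     input.to_lowercase().replace(' ', "-")
//
// REFACTOR phase: improve it without breaking the test, e.g. collapse
// runs of whitespace before joining.
pub fn slugify(input: &str) -> String {
    input
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect::<Vec<_>>()
        .join("-")
}
```

Naming the phase in the prompt gives the model a reason not to "fix" the deliberately failing test during RED.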
> AI is awesome for coding! [Opus 4]
> No, AI sucks for coding and it messed everything up! [4o]
Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.
They need to mention significantly more than that: https://dmitriid.com/everything-around-llms-is-still-magical...
--- start quote ---
Do we know which projects people work on? No
Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No
Do we know the level of expertise the people have? No.
Is the expertise in the same domain, codebase, language that they apply LLMs to? We don't know.
How much additional work did they have reviewing, fixing, deploying, finishing etc.? We don't know.
--- end quote ---
And that's just the tip of the iceberg. And that iceberg comes before we hit another one: we're trying to blindly reverse-engineer a non-deterministic black box running inside a provider's black box.
I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.
I don't want a chat window.
I want AI workflows as part of my IDE, which Visual Studio, IntelliJ, and Android Studio are finally going after.
I want voice-controlled actions in my native language.
Knowledge across everything on the project for doing code refactorings, static analysis with an AI feedback loop, generating UI from handwritten sketches, programming on the go using handwriting, source control commit messages generated from code changes,...
What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.
This is an age-old thing: most people don't think it really matters how fast you can type. I suppose the steelman is, most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there are all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter, because the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't get you to the goal faster, since finding the overall design is the limiting factor.
But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.
It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world: you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.
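For concreteness, here's a sketch of the kind of change being described (Rust assumed as an example language; the enum and function are made up): add one variant, and the compiler reports a non-exhaustive `match` at every site until each one handles it.

```rust
enum PaymentMethod {
    Card,
    BankTransfer,
    Voucher, // newly added variant
}

fn fee_percent(method: &PaymentMethod) -> f64 {
    match method {
        PaymentMethod::Card => 2.9,
        PaymentMethod::BankTransfer => 0.5,
        // Before this arm existed, the compiler flagged the match as
        // non-exhaustive; adding the arm is the mechanical fix.
        PaymentMethod::Voucher => 0.0,
    }
}

fn main() {
    println!("{}", fee_percent(&PaymentMethod::Voucher)); // 0
}
```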
LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.
The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.
It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.
I always have a whole bunch of things I want to change in the codebase I'm working on, and the bottleneck is review, not me changing that code.
LLMs also help you test.
That's what I've found as well. Start describing or writing a function, include the whole file for context and it'll do its job. Give it a whole codebase and it will just wander in the woods burning tokens for ten minutes trying to solve dependencies.
> Recency bias: They suffer a strong recency bias in the context window.
> Hallucination: They commonly hallucinate details that should not be there.
To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgot a crucial detail that I learned months ago that would have helped me debug something much faster than if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.
I find Sonnet frequently loses the plot, but Opus can usually handle it (with sufficient clarity in prompting).
Taken to the next step, recognizing this makes the investment in such a moonshot pipe dream (overcoming these inherent problems in a deterministic way) recklessly negligent.
Improvements in model performance seem to be approaching the peak rather than demonstrating exponential gains. Is the quote above where we land in the end?
Years ago I gave up compiling these large applications altogether. I compiled Firefox via FreeBSD's (v8.x) ports system, and that alone was a nightmare.
I cannot imagine what it would be like to compile GNOME3, KDE, or LibreOffice. Emacs is the largest thing I compile now.
While a SHA256 collision hasn't yet been found for a Nix package, by the pigeonhole principle collisions exist, and the computer would not be able to decide between the two packages in such a collision, leading to system-level failure with errors that have no link to their cause (due to the properties involved, and longstanding CS problems in computation).
These things, generally speaking, have properties of mathematical chaos, a state that is inherently unknowable/unpredictable and that no admin would ever approach or touch, because it's unmaintainable. The normally tightly coupled error-handling code is no longer tightly coupled, because it requires matching a determinable state (CS computation problems, halting/decidability).
Non-deterministic failure domains are the most costly problems to solve, because troubleshooting, which leverages properties of determinism, won't work.
This leaves you only a strategy of guess and check; which requires intimate knowledge of the entire system stack without abstractions present.
Perhaps good for someone just getting their feet wet with these computational objects, but not resolving or explaining things in a clear way, or highlighting trends in research and engineering that might point towards ways forward.
You also have a technical writing no-no where you cite a rather precise and specific study with a paraphrase to support your claims … analogous to saying "Gödel's incompleteness theorem means _something something_ about the nature of consciousness".
A phrase like: “Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on” referencing a precise study … is ambiguous and shoddy technical writing — what exactly does the author mean here? It’s vague.
I think it is even worse here because _the original study_ provides task-specific notions of complexity (a critique of the original study! Won’t different representations lead to different complexity scaling behavior? Of course! That’s what software engineering is all about: I need to think at different levels to control my exposure to complexity)