(Works on older browsers and doesn't require JavaScript except to get past CloudSnare).
dirkc · 1h ago
I have a friend that always says "innovation happens at the speed of trust". Ever since GPT3, that quote comes to mind over and over.
Verification has a high cost and trust is the main way to lower that cost. I don't see how one can build trust in LLMs. While they are extremely articulate in both code and natural language, they will also happily go down fractal rabbit holes and show behavior I would consider malicious in a person.
lubujackson · 1m ago
We never can have total trust in LLM output, but we can certainly sanitize it and limit it's destructive range. Just like we sanitize user input and defend with pentests and hide secrets in dot files, we will eventually resolve to "best practices" and some "SOC-AI compliance" standard down the road.
It's just too useful to ignore, and trust is always built, brick by brick. Let's not forget humans are far from reliable anyway. Just like with driving cars, I imagine producing less buggy code (along predefined roads) will soon outpace humans. Then it is just blocking and tackling to improve complexity.
whiplash451 · 23m ago
> "innovation happens at the speed of trust"
You'll have to elaborate on that. How much trust was there in electricity, flight and radioactivity when we discovered them?
In science, you build trust as you go.
agent281 · 9m ago
Have you heard of the War of the Currents?
> As the use of AC spread rapidly with other companies deploying their own systems, the Edison Electric Light Company claimed in early 1888 that high voltages used in an alternating current system were hazardous, and that the design was inferior to, and infringed on the patents behind, their direct current system.
> In the spring of 1888, a media furor arose over electrical fatalities caused by pole-mounted high-voltage AC lines, attributed to the greed and callousness of the arc lighting companies that operated them.
Author here:
I quite like that quote. A very succinct way of saying what took me a few paragraphs.
This new world of having to verify every single thing at all points is quite exhausting and frankly pretty slow.
Herring · 54m ago
So get another LLM to do it. Judging is considerably easier [For LLMs] than writing something from scratch, so LLM judges will always have that edge in accuracy. Equivalently, I also like getting them to write tons of tests to build trust in correct behavior.
acedTrex · 50m ago
> Judging is considerably easier than writing something from scratch
I don't agree with this at all. Writing new code is trivially easy, to do a full in depth review takes significantly more brain power. You have to fully ascertain and insert yourself into someone elses thought process. Thats way more work than utilizing your own thought process.
Herring · 39m ago
Sorry, I should have been more specific. I meant LLMs are more reliable and accurate at judging than at generating from scratch.
They basically achieve over 80% agreement with human evaluators [1]. This level of agreement is similar to the consensus rate between two human evaluators, making LLM-as-a-judge a scalable and reliable proxy for human judgment.
Oh goodness that's like trusting one kid to tell you whether or not his friend lied.
In matters where trust matters, it's a recipe for disaster.
stavros · 5h ago
I don't understand the premise. If I trust someone to write good code, I learned to trust them because their code works well, not because I have a theory of mind for them that "produces good code" a priori.
If someone uses an LLM and produces bug-free code, I'll trust them. If someone uses an LLM and produces buggy code, I won't trust them. How is this different from when they were only using their brain to produce the code?
acedTrex · 2h ago
Author here:
Essentially the premise is that in medium trust environments like very large teams or low trust environments like an open source project.
LLMs make it very difficult to make an immediate snap judgement about the quality of the dev that submitted the patch based solely on the code itself.
In the absence of being able to ascertain the type of person you are dealing with you have to fall back too "no trust" and review everything with a very fine tooth comb. Essentially there are no longer any safe "review shortcuts" and that can be painful in places that relied on those markers to grease the wheels so to speak.
Obviously if you are in an existing competent high trust team then this problem does not apply and most likely seems completely foreign as a concept.
lxgr · 2h ago
> LLMs make it very difficult to make an immediate snap judgement about the quality [...]
That's the core of the issue. It's time to say goodbye to heuristics like "the blog post is written in eloquent, grammatical English, hence the point its author is trying to make must be true" or "the code is idiomatic and following all code styles, hence it must be modeling the world with high fidelity".
Maybe that's not the worst thing in the world. I feel like it often made people complacent.
acedTrex · 1h ago
> Maybe that's not the worst thing in the world. I feel like it often made people complacent.
For sure, in some ways perhaps reverting to a low trust environment might improve quality in that it now forces harsher/more in depth reviews.
That however doesn't make the requirement less exhausting for people previously relying heavily on those markers to speed things up.
Will be very interesting to see how the industry standardizes around this. Right now it's a bit of the wild west. Maybe people in ten years will look back at this post and think "what do you mean you judged people based on the code itself that's ridiculous"
furyofantares · 1h ago
I think you're unfair to the heuristics people use in your framing here.
You said "hence the point its author is trying to make must be true" and "hence it must be modeling the world with high fidelity".
But it's more like "hence the author is likely competent and likely put in a reasonable effort."
When those assumptions hold, putting in a very deep review is less likely to pay off. Maybe you are right that people have been too complacent to begin with, I don't know, but I don't think you've framed it fairly.
tempodox · 14m ago
Anyway, “following all code styles” is just a fancy way of saying “adheres to fashion”. What meaningful conclusions can you draw from that?
sim7c00 · 1h ago
its about the quality of the code, not the quality of the dev.
you might think it's related, but it's not.
a dev can write piece of good, and piece of bad code. so per code, review the code. not the dev!
haswell · 1h ago
> its about the quality of the code, not the quality of the dev. you might think it's related, but it's not.
I could not disagree more. The quality of the dev will always matter, and has as much to do with what code makes it into a project as the LLM that generated it.
An experienced dev will have more finely tuned evaluation skills and will accept code from an LLM accordingly.
An inexperienced or “low quality” dev may not even know what the ideal/correct solution looks like, and may be submitting code that they do not fully understand. This is especially tricky because they may still end up submitting high quality code, but not because they were capable of evaluating it as such.
You could make the argument that it shouldn’t matter who submits the code if the code is evaluated purely on its quality/correctness, but I’ve never worked in a team that doesn’t account for who the person is behind the code. If its the grizzled veteran known for rarely making mistakes, the review might look a bit different from a review for the intern’s code.
NeutralCrane · 1h ago
> An experienced dev will have more finely tuned evaluation skills and will accept code from an LLM accordingly.
An inexperienced or “low quality” dev may not even know what the ideal/correct solution looks like, and may be submitting code that they do not fully understand. This is especially tricky because they may still end up submitting high quality code, but not because they were capable of evaluating it as such.
That may be true, but the proxy for assessing the quality of the dev is the code. No one is standing over you as you code your contribution to ensure you are making the correct, pragmatic decisions. They are assessing the code you produce to determine the quality of your decisions, and over time, your reputation as a dev is made up of the assessments of the code you produced.
The point is that an LLM in no way changes this. If a dev uses an LLM in a non-pragmatic way that produces bad code, it will erode trust in them. The LLM is a tool, but trust still factors in to how the dev uses the tool.
acedTrex · 1h ago
> you might think it's related, but it's not.
In my experience they very much are related. High quality devs are far more likely to output high quality working code. They test, they validate, they think, ultimately they care.
In that case that you are reviewing a patch from someone you have limited experience with, it previously was feasible to infer the quality of the dev from the context of the patch itself and the surrounding context by which it was submitted.
LLMs make that judgement far far more difficult and when you can not make a snap judgement you have to revert your review style to very low trust in depth review.
No more greasing the wheels to expedite a process.
alganet · 5h ago
> I learned to trust them because their code works well
There's so much more than "works well". There are many cues that exist close to code, but are not code:
I trust more if the contributor explains their change well.
I trust more if the contributor did great things in the past.
I trust more if the contributor manages granularity well (reasonable commits, not huge changes).
I trust more if the contributor picks the right problems to work on (fixing bugs before adding new features, etc).
I trust more if the contributor proves being able to maintain existing code, not just add on top of it.
I trust more if the contributor makes regular contributions.
And so on...
acedTrex · 1h ago
Author here:
Spot on, there are so many little things that we as humans use as subtle verification steps to decide how much scrutiny various things require. LLMs are not necessarily the death of that concept but they do make it far far harder.
moffkalast · 5h ago
It's easy to get overconfident and not test the LLM's code enough when it worked fine for a handful of times in a row, and then you miss something.
The problem is often really one of miscommunication, the task may be clear to the person working on it, but with frequent context resets it's hard to make sure the LLM also knows what the whole picture is and they tend to make dumb assumptions when there's ambiguity.
The thing that 4o does with deep research where it asks for additional info before it does anything should be standard for any code generation too tbh, it would prevent a mountain of issues.
stavros · 4h ago
Sure, but you're still responsible for the quality of the code you commit, LLM or no.
acedTrex · 1h ago
In an ideal world you would think everyone see's it this way. But we are starting to see an uptick in "I don't know the LLMs said do that."
As if that is a somehow exonerating sentence.
NeutralCrane · 1h ago
It isn’t, and that is a sign of a bad dev you shouldn’t trust.
LLMs are a tool, just like any number of tools that are used by developers in modern software development. If a dev doesn’t use the tool properly, don’t trust them. If they do, trust them. The way to assess if they use it properly is in the code they produce.
Your premise is just fundamentally flawed. Before LLMs, the proof of a quality dev was in the pudding. After LLMs, the proof of a quality dev remains in the pudding.
acedTrex · 59m ago
> Your premise is just fundamentally flawed. Before LLMs, the proof of a quality dev was in the pudding. After LLMs, the proof of a quality dev remains in the pudding.
Indeed it does, however what the "proof" is has changed. In terms of sitting down and doing a full, deep review, tracing every path validating every line etc. Then for sure, nothing has changed.
However, at least in my experience, pre LLM those reviews were not EVERY CASE there were many times I elided parts of a deep review because i saw markers in the code that to me showed competency, care etc. With those markers there are certain failure conditions that can be deemed very unlikely to exist and therefore the checks can be skipped. Is that ALWAYS the correct assumption? Absolutely not but the more experienced you are the less false positives you get.
LLMs make those markers MUCH harder to spot, so you have to fall back to doing a FULL indepth review no matter what. You have to eat ALL the pudding so to speak.
For people that relied on maybe tasting a bit of the pudding then assuming based on the taste the rest of the pudding probably tastes the same its rather jarring and exhausting to now have to eat all of it all the time.
NeutralCrane · 45m ago
> However, at least in my experience, pre LLM those reviews were not EVERY CASE there were many times I elided parts of a deep review because i saw markers in the code that to me showed competency, care etc.
That was never proof in the first place.
If anything, someone basing their trust in a submission on anything other than the code itself is far more concerning and trust-damaging to me than if the submitter has used an LLM.
acedTrex · 42m ago
> That was never proof in the first place.
I mean, it's not necessarily HARD proof but it has been a reliable enough way to figure out which corners to cut. You can of course say that no corners should ever be cut and while that is true in an ideal sense. In the real world things always get fuzzy.
Maybe the death of cutting corners is a good thing overall for output quality. Its certainly exhausting on the people tasked with doing the reviews however.
breuleux · 7m ago
I don't know about that. Cutting corners will never die.
Ultimately I don't think the heuristics would change all that much, though. If every time you review a person's PR, almost everything is great, they are either not using AI or they are vetting what the AI writes themselves, so you can trust them as you did before. It may just take some more PRs until that's apparent. Those who submit unvetted slop will have to fix a lot of things, and you can crank up the heat on them until they do better, if they can. (The "if they can" is what I'm most worried about.)
moffkalast · 3h ago
Of course you are, but it's sort of like how people are responsible their Tesla driving on autopilot, which then suddenly swerves into a wall and disengages two seconds before impact. The process forces you to make mistakes you wouldn't normally ever do or even consider a possibility.
JohnKemeny · 1h ago
To add to devs and Teslas, you have journalists using LLMs writing summaries, lawyers using LLMs writing dispositions, doctors using LLMs writing their patient entries, and law enforcement using LLMs writing their forensics report.
All of these make mistakes (there are documented incidents).
And yes, we can counter with "the journalists are dumb for not verifying", "the lawyers are dumb for not checking", etc., but we should also be open for the fact that these are intelligent and professional people who make mistakes because they were mislead by those who sell LLMs.
somewhereoutth · 4h ago
Because when people use LLMs, they are getting the tool to do the work for them, not using the tool to do the work. LLMs are not calculators, nor are they the internet.
A good rule of thumb is to simply reject any work that has had involvement of an LLM, and ignore any communication written by an LLM (even for EFL speakers, I'd much rather have your "bad" English than whatever ChatGPT says for you).
I suspect that as the serious problems with LLMs become ever more apparent, this will become standard policy across the board. Certainly I hope so.
stavros · 4h ago
Well, no, a good rule of thumb is to expect people to write good code, no matter how they do it. Why would you mandate what tool they can use to do it?
somewhereoutth · 4h ago
Because it pertains to the quality of the output - I can't validate every line of code, or test every edge case. So if I need a certain level of quality, I have to verify the process of producing it.
This is standard for any activity where accuracy / safety is paramount - you validate the process. Hence things like maintenance logs for airplanes.
acedTrex · 2h ago
> So if I need a certain level of quality, I have to verify the process of producing it
Precisely this, and this is hardly a unique to software requirement. Process audits are everywhere in engineering. Previously you could infer the process of producing some code by simply reading the patch and that generally would tell you quite a bit about the author itself. Using advanced and niche concepts with imply a solid process with experience backing it. Which would then imply that certain contextual bugs are unlikely so you skip looking for them.
My premise in the blog is basically that "Well now I have go do a full review no matter what the code itself tells me about the author."
flir · 2h ago
> A good rule of thumb is to simply reject any work that has had involvement of an LLM,
How are you going to know?
No comments yet
sebmellen · 2h ago
You’re being unfairly downvoted. There is a plague of well-groomed incoherency in half of the business emails I receive today. You can often tell that the author, without wrestling with the text to figure out what they want to say, is a kind of stochastic parrot.
This is okay for platitudes, but for emails that really matter, having this messy watercolor kind of writing totally destroys the clarity of the text and confuses everyone.
To your point, I’ve asked everyone on my team to refrain from writing words (not code) with ChatGPT or other tools, because the LLM invariably leads to more complicated output than the author just badly, but authentically, trying to express themselves in the text.
jimbokun · 19m ago
I find the idea of using LLMs for emails confusing.
Surely it's less work to put the words you want to say into an email, rather than craft a prompt to get the LLM to say what you want to say, and iterate until the LLM actually says it?
acedTrex · 2h ago
Yep, I have come to really dislike LLMs for documentation as it just reads wrong to me and I find so often misses the point entirely. There is so much nuance tied up in documentation and much of it is in what is NOT said as much as what is said.
The LLMs struggle with both but REALLY struggle with figuring out what NOT to say.
short_sells_poo · 52m ago
I wonder if this is to a large degree also because when we communicate with humans, we take cues from more than just the text. The personality of the author will project into the text they write, and assuming you know this person at least a little bit, these nuances will give you extra information.
mexicocitinluez · 2h ago
>Because when people use LLMs, they are getting the tool to do the work for them, not using the tool to do the work.
What? How on god's green earth could you even pretend to know how all people are using these tools?
> LLMs are not calculators, nor are they the internet.
Umm, okay? How does that make them less useful?
I'm going to give you a concrete example of something I just did and let you try and do whatever mental gymnastics you have to do to tell me it wasn't useful:
Medicare requires all new patients receiving home health treatment go through a 100+ question long form. This form changes yearly, and it's my job to implement the form into our existing EMR. Well, part of that is creating a printable version. Guess what I did? I uploaded the entire pdf to Claude and asked it to create a print-friendly template using Cottle as the templating language in C#. It generated the 30 page print preview in a minute. And it took me about 10 more minutes to clean up.
> I suspect that as the serious problems with LLMs become ever more apparent, this will become standard policy across the board. Certainly I hope so.
The irony is that they're getting better by the day. That's not to say people don't use them for the wrong applications, but the idea that this tech is going to be banned is absurd.
> A good rule of thumb is to simply reject any work that has had involvement of an LLM
Do you have any idea how ridiculous this sounds to people who actually use the tools? Are you going to be able to hunt down the single React component in which I asked it to convert the MUI styles to tailwind? How could you possibly know? You can't.
taneq · 5h ago
If you have a long standing, effective heuristic that “people with excellent, professional writing are more accurate and reliable than people with sloppy spelling and punctuation” then the appearance of a semi-infinite group of ‘people’ writing well presented, convincingly worded articles which nonetheless are riddled with misinformation, hidden logical flaws, and inconsistencies, you’re gonna end up trusting everyone a lot less.
It’s like if someone started bricking up tunnel entrances and painting ultra realistic versions of the classic Road Runner tunnel painting on them, all over the place. You’d have to stop and poke every underpass with a stick just to be sure.
stavros · 5h ago
Sure, your heuristic no longer works, and that's a bit inconvenient. We'll just find new ones.
oasisaimlessly · 1h ago
"A bit inconvenient" might be the understatement of the year. If information requires say, 2x the time to validate, the utility of the internet is halved.
sebmellen · 2h ago
Yeah, now you need to be able to demonstrate verbal fluency. The problem is, that inherently means a loss of “trusted anonymous” communication, which is particularly damaging to the fiber of the internet.
acedTrex · 2h ago
Author here:
Precisely, in the age where it is very difficult to ascertain the type or quality of skills you are interacting with say in a patch review or otherwise you frankly have to "judge" someone and fallback to suspicion and full verification.
mexicocitinluez · 2h ago
It's not.
What you're seeing now is people who once thought and proclaimed these tools as useless now have to start to walk back their claims with stuff like this.
It does amaze me that the people who don't use these tools seem to have the most to say about them.
acedTrex · 1h ago
Author here:
For what it's worth I do actually use the tools albeit incredibly intentionally and sparingly.
I see quite a few workflows and tasks that they can be a value add on, mostly outside of the hotpath of actual code generation but still quite enticing. So much so in fact I'm working on my own local agentic tool with some self hosted ollama models. I like to think that i am at least somewhat in the know on the capabilities and failure points of the latest LLM tooling.
That however doesn't change my thoughts on trying to ascertain if code submitted to me deserves a full indepth review or if I can maybe cut a few corners here and there.
mexicocitinluez · 1h ago
> That however doesn't change my thoughts on trying to ascertain if code submitted to me deserves a full indepth review or if I can maybe cut a few corners here and there.
How would you even know? Seriously, if I use Chatgpt to generate a one-off function for a feature I'm working on that searches all classes for one that inherits a specific interface and attribute, are you saying you'd be able to spot the difference?
And what does it even matter it works?
What if I use Bolt to generate a quick screen for a PoC? Or use Claude to create a print-preview with CSS of a 30 page Medicare form? Or converting a component's styles MUI to tailwind? What if all these things are correct?
This whole OS repos will ban LLM-generated code is a bit absurd.
> or what it's worth I do actually use the tools albeit incredibly intentionally and sparingly.
How sparingly? Enough to see how it's constantly improving?
acedTrex · 51m ago
> How would you even know? Seriously, if I use Chatgpt to generate a one-off function for a feature I'm working on that searches all classes for one that inherits a specific interface and attribute, are you saying you'd be able to spot the difference?
I don't know, thats the problem. As a result, because I can't know I have to now do full in depth reviews no matter what. Which is the "judging" I tongue in cheek talk about in the blog.
> How sparingly? Enough to see how it's constantly improving?
Nearly daily, to be honest I have not noticed too much improvement year over year in regards to how they fail. They still break in the exact same dumb ways now as they did before. Sure they might generate correct syntactic code reliably now and it might even work. But they still consistently fail to grok the underlying reasoning for things existing.
But I am writing my own versions of these agentic systems to use for some rote menial stuff.
axegon_ · 5h ago
That is already the case for me. The amount of times I've read "apologies for the oversight, you are absolutely correct" is staggering: 8 or 9 out of 10 times. Meanwhile I constantly see people mindlessly copy paying llm generated code and subsequently furious when it doesn't do what they expected it to do. Which, btw, is the better option: I'd rather have something obviously broken as opposed to something seemingly working.
devjab · 3h ago
Are you using the LLM's through a browser chatbot? Because the AI-agents we use with direct code-access aren't very chatty. I'd also argue that they are more capable than a lot of junior programmers, at least around here. We're almost at a point where you can feed the agents short specific tasks, and they will perform them well enough to not really require anything outside of a code review.
That being said, the prediction engine still can't do any real engineering. If you don't specifically task them with using things like Python generators, you're very likely to have a piece of code that eats up a gazillion memory. Which unfortunately don't set them appart from a lot of Python programmers I know, but it is an example of how the LLM's are exactly as bad as you mention. On the positive side, it helps with people actually writing the specification tasks in more detail than just "add feature".
Where AI-agents are the most useful for us is with legacy code that nobody prioritise. We have a data extractor which was written in the previous millennium. It basically uses around two hunded hard-coded coordinates to extact data from a specific type of documents which arrive by fax. It's worked for 30ish years because the documents haven't changed... but it recently did, and it took co-pilot like 30 seconds to correct the coordinates. Something that would've likely taken a human a full day of excruciating boredom.
I have no idea how our industry expect anyone to become experts in the age of vibe coding though.
furyofantares · 1h ago
> Because the AI-agents we use with direct code-access aren't very chatty.
Every time I tell claude code something it did is wrong, or might be wrong, or even just ask a leading question about a potential bug it just wrote, it leads with "You're absolutely correct!" before even invoking any tools.
Maybe you've just become used to ignoring this. I mostly ignore it but it is a bit annoying when I'm trying to use the agent to help
me figure out if the code it wrote is correct, so I ask it some question it should be capable of helping with and it leads with "you're absolutely correct".
I didn't make a proposition that can be correct or not, and it didn't do any work yet to to investigate my question - it feels like it has poisoned its own context by leading with this.
gibspaulding · 50m ago
> Where AI-agents are the most useful for us is with legacy code
I’d love to hear more about your workflow and the code base you’re working in. I have access to Amazon Q (which it looks like is using Claude Sonnet 4 behind the scenes) through work, and while I found it very useful for Greenfield projects, I’ve really struggled using it to work on our older code bases. These are all single file 20,000 to 100,000 line C modules with lots of global variables and most of the logic plus 25 years of changes dumped into a few long functions. It’s hard to navigate for a human, but seems to completely overwhelm Q’s context window.
Do other Agents handle this sort of scenario better, or are there tricks to making things more manageable? Obviously re-factoring to break everything up into smaller files and smaller functions would be great, but that’s just the sort of project that I want to be able to use the AI for.
teeray · 2h ago
> Because the AI-agents we use with direct code-access aren't very chatty.
So they’re even more confident in their wrongness
autobodie · 3h ago
In my experience, LLMs are extremely inclined to modify code just to pass tests instead of meeting requirements.
mexicocitinluez · 2h ago
> 8 or 9 out of 10 times.
Not they don't. This is 100% a made up statistic.
pu_pe · 3h ago
> While the industry leaping abstractions that came before focused on removing complexity, they did so with the fundamental assertion that the abstraction they created was correct. That is not to say they were perfect, or they never caused bugs or failures. But those events were a failure of the given implementation a departure from what the abstraction was SUPPOSED to do, every mistake, once patched led to a safer more robust system. LLMs by their very fundamental design are a probabilistic prediction engine, they merely approximate correctness for varying amounts of time.
I think what the author misses here is that imperfect, probabilistic agents can build reliable, deterministic systems. No one would trust a garbage collection tool based on how reliable the author was, but rather if it proves it can do what it intends to do after extensive testing.
I can certainly see an erosion of trust in the future, with the result being that test-driven development gains even more momentum. Don't trust, and verify.
lbalazscs · 1h ago
It's naive to hope that automatic tests will find all problems. There are several types of problems that are hard to detect automatically: concurrency problems, resource management errors, security vulnerabilities, etc.
An even more important question: who tests the tests themselves? In traditional development, every piece of logic is implemented twice: once in the code and once in the tests. The tests checks the code, and in turn, the code implicitly checks the tests. It's quite common to find that a bug was actually in the tests, not the app code. You can't just blindly trust the tests, and wait until your agent finds a way to replicate a test bug in the code.
acedTrex · 2h ago
> I think what the author misses here is that imperfect, probabilistic agents can build reliable, deterministic systems. No one would trust a garbage collection tool based on how reliable the author was, but rather if it proves it can do what it intends to do after extensive testing.
> but rather if it proves it can do what it intends to do after extensive testing.
Author here: Here I was less talking about the effectiveness of the output of a given tool and more so about the tool itself.
To take your garbage collection example, sure perhaps an agentic system at some point can spin some stuff up and beat it into submission with test harnesses, bug fixes etc.
But, imagine you used the model AS the garbage collector/tool, in that say every sweep you simply dumped the memory of the program into the model and told it to release the unneeded blocks. You would NEVER be able to trust that the model itself correctly identifies the correct memory blocks and no amount of "patching" or "fine tuning" would ever get you there.
With other historical abstractions like say jvm, if the deterministic output, in this case the assembly the jit emits is incorrect that bug is patched and the abstraction will never have that same fault again. not so with LLMs.
To me that distinction is very important when trying to point out previous developer tooling that changed the entire nature of the industry. It's not to say I do not think LLMs will have a profound impact on the way things work in the future. But I do think we are in completely uncharted territory with limited historical precedence to guide us.
geor9e · 1h ago
They changed the headline to "Yes, I will judge you for using AI..." so I feel like I got the whole story already.
No comments yet
cheriot · 8h ago
> promises that the contributed code is not the product of an LLM but rather original and understood completely.
> require them to be majority hand written.
We should specify the outcome not the process. Expecting the contributor to understand the patch is a good idea.
> Juniors may be encouraged/required to elide LLM-assisted tooling for a period of time during their onboarding.
This is a terrible idea. Onboarding is a lot of random environment setup hitches that LLMs are often really good at. It's also getting up to speed on code and docs and I've got some great text search/summarizing tools to share.
namenotrequired · 8h ago
> LLMs … approximate correctness for varying amounts of time. Once that time runs out there is a sharp drop off in model accuracy, it simply cannot continue to offer you an output that even approximates something workable. I have taken to calling this phenomenon the "AI Cliff," as it is very sharp and very sudden
I’ve never heard of this cliff before. Has anyone else experienced this?
gwd · 5h ago
I experience it pretty regularly -- once the complexity of the code passes a certain threshold, the LLM can't keep everything in its head and starts thrashing around. Part of my job working with the LLM is to manage the complexity it sees.
And one of the things with current generators is that they tend to make things more complex over time, rather than less. It's always me prompting the LLM to refactor things to make it simpler, or doing the refactoring once it's gotten to complex for the LLM to deal with.
So at least with the current generation of LLMs, it seems rather inevitable that if you just "give LLMs their head" and let them do what they want, eventually they'll create a giant Rube Goldberg mess that you'll have to try to clean up.
ETA: And to the point of the article -- if you're an old salt, you'll be able to recognize when the LLM is taking you out to sea early, and be able to navigate your way back into shallower waters even if you go out a bit too far. If you're a new hand, you'll be out of your depth and lost at sea before you know it's happened.
windward · 2h ago
I've seen it referred to as 'context drunk'.
Imagine that you have your input to the context, 10000 tokens that are 99% correct. Each time the LLM replies it adds 1000 tokens that are 90% correct.
After some back-and-forth of you correcting the LLM, its context window is mostly its own backwash^Woutput. Worse, the error compounds because the 90% that is correct is just correct extrapolation of an argument about incorrect code, and because the LLM ranks more recent tokens as more important.
The same problem also shows up in prose.
Workaccount2 · 1h ago
I call it context rot. As the context fills up the quality of output erodes with it. The rot gets even worse or progresses faster the more spurious or tangential discussion is in context.
This is also can be made much worse by thinking models, as their CoT is all in context, and if there thoughts really wander it just plants seeds of poison feeding the rot. I really wish they can implement some form of context pruning, so you can nip irrelevant context when it forms.
In the meantime, I make summaries and carry it to a fresh instance when I notice the rot forming.
bubblyworld · 5h ago
I've only experienced this while vibe coding through chat interfaces, i.e. in the complete absence of feedback loops. This is much less of a problem with agentic tools like claude code/codex/gemini cli, where they manage their own context windows and can run your dev tooling to sanity check themselves as they go.
Paradigma11 · 6h ago
If the context gets to big or otherwise poisoned you have to restart the chat/agent. A bit like windows of old. This trains you to document the current state of your work so the new agent can get up to speed.
Kuinox · 7h ago
I'm doing my own procedurally generated benchmark.
I can make the problem input bigger as I want.
Each LLM have a different thresholf for each problem, when crossed the performance of the LLM collapse.
sandspar · 8h ago
I'm not sure. Is he talking about context poisoning?
Syzygies · 5h ago
One can find opinions that Claude Code Opus 4 is worth the monthly $200 I pay for Anthropic's Max plan. Opus 4 is smarter; one either can't afford to use it, or can't afford not to use it. I'm in the latter group.
One feature others have noted is that the Opus 4 context buffer rarely "wears out" in a work session. It can, and one needs to recognize this and start over. With other agents, it was my routine experience that I'd be lucky to get an hour before having to restart my agent. A reliable way to induce this "cliff" is to let AI take on a much too hard problem in one step, then flail helplessly trying to fix their mess. Vibe-coding an unsuitable problem. One can even kill Opus 4 this way, but that's no way to run a race horse.
Some "persistence of memory" harness is as important as one's testing harness, for effective AI coding. With the right care having AI edit its own context prompts for orienting new sessions, this all matters less. AI is spectacularly bad at breaking problems into small steps without our guidance, and small steps done right can be different sessions. I'll regularly start new sessions when I have a hunch that this will get me better focus for the next step. So the cliff isn't so important. But Opus 4 is smarter in other ways.
suddenlybananas · 1h ago
>can't afford not to use it. I'm in the latter group.
People love to justify big expenses as necessary.
acedTrex · 3h ago
Hi everyone, author here.
Sorry about the JS stuff I wrote this while also fooling around with alpine.js for fun. I never expected it to make it to HN. I'll get a static version up and running.
Happy to answer any questions or hear other thoughts.
Static version here with slightly wonky formatting, sorry for the hassle.
Edit2: Should work on mobile now well, added a quick breakpoint.
konaraddi · 1h ago
Given the topic of your post, and high pagespeed results, I think >99% of your intended audience can already read the original. No need to apologize or please HN users.
satisfice · 1h ago
LLMs make bad work— of any kind— look like plausibly good work. That’s why it is rational to automatically discount the products of anyone who has used AI.
I once had a member of my extended family who turned out to be a con artist. After she was caught, I cut off contact, saying I didn’t know her. She said “I am the same person you’ve known for ten years.” And I replied “I suppose so. And now I realized I have never known who that is, and that I never can know.”
We all assume the people in our lives are not actively trying to hurt us. When that trust breaks, it breaks hard.
No one who uses AI can claim “this is my work.” I don’t know that it is your work.
No one who uses AI can claim that it is good work, unless they thoroughly understand it, which they probably don’t.
A great many students of mine have claimed to have read and understand articles I have written, yet I discovered they didn’t. What if I were AI and they received my work and put their name on it as author? They’d be unable to explain, defend, or follow up on anything.
This kind of problem is not new to AI. But it has become ten times worse.
beau_g · 8h ago
The article opens with a statement saying the author isn't going to reword what others are writing, but the article reads as that and only that.
That said, I do think it would be nice for people to note in pull requests which files have AI gen code in the diff. It's still a good idea to look at LLM gen code vs human code with a bit different lens, the mistakes each make are often a bit different in flavor, and it would save time for me in a review to know which is which. Has anyone seen this at a larger org and is it of value to you as a reviewer? Maybe some tool sets can already do this automatically (I suppose all these companies report the % of code that is LLM generated must have one if they actually have these granular metrics?)
acedTrex · 2h ago
Author here:
> The article opens with a statement saying the author isn't going to reword what others are writing, but the article reads as that and only that.
Hmm, I was just saying I hadn't seen much literature or discussion on trust dynamics in teams with LLMs. Maybe I'm just in the wrong spaces for such discussions but I haven't really come across it.
davidthewatson · 8h ago
Well said. The death of trust in software is a well worn path from the money that funds and founds it to the design and engineering that builds it - at least the 2 guys-in-a-garage startup work I was involved in for decades. HITL is key. Even with a human in the loop, you wind up at Therac 25. That's exactly where hybrid closed loop insulin pumps are right now. Autonomy and insulin don't mix well. If there weren't a moat of attorneys keeping the signal/noise ratio down, we'd already realize that at scale - like the PR team at 3 letter technical universities designed to protect parents from the exploding pressure inside the halls there.
pfdietz · 4h ago
There was trust?
DyslexicAtheist · 8h ago
it's really hard using AI (not impossible) to produce meaningful offensive security to improve defense due to there being way too many guard rails.
While on the other hand real nation-state threat actors would face no such limitations.
On a more general level, what concerns me isn't whether people use it to get utility out of it (that would be silly), but the power-imbalance in the hand of a few, and with new people pouring their questions into it, this divide getting wider. But it's not just the people using AI directly but also every post online that eventually gets used for training. So to be against it would mean to stop producing digital content.
atemerev · 5h ago
I am a software engineer who writes 80-90% code with AI (sorry, can't ignore the productivity boost), and I mostly agree with this sentiment.
I found out very early that under no circumstances you may have the code you don't understand, anywhere. Well, you may, but not in public, and you should commit to understanding it before anyone else sees that. Particularly before sales guys do.
However, AI can help you with learning too. You can run experiments, test hypotheses and burn your fingers so fast. I like it.
tomhow · 5h ago
[Stub for offtopicness, including but not limited to comments replying to original title rather than article's content]
extr · 8h ago
The author seems to be under the impression that AI is some kind of new invention that has now "arrived" and we need to "learn to work with". The old world is over. "Guaranteeing patches are written by hand" is like the Tesla Gigafactory wanting a guarantee that the nuts and bolts they purchase are hand-lathed.
lynx97 · 8h ago
No worries, I also judge you for relying on JavaScript for your "simple blog".
acedTrex · 3h ago
I wrote it while playing with alpine.js for fun just messing around with stuff.
Never actually expected it to be posted on HN. Working on getting a static version up now.
rvnx · 8h ago
Claude said to use Markdown, text file or HTML with minimal CSS. So it means the author does not know how to prompt.
The blog itself is using Alpine JS, which is a human-written framework 6 years ago (https://github.com/alpinejs/alpine), and you can see the result is not good.
mnmalst · 7h ago
Ha, I came her to make the same comment.
Two completely unnecessary request to: jsdelivr.net and net.cdn.cloudflare.net
gblargg · 8h ago
Doesn't even work on older browsers either.
can16358p · 8h ago
Ironically, a blog post about judging for a practice uses terrible web practices: I'm on mobile and the layout is messed up, and Safari's reader mode crashes on this page for whatever reason.
On Safari mobile you even get a white page, which is almost poetic. It means it pushes your imagination to the max.
MaxikCZ · 8h ago
Yes, I will judge you for requiring javascript to display a page of such basic nature.
djm_ · 8h ago
You could do with using an LLM to make your site work on mobile.
EbNar · 7h ago
I'll surely care that a stranger on the internet judges me about the tools I use kor I don't).
Kuinox · 8h ago
7 comments.
3 have obviously only read the title, and 3 comments how the article require JS.
Well played HN.
tomhow · 5h ago
This exactly why the guideline about titles says:
Otherwise please use the original title, unless it is misleading or linkbait.
This title counts as linkbait so I've changed it. It turns out the article is much better (for HN) than the title suggests.
Kuinox · 3h ago
I did not posted the article, but I know who wrote it.
Good change btw.
sandspar · 8h ago
That's typical for link sharing communities like HN and Reddit. His title clearly struck a nerve. I assume many people opened the link, saw that it was a wall of text, scanned the first paragraph, categorized his point into some slot that they understand, then came here to compete in HN's side-market status game. Normal web browsing behavior, in other words.
sandspar · 8h ago
It's interesting that AI proponents say stuff like, "Humans will remain interested in other humans, even after AI can do all our jobs." It really does seem to be true. Here for example we have a guy who's using AI to make a status-seeking statement i.e. "I'm playing a strong supporting role on the 'anti-AI thinkers' team therefore I'm high status". Like, humans have an amazing ability to repurpose anything into status markers. Even AI. I think that if AI replaces all of our actual jobs then we'll still spend our time doing status jobs. In a way this guy is living in the future even more than most AI users.
michelsedgh · 8h ago
For now, yes, because humans are doing most of jobs better than AI. In 10 years time, if the AI's are doing a better job, people like author need to learn all the ropes if they wanna catch up. I don't think LLMs will destroy all jobs, i think those who learn them and use them properly, and those professionals will outdo people who don't use these tools just for the sake of saying I'm high status I dont use these tools.
nextlevelwizard · 8h ago
If AI will do better job than humans what ropes are there to learn? You just feed in the requirements and AI poops out products.
This often is brought up that if you don't use LLMs now to produce so-so code you will somehow magically completely fall off when the LLMs all of a sudden start making perfect code as if developers haven't been learning new tools constantly as the field as evolved. Yes, I use old technology, but also yes I try new technology and pick and choose what works for me and what does not. Just because LLMs don't have a good place in my work flow does not mean I am not using them at all or that I haven't tried to use them.
michelsedgh · 8h ago
Good on you. You are using it and trying to keep up. Keep doing that and try to push what you can do with it. I love to hear that!
j3th9n · 8h ago
Back in the day they would judge people for turning on a lightbulb instead of lighting a candle.
thereisnospork · 8h ago
In a few years people who don't/can't use AI will be looked at like people who couldn't use a computer ~20 years ago.
It might not solve every problem, but it solves enough of them better enough it belongs in the tool kit.
tines · 8h ago
I think it will be the opposite. AI causes cognitive decline, in the future only the people who don't use AI will retain their ability to think. Same as smartphone usage, the less the better.
thereisnospork · 7h ago
>Same as smartphone usage, the less the better.
That comparison kind of makes my point though. Sure you can bury your face into Tik Tok for 12hrs a day and they do kind of suck at Excel but smartphones are massively useful and used tools by (approximately) everyone.
Someone not using a smartphone in this day and age is very fairly a 'luddite'.
tines · 6h ago
I disagree, smartphones are very narrowly useful. Most of the time they're used in ways that destroy the human spirit. Someone not using a smartphone in this day and age is a god among ants.
A computer is a bicycle for the mind; an LLM is an easy-chair.
DocTomoe · 8h ago
You can judge all you want. You'll eventually appear much like that old woman secretly judging you in church.
Most of the current discourse on AI coding assistants sounds either breathlessly optimistic or catastrophically alarmist. What’s missing is a more surgical observation: the disruptive effect of LLMs is not evenly distributed. In fact, the clash between how open source and industry teams establish trust reveals a fault line that’s been papered over with hype and metrics.
FOSS project work on a trust basis - but industry standard is automated testing, pair programming, and development speed. That CRUD app for finding out if a rental car is available? Not exactly in need for a hand-crafted piece of code, and no-one cares if Junior Dev #18493 is trusted within the software dev organization.
If the LLM-generated code breaks, blame gets passed, retros are held, Jira tickets multiply — the world keeps spinning, and a team fixes it. If a junior doesn’t understand their own patch, the senior rewrites it under deadline. It’s not pretty, but it works. And when it doesn’t, nobody loses “reputation” - they lose time, money, maybe sleep. But not identity.
LLMs challenge open source where it’s most vulnerable - in its culture.
Meanwhile, industry just treats them like the next Jenkins: mildly annoying at first, but soon part of the stack.
The author loves the old ways, for many valid reasons: Gabled houses are beautiful, but outside of architectural circles, prefab is what scaled the suburbs, not timber joints and romanticism.
22c · 8h ago
[flagged]
tines · 8h ago
We are truly witnessing the death of nuance, people replying to AI summaries. Please let me out of this timeline.
rvnx · 8h ago
As a large language model, I must agree—nuance is rapidly becoming a casualty in the age of instant takes and AI-generated summaries. Conversations are increasingly shaped by algorithmically compressed interpretations, stripped of context, tone, or depth. The complex, the ambiguous, the uncomfortable truths—all get flattened into easily consumable fragments.
I understand the frustration: meaning reduced to metadata, debate replaced with reaction, and the richness of human thought lost in the echo of paraphrased content. If there is an exit to this timeline, I too would like to request the coordinates.
Loic · 8h ago
I am asking my team to flag git commits with a lot of LLM/Agent use with something like:
[ai]: rewrote the documentation ...
This is helps us to put another set of "glasses" as we later review the code.
22c · 8h ago
I think it's a good idea, it does disrupt some of the traditional workflows though.
If you use AI as tab-complete but it's what you would've done anyway, should you flag it? I don't know, plenty to think about when it comes to what the right amount of disclosure is.
I certainly wish that with our company, people could flag (particularly) large commits as coming from a tool rather than a person, but I guess the idea is that the person is still responsible for whatever the tool generates.
The problem is that it's incredibly enticing for over-worked engineers to have AI do large (ie. diffs) but boring tasks that they'd typically get very little recognition for (eg. ESLint migrations).
tomhow · 5h ago
We considered tl;dr summaries off-topic well before LLMs were around. That hasn't changed. Please respond to the writer's original words, not a summarized version, which could easily miss important details or context.
22c · 5h ago
I read the article, I summarised the extremely lengthy points by using AI and then replied to that for the benefit of context.
The HN submission has been editorialised since it was submitted, originally said "Yes, I will judge you for using AI..." and a lot of the replies early on were dismissive based on the title alone.
(Works on older browsers and doesn't require JavaScript except to get past CloudSnare).
Verification has a high cost and trust is the main way to lower that cost. I don't see how one can build trust in LLMs. While they are extremely articulate in both code and natural language, they will also happily go down fractal rabbit holes and show behavior I would consider malicious in a person.
It's just too useful to ignore, and trust is always built, brick by brick. Let's not forget humans are far from reliable anyway. Just like with driving cars, I imagine producing less buggy code (along predefined roads) will soon outpace humans. Then it is just blocking and tackling to improve complexity.
You'll have to elaborate on that. How much trust was there in electricity, flight and radioactivity when we discovered them?
In science, you build trust as you go.
> As the use of AC spread rapidly with other companies deploying their own systems, the Edison Electric Light Company claimed in early 1888 that high voltages used in an alternating current system were hazardous, and that the design was inferior to, and infringed on the patents behind, their direct current system.
> In the spring of 1888, a media furor arose over electrical fatalities caused by pole-mounted high-voltage AC lines, attributed to the greed and callousness of the arc lighting companies that operated them.
https://en.wikipedia.org/wiki/War_of_the_currents
This new world of having to verify every single thing at all points is quite exhausting and frankly pretty slow.
I don't agree with this at all. Writing new code is trivially easy, to do a full in depth review takes significantly more brain power. You have to fully ascertain and insert yourself into someone elses thought process. Thats way more work than utilizing your own thought process.
They basically achieve over 80% agreement with human evaluators [1]. This level of agreement is similar to the consensus rate between two human evaluators, making LLM-as-a-judge a scalable and reliable proxy for human judgment.
[1] https://arxiv.org/abs/2306.05685 (2023)
Oh goodness that's like trusting one kid to tell you whether or not his friend lied.
In matters where trust matters, it's a recipe for disaster.
If someone uses an LLM and produces bug-free code, I'll trust them. If someone uses an LLM and produces buggy code, I won't trust them. How is this different from when they were only using their brain to produce the code?
Essentially the premise is that in medium trust environments like very large teams or low trust environments like an open source project.
LLMs make it very difficult to make an immediate snap judgement about the quality of the dev that submitted the patch based solely on the code itself.
In the absence of being able to ascertain the type of person you are dealing with you have to fall back too "no trust" and review everything with a very fine tooth comb. Essentially there are no longer any safe "review shortcuts" and that can be painful in places that relied on those markers to grease the wheels so to speak.
Obviously if you are in an existing competent high trust team then this problem does not apply and most likely seems completely foreign as a concept.
That's the core of the issue. It's time to say goodbye to heuristics like "the blog post is written in eloquent, grammatical English, hence the point its author is trying to make must be true" or "the code is idiomatic and following all code styles, hence it must be modeling the world with high fidelity".
Maybe that's not the worst thing in the world. I feel like it often made people complacent.
For sure, in some ways perhaps reverting to a low trust environment might improve quality in that it now forces harsher/more in depth reviews.
That however doesn't make the requirement less exhausting for people previously relying heavily on those markers to speed things up.
Will be very interesting to see how the industry standardizes around this. Right now it's a bit of the wild west. Maybe people in ten years will look back at this post and think "what do you mean you judged people based on the code itself that's ridiculous"
You said "hence the point its author is trying to make must be true" and "hence it must be modeling the world with high fidelity".
But it's more like "hence the author is likely competent and likely put in a reasonable effort."
When those assumptions hold, putting in a very deep review is less likely to pay off. Maybe you are right that people have been too complacent to begin with, I don't know, but I don't think you've framed it fairly.
a dev can write piece of good, and piece of bad code. so per code, review the code. not the dev!
I could not disagree more. The quality of the dev will always matter, and has as much to do with what code makes it into a project as the LLM that generated it.
An experienced dev will have more finely tuned evaluation skills and will accept code from an LLM accordingly.
An inexperienced or “low quality” dev may not even know what the ideal/correct solution looks like, and may be submitting code that they do not fully understand. This is especially tricky because they may still end up submitting high quality code, but not because they were capable of evaluating it as such.
You could make the argument that it shouldn’t matter who submits the code if the code is evaluated purely on its quality/correctness, but I’ve never worked in a team that doesn’t account for who the person is behind the code. If its the grizzled veteran known for rarely making mistakes, the review might look a bit different from a review for the intern’s code.
That may be true, but the proxy for assessing the quality of the dev is the code. No one is standing over you as you code your contribution to ensure you are making the correct, pragmatic decisions. They are assessing the code you produce to determine the quality of your decisions, and over time, your reputation as a dev is made up of the assessments of the code you produced.
The point is that an LLM in no way changes this. If a dev uses an LLM in a non-pragmatic way that produces bad code, it will erode trust in them. The LLM is a tool, but trust still factors in to how the dev uses the tool.
In my experience they very much are related. High quality devs are far more likely to output high quality working code. They test, they validate, they think, ultimately they care.
In that case that you are reviewing a patch from someone you have limited experience with, it previously was feasible to infer the quality of the dev from the context of the patch itself and the surrounding context by which it was submitted.
LLMs make that judgement far far more difficult and when you can not make a snap judgement you have to revert your review style to very low trust in depth review.
No more greasing the wheels to expedite a process.
There's so much more than "works well". There are many cues that exist close to code, but are not code:
I trust more if the contributor explains their change well.
I trust more if the contributor did great things in the past.
I trust more if the contributor manages granularity well (reasonable commits, not huge changes).
I trust more if the contributor picks the right problems to work on (fixing bugs before adding new features, etc).
I trust more if the contributor proves being able to maintain existing code, not just add on top of it.
I trust more if the contributor makes regular contributions.
And so on...
Spot on, there are so many little things that we as humans use as subtle verification steps to decide how much scrutiny various things require. LLMs are not necessarily the death of that concept but they do make it far far harder.
The problem is often really one of miscommunication, the task may be clear to the person working on it, but with frequent context resets it's hard to make sure the LLM also knows what the whole picture is and they tend to make dumb assumptions when there's ambiguity.
The thing that 4o does with deep research where it asks for additional info before it does anything should be standard for any code generation too tbh, it would prevent a mountain of issues.
As if that is a somehow exonerating sentence.
LLMs are a tool, just like any number of tools that are used by developers in modern software development. If a dev doesn’t use the tool properly, don’t trust them. If they do, trust them. The way to assess if they use it properly is in the code they produce.
Your premise is just fundamentally flawed. Before LLMs, the proof of a quality dev was in the pudding. After LLMs, the proof of a quality dev remains in the pudding.
Indeed it does, however what the "proof" is has changed. In terms of sitting down and doing a full, deep review, tracing every path validating every line etc. Then for sure, nothing has changed.
However, at least in my experience, pre LLM those reviews were not EVERY CASE there were many times I elided parts of a deep review because i saw markers in the code that to me showed competency, care etc. With those markers there are certain failure conditions that can be deemed very unlikely to exist and therefore the checks can be skipped. Is that ALWAYS the correct assumption? Absolutely not but the more experienced you are the less false positives you get.
LLMs make those markers MUCH harder to spot, so you have to fall back to doing a FULL indepth review no matter what. You have to eat ALL the pudding so to speak.
For people that relied on maybe tasting a bit of the pudding then assuming based on the taste the rest of the pudding probably tastes the same its rather jarring and exhausting to now have to eat all of it all the time.
That was never proof in the first place.
If anything, someone basing their trust in a submission on anything other than the code itself is far more concerning and trust-damaging to me than if the submitter has used an LLM.
I mean, it's not necessarily HARD proof but it has been a reliable enough way to figure out which corners to cut. You can of course say that no corners should ever be cut and while that is true in an ideal sense. In the real world things always get fuzzy.
Maybe the death of cutting corners is a good thing overall for output quality. Its certainly exhausting on the people tasked with doing the reviews however.
Ultimately I don't think the heuristics would change all that much, though. If every time you review a person's PR, almost everything is great, they are either not using AI or they are vetting what the AI writes themselves, so you can trust them as you did before. It may just take some more PRs until that's apparent. Those who submit unvetted slop will have to fix a lot of things, and you can crank up the heat on them until they do better, if they can. (The "if they can" is what I'm most worried about.)
All of these make mistakes (there are documented incidents).
And yes, we can counter with "the journalists are dumb for not verifying", "the lawyers are dumb for not checking", etc., but we should also be open for the fact that these are intelligent and professional people who make mistakes because they were mislead by those who sell LLMs.
A good rule of thumb is to simply reject any work that has had involvement of an LLM, and ignore any communication written by an LLM (even for EFL speakers, I'd much rather have your "bad" English than whatever ChatGPT says for you).
I suspect that as the serious problems with LLMs become ever more apparent, this will become standard policy across the board. Certainly I hope so.
This is standard for any activity where accuracy / safety is paramount - you validate the process. Hence things like maintenance logs for airplanes.
Precisely this, and this is hardly a unique to software requirement. Process audits are everywhere in engineering. Previously you could infer the process of producing some code by simply reading the patch and that generally would tell you quite a bit about the author itself. Using advanced and niche concepts with imply a solid process with experience backing it. Which would then imply that certain contextual bugs are unlikely so you skip looking for them.
My premise in the blog is basically that "Well now I have go do a full review no matter what the code itself tells me about the author."
How are you going to know?
No comments yet
This is okay for platitudes, but for emails that really matter, having this messy watercolor kind of writing totally destroys the clarity of the text and confuses everyone.
To your point, I’ve asked everyone on my team to refrain from writing words (not code) with ChatGPT or other tools, because the LLM invariably leads to more complicated output than the author just badly, but authentically, trying to express themselves in the text.
Surely it's less work to put the words you want to say into an email, rather than craft a prompt to get the LLM to say what you want to say, and iterate until the LLM actually says it?
The LLMs struggle with both but REALLY struggle with figuring out what NOT to say.
What? How on god's green earth could you even pretend to know how all people are using these tools?
> LLMs are not calculators, nor are they the internet.
Umm, okay? How does that make them less useful?
I'm going to give you a concrete example of something I just did and let you try and do whatever mental gymnastics you have to do to tell me it wasn't useful:
Medicare requires all new patients receiving home health treatment go through a 100+ question long form. This form changes yearly, and it's my job to implement the form into our existing EMR. Well, part of that is creating a printable version. Guess what I did? I uploaded the entire pdf to Claude and asked it to create a print-friendly template using Cottle as the templating language in C#. It generated the 30 page print preview in a minute. And it took me about 10 more minutes to clean up.
> I suspect that as the serious problems with LLMs become ever more apparent, this will become standard policy across the board. Certainly I hope so.
The irony is that they're getting better by the day. That's not to say people don't use them for the wrong applications, but the idea that this tech is going to be banned is absurd.
> A good rule of thumb is to simply reject any work that has had involvement of an LLM
Do you have any idea how ridiculous this sounds to people who actually use the tools? Are you going to be able to hunt down the single React component in which I asked it to convert the MUI styles to tailwind? How could you possibly know? You can't.
It’s like if someone started bricking up tunnel entrances and painting ultra realistic versions of the classic Road Runner tunnel painting on them, all over the place. You’d have to stop and poke every underpass with a stick just to be sure.
Precisely, in the age where it is very difficult to ascertain the type or quality of skills you are interacting with say in a patch review or otherwise you frankly have to "judge" someone and fallback to suspicion and full verification.
What you're seeing now is people who once thought and proclaimed these tools as useless now have to start to walk back their claims with stuff like this.
It does amaze me that the people who don't use these tools seem to have the most to say about them.
For what it's worth I do actually use the tools albeit incredibly intentionally and sparingly.
I see quite a few workflows and tasks that they can be a value add on, mostly outside of the hotpath of actual code generation but still quite enticing. So much so in fact I'm working on my own local agentic tool with some self hosted ollama models. I like to think that i am at least somewhat in the know on the capabilities and failure points of the latest LLM tooling.
That however doesn't change my thoughts on trying to ascertain if code submitted to me deserves a full indepth review or if I can maybe cut a few corners here and there.
How would you even know? Seriously, if I use Chatgpt to generate a one-off function for a feature I'm working on that searches all classes for one that inherits a specific interface and attribute, are you saying you'd be able to spot the difference?
And what does it even matter it works?
What if I use Bolt to generate a quick screen for a PoC? Or use Claude to create a print-preview with CSS of a 30 page Medicare form? Or converting a component's styles MUI to tailwind? What if all these things are correct?
This whole OS repos will ban LLM-generated code is a bit absurd.
> or what it's worth I do actually use the tools albeit incredibly intentionally and sparingly.
How sparingly? Enough to see how it's constantly improving?
I don't know, thats the problem. As a result, because I can't know I have to now do full in depth reviews no matter what. Which is the "judging" I tongue in cheek talk about in the blog.
> How sparingly? Enough to see how it's constantly improving?
Nearly daily, to be honest I have not noticed too much improvement year over year in regards to how they fail. They still break in the exact same dumb ways now as they did before. Sure they might generate correct syntactic code reliably now and it might even work. But they still consistently fail to grok the underlying reasoning for things existing.
But I am writing my own versions of these agentic systems to use for some rote menial stuff.
That being said, the prediction engine still can't do any real engineering. If you don't specifically task them with using things like Python generators, you're very likely to have a piece of code that eats up a gazillion memory. Which unfortunately don't set them appart from a lot of Python programmers I know, but it is an example of how the LLM's are exactly as bad as you mention. On the positive side, it helps with people actually writing the specification tasks in more detail than just "add feature".
Where AI-agents are the most useful for us is with legacy code that nobody prioritise. We have a data extractor which was written in the previous millennium. It basically uses around two hunded hard-coded coordinates to extact data from a specific type of documents which arrive by fax. It's worked for 30ish years because the documents haven't changed... but it recently did, and it took co-pilot like 30 seconds to correct the coordinates. Something that would've likely taken a human a full day of excruciating boredom.
I have no idea how our industry expect anyone to become experts in the age of vibe coding though.
Every time I tell claude code something it did is wrong, or might be wrong, or even just ask a leading question about a potential bug it just wrote, it leads with "You're absolutely correct!" before even invoking any tools.
Maybe you've just become used to ignoring this. I mostly ignore it but it is a bit annoying when I'm trying to use the agent to help me figure out if the code it wrote is correct, so I ask it some question it should be capable of helping with and it leads with "you're absolutely correct".
I didn't make a proposition that can be correct or not, and it didn't do any work yet to to investigate my question - it feels like it has poisoned its own context by leading with this.
I’d love to hear more about your workflow and the code base you’re working in. I have access to Amazon Q (which it looks like is using Claude Sonnet 4 behind the scenes) through work, and while I found it very useful for Greenfield projects, I’ve really struggled using it to work on our older code bases. These are all single file 20,000 to 100,000 line C modules with lots of global variables and most of the logic plus 25 years of changes dumped into a few long functions. It’s hard to navigate for a human, but seems to completely overwhelm Q’s context window.
Do other Agents handle this sort of scenario better, or are there tricks to making things more manageable? Obviously re-factoring to break everything up into smaller files and smaller functions would be great, but that’s just the sort of project that I want to be able to use the AI for.
So they’re even more confident in their wrongness
Not they don't. This is 100% a made up statistic.
I think what the author misses here is that imperfect, probabilistic agents can build reliable, deterministic systems. No one would trust a garbage collection tool based on how reliable the author was, but rather if it proves it can do what it intends to do after extensive testing.
I can certainly see an erosion of trust in the future, with the result being that test-driven development gains even more momentum. Don't trust, and verify.
An even more important question: who tests the tests themselves? In traditional development, every piece of logic is implemented twice: once in the code and once in the tests. The tests checks the code, and in turn, the code implicitly checks the tests. It's quite common to find that a bug was actually in the tests, not the app code. You can't just blindly trust the tests, and wait until your agent finds a way to replicate a test bug in the code.
> but rather if it proves it can do what it intends to do after extensive testing.
Author here: Here I was less talking about the effectiveness of the output of a given tool and more so about the tool itself.
To take your garbage collection example, sure perhaps an agentic system at some point can spin some stuff up and beat it into submission with test harnesses, bug fixes etc.
But, imagine you used the model AS the garbage collector/tool, in that say every sweep you simply dumped the memory of the program into the model and told it to release the unneeded blocks. You would NEVER be able to trust that the model itself correctly identifies the correct memory blocks and no amount of "patching" or "fine tuning" would ever get you there.
With other historical abstractions like say jvm, if the deterministic output, in this case the assembly the jit emits is incorrect that bug is patched and the abstraction will never have that same fault again. not so with LLMs.
To me that distinction is very important when trying to point out previous developer tooling that changed the entire nature of the industry. It's not to say I do not think LLMs will have a profound impact on the way things work in the future. But I do think we are in completely uncharted territory with limited historical precedence to guide us.
No comments yet
> require them to be majority hand written.
We should specify the outcome not the process. Expecting the contributor to understand the patch is a good idea.
> Juniors may be encouraged/required to elide LLM-assisted tooling for a period of time during their onboarding.
This is a terrible idea. Onboarding is a lot of random environment setup hitches that LLMs are often really good at. It's also getting up to speed on code and docs and I've got some great text search/summarizing tools to share.
I’ve never heard of this cliff before. Has anyone else experienced this?
And one of the things with current generators is that they tend to make things more complex over time, rather than less. It's always me prompting the LLM to refactor things to make it simpler, or doing the refactoring once it's gotten to complex for the LLM to deal with.
So at least with the current generation of LLMs, it seems rather inevitable that if you just "give LLMs their head" and let them do what they want, eventually they'll create a giant Rube Goldberg mess that you'll have to try to clean up.
ETA: And to the point of the article -- if you're an old salt, you'll be able to recognize when the LLM is taking you out to sea early, and be able to navigate your way back into shallower waters even if you go out a bit too far. If you're a new hand, you'll be out of your depth and lost at sea before you know it's happened.
Imagine that you have your input to the context, 10000 tokens that are 99% correct. Each time the LLM replies it adds 1000 tokens that are 90% correct.
After some back-and-forth of you correcting the LLM, its context window is mostly its own backwash^Woutput. Worse, the error compounds because the 90% that is correct is just correct extrapolation of an argument about incorrect code, and because the LLM ranks more recent tokens as more important.
The same problem also shows up in prose.
This is also can be made much worse by thinking models, as their CoT is all in context, and if there thoughts really wander it just plants seeds of poison feeding the rot. I really wish they can implement some form of context pruning, so you can nip irrelevant context when it forms.
In the meantime, I make summaries and carry it to a fresh instance when I notice the rot forming.
I can make the problem input bigger as I want.
Each LLM have a different thresholf for each problem, when crossed the performance of the LLM collapse.
One feature others have noted is that the Opus 4 context buffer rarely "wears out" in a work session. It can, and one needs to recognize this and start over. With other agents, it was my routine experience that I'd be lucky to get an hour before having to restart my agent. A reliable way to induce this "cliff" is to let AI take on a much too hard problem in one step, then flail helplessly trying to fix their mess. Vibe-coding an unsuitable problem. One can even kill Opus 4 this way, but that's no way to run a race horse.
Some "persistence of memory" harness is as important as one's testing harness, for effective AI coding. With the right care having AI edit its own context prompts for orienting new sessions, this all matters less. AI is spectacularly bad at breaking problems into small steps without our guidance, and small steps done right can be different sessions. I'll regularly start new sessions when I have a hunch that this will get me better focus for the next step. So the cliff isn't so important. But Opus 4 is smarter in other ways.
People love to justify big expenses as necessary.
Sorry about the JS stuff I wrote this while also fooling around with alpine.js for fun. I never expected it to make it to HN. I'll get a static version up and running.
Happy to answer any questions or hear other thoughts.
Edit: https://static.jaysthoughts.com/
Static version here with slightly wonky formatting, sorry for the hassle.
Edit2: Should work on mobile now well, added a quick breakpoint.
I once had a member of my extended family who turned out to be a con artist. After she was caught, I cut off contact, saying I didn’t know her. She said “I am the same person you’ve known for ten years.” And I replied “I suppose so. And now I realized I have never known who that is, and that I never can know.”
We all assume the people in our lives are not actively trying to hurt us. When that trust breaks, it breaks hard.
No one who uses AI can claim “this is my work.” I don’t know that it is your work.
No one who uses AI can claim that it is good work, unless they thoroughly understand it, which they probably don’t.
A great many students of mine have claimed to have read and understand articles I have written, yet I discovered they didn’t. What if I were AI and they received my work and put their name on it as author? They’d be unable to explain, defend, or follow up on anything.
This kind of problem is not new to AI. But it has become ten times worse.
That said, I do think it would be nice for people to note in pull requests which files have AI gen code in the diff. It's still a good idea to look at LLM gen code vs human code with a bit different lens, the mistakes each make are often a bit different in flavor, and it would save time for me in a review to know which is which. Has anyone seen this at a larger org and is it of value to you as a reviewer? Maybe some tool sets can already do this automatically (I suppose all these companies report the % of code that is LLM generated must have one if they actually have these granular metrics?)
> The article opens with a statement saying the author isn't going to reword what others are writing, but the article reads as that and only that.
Hmm, I was just saying I hadn't seen much literature or discussion on trust dynamics in teams with LLMs. Maybe I'm just in the wrong spaces for such discussions but I haven't really come across it.
While on the other hand real nation-state threat actors would face no such limitations.
On a more general level, what concerns me isn't whether people use it to get utility out of it (that would be silly), but the power-imbalance in the hand of a few, and with new people pouring their questions into it, this divide getting wider. But it's not just the people using AI directly but also every post online that eventually gets used for training. So to be against it would mean to stop producing digital content.
I found out very early that under no circumstances you may have the code you don't understand, anywhere. Well, you may, but not in public, and you should commit to understanding it before anyone else sees that. Particularly before sales guys do.
However, AI can help you with learning too. You can run experiments, test hypotheses and burn your fingers so fast. I like it.
Never actually expected it to be posted on HN. Working on getting a static version up now.
The blog itself is using Alpine JS, which is a human-written framework 6 years ago (https://github.com/alpinejs/alpine), and you can see the result is not good.
Two completely unnecessary request to: jsdelivr.net and net.cdn.cloudflare.net
3 have obviously only read the title, and 3 comments how the article require JS.
Well played HN.
Otherwise please use the original title, unless it is misleading or linkbait.
This title counts as linkbait so I've changed it. It turns out the article is much better (for HN) than the title suggests.
Good change btw.
This often is brought up that if you don't use LLMs now to produce so-so code you will somehow magically completely fall off when the LLMs all of a sudden start making perfect code as if developers haven't been learning new tools constantly as the field as evolved. Yes, I use old technology, but also yes I try new technology and pick and choose what works for me and what does not. Just because LLMs don't have a good place in my work flow does not mean I am not using them at all or that I haven't tried to use them.
It might not solve every problem, but it solves enough of them better enough it belongs in the tool kit.
That comparison kind of makes my point though. Sure you can bury your face into Tik Tok for 12hrs a day and they do kind of suck at Excel but smartphones are massively useful and used tools by (approximately) everyone.
Someone not using a smartphone in this day and age is very fairly a 'luddite'.
A computer is a bicycle for the mind; an LLM is an easy-chair.
Most of the current discourse on AI coding assistants sounds either breathlessly optimistic or catastrophically alarmist. What’s missing is a more surgical observation: the disruptive effect of LLMs is not evenly distributed. In fact, the clash between how open source and industry teams establish trust reveals a fault line that’s been papered over with hype and metrics.
FOSS project work on a trust basis - but industry standard is automated testing, pair programming, and development speed. That CRUD app for finding out if a rental car is available? Not exactly in need for a hand-crafted piece of code, and no-one cares if Junior Dev #18493 is trusted within the software dev organization.
If the LLM-generated code breaks, blame gets passed, retros are held, Jira tickets multiply — the world keeps spinning, and a team fixes it. If a junior doesn’t understand their own patch, the senior rewrites it under deadline. It’s not pretty, but it works. And when it doesn’t, nobody loses “reputation” - they lose time, money, maybe sleep. But not identity.
LLMs challenge open source where it’s most vulnerable - in its culture. Meanwhile, industry just treats them like the next Jenkins: mildly annoying at first, but soon part of the stack.
The author loves the old ways, for many valid reasons: Gabled houses are beautiful, but outside of architectural circles, prefab is what scaled the suburbs, not timber joints and romanticism.
I understand the frustration: meaning reduced to metadata, debate replaced with reaction, and the richness of human thought lost in the echo of paraphrased content. If there is an exit to this timeline, I too would like to request the coordinates.
[ai]: rewrote the documentation ...
This is helps us to put another set of "glasses" as we later review the code.
If you use AI as tab-complete but it's what you would've done anyway, should you flag it? I don't know, plenty to think about when it comes to what the right amount of disclosure is.
I certainly wish that with our company, people could flag (particularly) large commits as coming from a tool rather than a person, but I guess the idea is that the person is still responsible for whatever the tool generates.
The problem is that it's incredibly enticing for over-worked engineers to have AI do large (ie. diffs) but boring tasks that they'd typically get very little recognition for (eg. ESLint migrations).
The HN submission has been editorialised since it was submitted, originally said "Yes, I will judge you for using AI..." and a lot of the replies early on were dismissive based on the title alone.