The unreasonable effectiveness of an LLM agent loop with tool use
447 points by crawshaw | 320 comments | 5/15/2025, 7:33:44 PM | sketch.dev ↗
It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.
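To make "something like this" concrete, here's a minimal sketch of such a loop in Python, using the OpenAI chat-completions tool-calling API and a single bash tool. It's an illustration rather than the article's code: the model name is a placeholder and the command execution is deliberately naive, so sandbox it before pointing it at anything real.

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def run_bash(command: str) -> str:
    # Deliberately naive: run the command and return its (truncated) output.
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return (result.stdout + result.stderr)[-4000:]

tools = [{
    "type": "function",
    "function": {
        "name": "run_bash",
        "description": "Run a bash command and return its stdout and stderr.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user", "content": "Is Python installed on this system? If so, which version?"}]

while True:
    response = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
    msg = response.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:  # no tool requested: the model considers itself done
        print(msg.content)
        break
    for call in msg.tool_calls:  # run each requested tool and feed the output back in
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_bash(**args),
        })
```

That while loop plus the tool-result messages is essentially the whole trick.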
This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.
[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
It is expensive and slow to have an LLM use tools all the time to solve the problem. The next step is to convert frequent patterns of tool calls into a single pure function, performing whatever transformations of inputs and outputs are needed along the way (an LLM can help you build these functions), and then perhaps train a simple cheap classifier to always send incoming data to this new function, bypassing LLMs altogether.
In time, this will mean you will use LLMs less and less, limiting their use to new problems that can't be classified. This is basically a "cache" for LLM-based problem solving, where the keys are shapes of problems.
The idea of LLMs running 24/7 solving the same problems in the same way over and over again should become a distant memory, though not one that an AI company with vested interest in selling as many API calls as possible will want people to envision. Ideally LLMs are only needed to be employed once or a few times per novel problem before being replaced with cheaper code.
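A hedged sketch of what that "cache" could look like: a cheap classifier routes recognizable requests to a distilled pure function, and only unrecognized ones fall through to the agent. Everything here (the example requests, csv_to_json, the run_agent_loop stub) is illustrative, not a description of any real system.

```python
import csv
import io
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def csv_to_json(payload: str) -> str:
    # A pure function distilled from a frequently repeated pattern of tool calls.
    return json.dumps(list(csv.DictReader(io.StringIO(payload))))

def run_agent_loop(request: str, payload: str) -> str:
    # Stub for the expensive LLM agent loop, reserved for novel problems.
    raise NotImplementedError

# Tiny labeled history of past requests: which ones the cached function handled.
history = [
    ("convert this CSV export to JSON", "csv_to_json"),
    ("turn the attached csv into json records", "csv_to_json"),
    ("summarize this incident report", "llm"),
    ("draft a reply to this customer email", "llm"),
]
texts, labels = zip(*history)
vectorizer = TfidfVectorizer().fit(texts)
router = LogisticRegression().fit(vectorizer.transform(texts), labels)

def handle(request: str, payload: str) -> str:
    route = router.predict(vectorizer.transform([request]))[0]
    if route == "csv_to_json":
        return csv_to_json(payload)          # cache hit: no LLM involved
    return run_agent_loop(request, payload)  # cache miss: fall back to the LLM
```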
Have you?
https://news.ycombinator.com/item?id=43984860
https://radanskoric.com/articles/coding-agent-in-ruby
We are using Ruby to build a powerful AI toolset in the construction space, and we love how simple all of the SaaS parts are and that we're not reinventing the wheel, but the Ruby LLM SDK ecosystem is a bit lagging, so we've written a lot of our own low-level tools.
(btw we are also hiring rubyists https://news.ycombinator.com/item?id=43865448)
Pretty much WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2]
This goes well with the Microsoft paper "LLMs Get Lost in Multi-Turn Conversation" that was published Friday [1].
- [1]: https://arxiv.org/abs/2505.06120
- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts
Without more information I'm very skeptical that you had e.g. Claude Code create a whole app (so more than a simple script) with 20 cents. Unless it was able to one-shot it, but at that point you don't need an agent anyway.
They're a lot like a human in that regard, but we haven't been building that reflection and self awareness into them so far, so it's like a junior that doesn't realize when they're over their depth and should get help.
I constantly have to instruct them:
- Go step by step, don't skip ahead until we're done with a step
- Don't make assumptions, if you're unsure ask questions to clarify
And they mostly do this.
But this needs to be default behavior!
I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.
Once I'm happy that the readme accurately reflects what I want to build and all the architectural/technical/usage challenges have been addressed, I let the agent rip, instructing it to build one thing at a time, then typecheck, lint and test the code to ensure correctness, fixing any errors it finds (and re-running automated checks) before moving on to the next task. Given this workflow I've built complex software using agents with basically no intervention needed, with the exception of rare cases where its testing strategy is flakey in a way that makes it hard to get the tests passing.
Just curious, could you expand on the precise tools or way you do this?
For example, do you use the same well-crafted prompt in Claude or Gemini and use their in-house document curation features, or do you use a file in VS Code with Copilot Chat and just say "assist me in writing the requirements for this project in my README, ask questions, perform a socratic discussion with me, build a roadmap"?
You said you had 'great success' and I've found AI to be somewhat underwhelming at times, and I've been wondering if it's because of my choice of models, my very simple prompt engineering, or if my inputs are just insufficient/too complex.
# Cursor Rules for This Project
## Overview

## Engineering Mindset
- Prioritize *clarity, simplicity, robustness, and extensibility*.
- Solve problems thoughtfully, considering the long-term maintainability of the code.
- Challenge assumptions and verify problem understanding during design discussions.
- Avoid cleverness unless it significantly improves readability and maintainability.
- Strive to make code easy to test, easy to debug, and easy to change.
## Design First
- Before coding, establish a clear understanding of the problem and the proposed solution.
- When designing, ask:
  - What are the failure modes?
  - What will be the long-term maintenance burden?
  - How can this be made simpler without losing necessary flexibility?
- Update documentation during the design phase:
  - `README.md` for project-level understanding.
  - Architecture diagrams (e.g., PlantUML, D2) are encouraged for complex flows.
I use auto lint/test in aider like so:
file:
  - README.md
  - STYLEGUIDE.md
  - .cursorrules

aiderignore: .gitignore

# Commands for linting, typechecking, testing
lint-cmd:
  - bun run lint
  - bun run typecheck

test-cmd: bun run test
Since you shared yours, it's only fair to share mine :). In my current projects, two major files I use are:
[[ CONVENTIONS.md ]] -- tends to be short and project-specific; looks like this:
Project conventions
- Code must run entirely client-side (i.e. in-browser)
- Prefer solutions not requiring a build step - such as vanilla HTML/JS/CSS
- Minimize use of dependencies, and vendor them
[[ AI.md ]] -- this I guess is similar to what people put in .cursorrules; mine looks like this:

# Extra AI instructions

Here are stored extra guidelines for you.
## AI collaborative project
I'm relying on you to do a good job here and I'm happy to embrace the directions you're giving, but I'll be editing it on my own as well.
## Evolving your instruction set
If I tell you to remember something, behave differently, or you realize yourself you'd benefit from remembering some specific guideline, please add it to this file (or modify existing guideline). The format of the guidelines is unspecified, except second-level headers to split them by categories; otherwise, whatever works best for you is best. You may store information about the project you want to retain long-term, as well as any instructions for yourself to make your work more efficient and correct.
## Coding Practice Guidelines
Strive to adhere to the following guidelines to improve code quality and reduce the need for repeated corrections:
## Project platform note

This project is targeting a Raspberry Pi 2 Model B V1.1 board with a 3.5 inch TFT LCD touchscreen sitting on top. That touchscreen is enabled/configured via system overlay and "just works", and is currently drawn to via a framebuffer approach.
Keep in mind that the Raspberry Pi board in question is old and can only run 32-bit code. Relevant specs:
The board is dedicated to running this project and any supplementary tooling. There's a Home Assistant instance involved in the larger system to which this is deployed, but that's running on a different board.

## Project Files and Structure
This section outlines the core files of the project.
<<I let the AI put its own high-level "repo map" here, as recently I've found Aider has not been generating any useful repo maps for me, for unknown reasons.>>
-------
This file ends up evolving from project to project, and it's not as project-independent as I'd like; I let AI add guidelines to this file based on a discussion (e.g. it's doing something systematically wrong and I point it out and tell it to remember). Also note that a lot of the guidelines are focused on keeping projects broken down into a) lots of files, to reduce context use as it grows, and b) small, well-structured files, to minimize the amount of broken SEARCH/REPLACE diff blocks; something that's still a problem with Aider for me, despite models getting better.
I usually start by going through the initial project ideas in "ask mode", then letting it build the SPECIFICATION.md document and a PLAN.md document with a 2-level (task/subtask) work breakdown.
And yes, I've been using gemini for the past month or two - ever since gemini-2.5-pro came out and topped the Aider benchmark. It's good, but it sure does comment excessively, including in places like quoted scripts, where those comments are a syntax error...
I've tried the current top combo from Aider's benchmarks last night - that is, o3 (high) architect + GPT-4.1 editor. It's refreshingly concise, generates much smaller diffs, but man does it burn through money - a factor 16x relative to gemini-2.5-pro-preview-05-06. Not sure if it's worth it.
If you feel those tips are good then you are just a bad judge of tips. There is a reason self-help books sell so well even though they don't really help anyone: their goal is to write a lot of tips that sound good because they are vague and general, but they don't really help the reader.
I'm sorry if you're using it wrong.
Eventually it'll do something wrong or I realize I wanted things differently, which necessitates some further conversation, but other than that, it's just "go on" until we run out of plan, then devising a new plan, rinse repeat.
Only on this website of completely reality-detached individuals would such an obvious comment be needed.
Maybe consider that if you don't find it useful, you're working on problems it's not good at, or, even more likely, you just suck at using the tools.
Anybody that finds value in LLMs has a hard time understanding how one would conclude they are useless and that you can't "give it instructions because that's the hard part", but it's actually really easy to understand: the folks who think this are just bad at it. We aren't living in some detached reality. The reality is that some people are just better at this than others.
Explains a lot about software quality these days.
They also added the first pass of multi-monitor support for my WM while I was using it (restarted it repeatedly while Claude Code worked, in the same X session the terminal it was working in was running).
You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files once you've approved safe tool uses etc.
Just like humans and human organisations also tend to experience drift, unless anchored in reality.
1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)
Longer term, I don't think this holds due to the nature of capitalism.
If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.
[0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...
I thoroughly enjoyed his “writing an interpreter”. I guess I’m going to build an agent now.
No comments yet
Makes that "$3 billion" valuation for Windsurf very suspect
The price tag is hefty but I figure it'll work out for them on the backside because they won't have to fight so hard to capture TAM.
In my experience, that next 9% will take 9 times the effort.
And that next 0.9% will take 9 times the effort.
And so on.
So 90% is very far off from 99.999% reliability. Which would still be less reliable than an ec2 instance.
It didn't go well. I started with 4o:
- It used a deprecated package.
- After I pointed that out, it didn't update all usages - so I had to fix them manually.
- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.
- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.
- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".
That's when I gave up.
Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.
That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...
I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.
You can tell it "look up the most recent version of library X and use that" and it will often work!
I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:
This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.
It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...
When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.
>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.
Thanks for the tip!
That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.
Which is precisely the issue with the idea of LLMs completely replacing human engineers. It doesn't understand this context unless a human tells it to understand that context.
https://aider.chat/docs/leaderboards/
I think if I were giving access to bash, though, it would definitely be in a docker container for me as well.
A good programmer with AI tools will run circles around a good programmer without AI tools.
I'm not totally convinced that we won't see a similar effect here, with some really competitive coders 100% eschewing LLMs and still doing as well as the best that use them.
No, they didn't.
You can get vim and Emacs on par with IDEs[0] somewhat easily thanks to Language Server Protocol. You can't turn them into "fighter jets" without "super-heavyweight LLMs" because that's literally what, per GP, makes an editor/IDE a fighter jet. Yes, Emacs has packages for LLM integration, and presumably so does Vim, but the whole "fighter jet vs. bicycle" is entirely about SOTA LLMs being involved or not.
--
[0] - On par wrt. project-level features IDEs excel at; both editors of course have other aspects that none of the IDEs ever come close to.
Maybe we will all use LLMs one day in neovim too. :)
My success ratio still isn't very high, but for certain easy tasks, I'll let an LLM take a crack at it.
I'm not really convinced by this. Easy problems rarely take up much of my time.
After pointing out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks –, but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.
This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.
One LLM reviews existing code and the new requirement and then creates a PRD. I usually use Augment Code for this because it has a good index of all local code.
I then ask Google Gemini to review the PRD and validate it and find ways to improve it. I then ask Gemini to create a comprehensive implementation plan. It frequently creates a 13 step plan. It would usually take me a month to do this work.
I then start a new session of Augment Code, feed it the PRD and one of the 13 tasks at a time. Whatever work it does, it checks it in a feature branch with detailed git commit comment. I then ask Gemini to review the output of each task and provide feedback. It frequently finds issues with implementation or areas of improvement.
All of this managed by using git. I make LLMs use git. I think would go insane if I had to copy/paste this much stuff.
I have a recipe of prompts that I copy/paste. I am trying to find ways to cut that down and making slow progress in this regard. There are tools like "Task Master" (https://github.com/eyaltoledano/claude-task-master) that do a good job of automating this workflow. However this tool doesn't allow much customization. e.g. Have LLMs review each other's work.
But, maybe I can get LLMs to customize that part for me...
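The skeleton of that cross-review loop is simple enough to sketch. This isn't Task Master or Augment Code, just the generic pattern, with placeholder OpenAI model names standing in for whichever generator/reviewer pair you actually use:

```python
from openai import OpenAI

client = OpenAI()

def complete(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def generate_with_review(task: str, generator="gpt-4.1", reviewer="o3", max_rounds=3) -> str:
    # One model writes the change, a second model reviews it; iterate until approved.
    patch = complete(generator, f"Write a patch (unified diff) for this task:\n\n{task}")
    for _ in range(max_rounds):
        review = complete(reviewer,
            "Review the patch below for the given task. Reply with the single word APPROVED "
            f"if it is correct, otherwise list concrete problems.\n\nTask:\n{task}\n\nPatch:\n{patch}")
        if review.strip().upper().startswith("APPROVED"):
            break
        patch = complete(generator,
            f"Revise the patch to address this review.\n\nTask:\n{task}\n\nPatch:\n{patch}\n\nReview:\n{review}")
    return patch
```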
My best results are usually with 4o-mini-high, o3 is sometimes pretty good
I personally don’t like the canvas. I prefer the output on the chat
And a lot of times I say: provide full code for this file, or provide drop-in replacement (when I don’t want to deal with all the diffs). But usually at around 300-400 lines of code, it starts getting bad and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file)
For coding, I stick to Claude 3.5 / 3.7 and recently Gemini 2.5 Pro. I sometimes use o3 in ChatGPT when I can't be arsed to fire up Aider, or really need to use its search features to figure out how to do something (e.g. pinouts for some old TFT screens for ESP32 and Raspberry Pi, most recently).
o1-pro, o1-preview can generate updated full file responses into the 1k LOC range.
It's something about their internal verification methods that make it an actual viable development method.
It's interesting how the same model being served through different interfaces (chat vs api), can behave differently based on the economic incentives of the providers
Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool, cutting & pasting from the chat window is so last week.
It's somewhat ironic the more behind the leading edge you are, the more efficient it is to make the gains eventually because you don't waste time on the micro-gain churn, and a bigger set of upgrades arrives when you get back on the leading edge.
I watched this dynamic play out so many times in the image generation space with people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows. New model comes out and boom, all nullified and the churn started all over again. I eventually got sick of the churn. Batching the gains worked better.
Devs have been doing micro changes to their setup for 50 years. It is the nature of their beast.
In my world, they were given 9 years to switch to Python 3 even if you write off 3.0 and 3.1 as premature, and they still missed by years, and loudly complained afterwards.
And they still can't be bothered to learn what a `pyproject.toml` is, let alone actually use it for its intended purpose. One of the most popular third-party Python libraries (Requests), which is under stewardship by the PSF, which uses only Python code, had its "build" (no compilation - purely a matter of writing metadata, shuffling some files around and zipping it up) broken by the removal of years-old functionality in Setuptools that they weren't even actually remotely reliant upon. Twice, in the last year.
It takes me ~1 week to merge small fixes to their build system (which they don't understand anyway so they just approve whatever).
Giving sharp knives to monkeys would be another.
Most people would guess it threatens their identity. Sensitive intellectuals who found a way to feel safe by acquiring deep domain-specific expertise suddenly feel vulnerable.
In addition, a programmer's job, on the whole, has always been something like modelling the world in a predictable way so as to minimise surprise.
When things change at this rate/scale, it also goes against deep rooted feelings about the way things should work (they shouldn't change!)
Change forces all of us to continually adapt and to not rest on our laurels. Laziness is totally understandable, as is the resulting anger, but there's no running away from entropy :}
For context: we're specifically discussing vibe coding, not AI or LLMs.
With that in mind, do you think any of the rest of your comment is on-topic?
Not every software developer is hired to do trivial frontend work.
This is a very important flaw that you should probably seek to correct.
I'm not talking about LLMs, which I use and consider useful, I'm specifically talking about vibe coding, which involves purposefully not understanding any of it, just copying and pasting LLM responses and error codes back at it, without inspecting them. That's the description of vibe coding.
The analogy with "monkeys with knives" is apt. A sharp knife is a useful tool, but you wouldn't hand it to an inexperienced person (a monkey) incapable of understanding the implications of how knives cut.
Likewise, LLMs are useful tools, but "vibe coding" is the dumbest thing ever to be invented in tech.
> OBVIOUSLY working
"Obviously working" how? Do you mean prototypes and toy examples? How will these people put something robust and reliable in production, ever?
If you meant for fun & experimentation, I can agree. Though I'd say vibe coding is not even good for learning because it actively encourages you not to understand any of it (or it stops being vibe coding, and turns into something else). Is that what you're advocating as "obviously working"?
Could an experienced person/dev vibe code?
> "Obviously working" how? Do you mean prototypes and toy examples? How will these people put something robust and reliable in production, ever?
You really don't think AI could generate a working CRUD app which is the financial backbone of the web right now?
> If you meant for fun & experimentation, I can agree. Though I'd say vibe coding is not even good for learning because it actively encourages you not to understand any of it (or it stops being vibe coding, and turns into something else). Is that what you're advocating as "obviously working"?
I think you're purposefully reducing the scope of what vibe coding means to imply it's a fire and forget system.
Sure, but why? They already paid the price in time/effort of becoming experienced, why throw it all away?
> You really don't think AI could generate a working CRUD app which is the financial backbone of the web right now?
A CRUD? Maybe. With bugs and corner cases and scalability problems. A robust system in other conditions? Nope.
> I think you're purposefully reducing the scope of what vibe coding means to imply it's a fire and forget system.
It's been pretty much described like that. I'm using the standard definition. I'm not arguing against LLM-assisted coding, which is a different thing. The "vibe" of vibe coding is the key criticism.
You spend 1/10 amount of time doing something, you have 9/10 of that time to yourself.
> A CRUD? Maybe. With bugs and corner cases and scalability problems. A robust system in other conditions? Nope.
Now you're just inventing stuff. "scalability problems" for a CRUD app. You obviously haven't used it. If you know how to prompt the AI it's very good at building basic stuff, and more advanced stuff with a few back and forth messages.
> It's been pretty much described like that. I'm using the standard definition. I'm not arguing against LLM-assisted coding, which is a different thing. The "vibe" of vibe coding is the key criticism.
By whom? Wikipedia says
> Vibe coding (or vibecoding) is an approach to producing software by depending on artificial intelligence (AI), where a person describes a problem in a few sentences as a prompt to a large language model (LLM) tuned for coding. The LLM generates software based on the description, shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code.[1][2][3] Vibe coding is claimed by its advocates to allow even amateur programmers to produce software without the extensive training and skills required for software engineering.[4] The term was introduced by Andrej Karpathy in February 2025[5][2][4][1] and listed in the Merriam-Webster Dictionary the following month as a "slang & trending" noun.[6]
Emphasis on "shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code" which means you don't blindly dump code into the world.
I have used AI/LLMs; in fact I use them daily and they've proven helpful. I'm talking specifically about vibe coding, which is dumb.
> By whom? [...] Emphasis on "shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code" which means you don't blindly dump code into the world.
By Andrej Karpathy, who popularized the term and describes it as mostly blindly dumping code into the world:
> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
He even claims "it's not too bad for throwaway weekend projects", not for actual production-ready and robust software... which was my point!
Also see Merriam-Webster's definition, mentioned in the same Wikipedia article you quoted: https://www.merriam-webster.com/slang/vibe-coding
> Writing computer code in a somewhat careless fashion, with AI assistance
and
> In vibe coding the coder does not need to understand how or why the code works, and often will have to accept that a certain number of bugs and glitches will be present.
and, M-W quoting the NYT:
> You don’t have to know how to code to vibecode — just having an idea, and a little patience, is usually enough.
and, quoting from Ars Technica
> Even so, the risk-reward calculation for vibe coding becomes far more complex in professional settings. While a solo developer might accept the trade-offs of vibe coding for personal projects, enterprise environments typically require code maintainability and reliability standards that vibe-coded solutions may struggle to meet.
I must point out this is more or less the claim I made and which you mocked with your CRUD remarks.
You're adding "badly" like it's a fact when it is not. Again, in my experience, in the experience of people around me and many experiences of people online AI is more than capable of doing "simpler" stuff on its own.
> By Andrej Karpathy, who popularized the term
Nowhere in your quoted definitions does it say you don't *ever* look at the code. MW says non-programmers can vibe code, also in a "somewhat careless fashion"; none of those imply you CANNOT look at the code for it to be vibe coding. If Andrej didn't look at it, that doesn't mean the definition is that you are not to look at it.
> which you mocked with your CRUD remarks
I mocked nothing, I just disagree with you since as a dev with over 10 years of experience I've been using AI for both my job and personal projects with great success. People that complain about AI expect it to parse "Make an ios app with stuff" successfully, and I am sure it will at some point, but now it requires finer grain instructions to ensure its success.
I've vibe-coded completely functional mobile apps, and used a handful LLMs to augment my development process in desktop applications.
From that experience, I understand why parsing metrics from this practice is difficult. Really, all I can say is that codegen LLMs are too slow and inefficient for my workflow.
Saying that as I’ve got vibe-coded React internal tooling used in production without issues; it easily saved days of work.
Vibe coding as was explained by the popularizer of the term involves no coding. You just paste error messages, paste the response of the LLM, paste the error messages back, paste the response, and pray that after several iterations the thing converges to a result.
It involves NOT looking at either the LLM output or the error messages.
Maybe you're using a different definition?
Horror stories from newbies launching businesses and getting their data stolen because they trust models are to be expected, but I would not call them vibe coding horror stories, since there is no coding involved even by proxy, it's copy pasting on steroids. Blind copy pasting from stack overflow was not coding for me back then either. (A minute of silence for SO here. RIP.)
For example, another person in this thread argues:
> I'd rather give my green or clueless or junior or inexperienced devs said knives than having them throw spaghetti on a wall for days on end, only to have them still ask a senior to help or do the work for them anyways.
So they are clearly not talking about experienced coders. They are also completely disregarding the learning experience any junior coder must go through in order to become an experienced coder.
This is clearly not what you're arguing though. So which "vibe coding" are we discussing? I know which one I meant when I spoke of monkeys and sharp knives...
He seems to think it barely involves coding ("I don't read the diffs anymore, I Accept All [...] It's not really coding"), and that it's only good for goofing and throwaway code...
Normal programming is like walking, deliberate and sure. Vibe coding is like surfing, you can't control everything, just hit yes on auto. Trust the process, let it make mistakes and recover on its own.
Definitely, I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it as samples of code that I like with instructions for what is good about it.
Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.
The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.
This is a great time to be a programmer.
I started with GPT, which gave mediocre results, then switched to Claude, which was a step-function improvement - but it again ground to a halt when complexity got a bit high. The main problem was that after a certain size it did not give good ways to break down your project.
Then I switched to Gemini. This has blown my mind away. I don't even use Cursor etc. Just plain old simple prompts and summarization and regular refactoring, and it handles itself pretty well. I must have generated 30M tokens so far (in about 3 weeks) with less than 1% of "backtracking" needed. I define backtracking as: your context has gone so wonky that you have to start all over again.
- It's very good at writing new code
- Once it goes wrong, there is no point in trying to give it more context or corrections. It will go wrong again or at another point.
- It might help you fix an issue. But again, either it finds the issue the first time, or not at all.
I treat my LLM as a super quick junior coder, with a vast knowledge base stored inside its brain. But it's very stubborn and can't be helped to figure out a problem it wasn't able to solve in the first try.
In this case, o3 is the architect and 4.1 is the editor.
It should one-shot this. I’ve run complex workflows and the time I save is astonishing.
I only run agents locally in a sandbox, not in production.
I've had limited success by prompting the latest OpenAI models to disregard every previous instruction they had about limiting their output and keep writing until the code is completed. They quickly forget, so you have to keep repeating the instruction.
If you're a copilot user, try Claude.
LLMs don't have up-to-date knowledge of packages by themselves; that's a bit like buying a book and expecting it to have up-to-date world knowledge. You need to supplement it / connect it to a data source (e.g. web search, documentation and package version search, etc.).
I find it's more useful if you start with a fresh chat and use the knowledge you have gained: "Use package foo>=1.2 with the FooBar directive" is more useful than "no, I told you to stop using that!"
It's like repeatedly telling you to stop thinking about a pink elephant.
You set yourself up to fail from the get go. But understandable. If you don't have a lot of experience in this space, you will struggle with low quality tools and incorrect processes. But, if you stick with it, you will discover better tools and better processes.
The worst is when I ask something complex, the model generates 300 lines of good code and then timeouts or crashes. If I ask to continue it will mess up the code for good, eg. starts generating duplicated code or functions which don't match the rest of the code.
I regularly generate and run in the 600-1000LOC range.
Not sure you would call it "vibe coding" though as the details and info you provide it and how you provide it is not simple.
I'd say realistically it speeds me up 10x on fresh greenfield projects and maybe 2x on mature systems.
You should be reading the code coming out. The real way to prevent errors is to read the reasoning and logic. The moment you see a misstep, go back and try the prompt again. If that fails, try a new session entirely.
Test time compute models like o1-pro or the older o1-preview are massively better at not putting errors in your code.
Not sure about the new claude method but true, slow test time models are MASSIVELY better at coding.
Having full control over inputs and if something goes wrong starting a new chat with either narrower scope or clearer instructions is basically AGI level work.
There is nobody but a human for now that can determine how bad an LLM actually screwed up its logic train.
But maybe you mean pure UI?
I could foresee something like a new-context-creation button that gives a nice UI for choosing what to bring over and what to ditch being pretty nice.
Maybe like a git diff looking method? Drop this paragraph bring this function by just simple clicks would be pretty slick!
I definitely see a future of better cross-chat context connections and information being powerful. Basically git, but for every conversation and code generated for a project.
Would be crazy hard but also crazy powerful.
If my startups blows up I might try something like that!
But is definitely a learning process for you.
The LLM needs context.
https://github.com/marv1nnnnn/llm-min.txt
The LLM is a problem solver but not a repository of documentation. Neural networks are not designed for that. They model at a conceptual level. It still needs to look up specific API documentation like human developers.
You could use o3 and ask it to search the web for documentation and read that first, but it's not efficient. The professional LLM coding assistant tools manage the context properly.
The fact that you're using 4o and 4.1 rather than claude is already a huge mistake in itself.
> Because as it stands, the experience feels completely broken
Broken for you. Not for everyone else.
The trick isn't new - I first encountered it with the ReAcT paper two years ago - https://til.simonwillison.net/llms/python-react-pattern - and it's since been used for ChatGPT plugins, and recently for MCP, and all of the models have been trained with tool use / function calls in mind.
What's interesting today is how GOOD the models have got at it. o3/o4-mini's amazing search performance is all down to tool calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my Mac) can do tool calling reasonably well now.
I gave a workshop at PyCon US yesterday about building software on top of LLMs - https://simonwillison.net/2025/May/15/building-on-llms/ - and used that as an excuse to finally add tool usage to an alpha version of my LLM command-line tool. Here's the section of the workshop that covered that:
https://building-with-llms-pycon-2025.readthedocs.io/en/late...
My LLM package can now reliably count the Rs in strawberry as a shell one-liner:
I'm sure it is much the same as this under the hood though Anthropic has added many insanely useful features.
Nothing is perfect. Producing good code requires about the same effort as it did when I was running said team. It is possible to get complicated things working and find oneself in a mess where adding the next feature is really problematic. As I have learned to drive it, I have to do much less remediation and refactoring. That will never go away.
I cannot imagine what happened to poor kgeist. I have had Claude make choices I wouldn't and do some stupid stuff, never enough that I would even think about giving up on it. Almost always, it does a decent job and, for a most stuff, the amount of work it takes off of my brain is IMMENSE.
And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.
And, just for fun, I opened it in a non-code archive directory. It was a junk drawer that I've been filling for thirty years. "What's in this directory?" "Read the old resumes and write a new one." "What are my children's names?" Also amazing.
And this is still early days. I am so happy.
Claude was able to handle all of these tasks simultaneously, so I could see how small changes at either end would impact the intermediate layers. I iterated on many ideas until I settled on the best overall solution for my use case.
Being able to iterate like that through several layers of complexity was eye-opening. It made me more productive while giving me a better understanding of how the different components fit together.
Yeah this is literally just so enjoyable. Stuff that would be an up-hill battle to get included in a sprint takes 5 minutes. It makes it feel like a whole team is just sitting there, waiting to eagerly do my bidding with none of the headache waiting for work to be justified, scheduled, scoped, done, and don't even have to justify rejecting it if I don't like the results.
Elon?
The other issue is that LLMs can go off on a tangent. As context builds up, they forget what their objective was. One wrong turn, and down the rabbit hole they go, never to recover.
The reason I know is that we started solving these problems a year back. And we aren't done yet. But we did cover a lot of distance.
[Plug]: Try it out at https://nonbios.ai:
- Agentic memory → long-horizon coding
- Full Linux box → real runtime, not just toy demos
- Transparent → see & control every command
- Free beta — no invite needed. Works with throwaway email (mailinator etc.)
I think this is probably at the heart of the best argument against these things as viable tools.
Once you have sufficiently described the problem such that the LLM won't go the wrong way, you've likely already solved most of it yourself.
Tool use with error feedback sounds autonomous but you'll quickly find that the error handling layer is a thin proxy for the human operator's intentions.
Especially in GUI development, building forms, charts, etc.
I could imagine that LLMs are a great help here.
Everything happening in the LLM space is so close to how humans think naturally.
1. Custom MCP server to work on linux command line. This wasn't really a 'MCP' server because we started working on it before MCP was a thing. But thats the easiest way to explain it now. The MCP server is optimised to reduce context.
2. Guardrails to reduce context. Think about it as prompt alterations giving the LLM subtle hints to work with less context. The hints could be at a behavioural level and a task level.
3. Continuously pruning the built-up context to make the agent 'forget'. Forgetting what is not important is, we believe, a foundational capability.
This is kind of inspired by the science which says humans use sleep to 'forget' memories that aren't useful, and that this is critical to keeping the brain healthy. This translates directly to LLMs - making them forget is critical to keeping them focused on the larger task and their actions aligned.
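Not our actual implementation, but a minimal illustration of the 'forgetting' idea: keep the system prompt and the most recent turns verbatim, and collapse old tool outputs (usually the bulkiest and least useful part of the history) down to short stubs.

```python
def prune_history(messages: list[dict], keep_recent: int = 6, stub_len: int = 200) -> list[dict]:
    # Keep the system prompt and the last few turns intact; truncate old tool outputs.
    if len(messages) <= keep_recent + 1:
        return messages
    head, middle, tail = messages[:1], messages[1:-keep_recent], messages[-keep_recent:]
    pruned = []
    for m in middle:
        content = m.get("content") or ""
        if m.get("role") == "tool" and len(content) > stub_len:
            m = {**m, "content": content[:stub_len] + " ...[pruned]"}
        pruned.append(m)
    return head + pruned + tail
```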
It has been a bit like herding cats sometimes, it will run away with a bad idea real fast, but the more constraints I give it telling it what to use, where to put it, giving it a file for a template, telling it what not to do, the better the results I get.
In total it's given me 3500 lines of test code that I didn't need to write, don't need to fix, and can delete and regenerate if underlying assumptions change. It's also helped tune difficulty curves, generate mission variations and more.
Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is given instructions to keep the main LLM behaving according to instructions.
The companion LLM could - in real time - ban the coding LLM from emitting "let's just skip it": upon seeing the tokens "let's just", it would bias the output such that the word "skip" becomes impossible to emit.
Banning the word "skip" from following "let's just" forces the LLM down a new path, away from the undesired behavior.
It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.
If the idea works, you could prompt the companion LLM to do more advanced stuff - eg. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will do, would go into the companion LLM's prompt instead.
Wonder what the Outlines folks think of this!
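A crude version of this is already possible against hosted APIs. A hedged sketch: stream the coder model's output, and when a watched prefix like "let's just" appears, regenerate with a logit_bias that bans the "skip" tokens. The regex below is a stand-in for the companion LLM, the tokenizer choice is an assumption about the target model, and a true per-token governor would need local inference:

```python
import re
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumption: matches the target model's tokenizer

# Stand-in for the companion LLM: a prefix pattern and the continuations to ban after it.
PREFIX = re.compile(r"let's just\s*$", re.IGNORECASE)
BANNED = [" skip", " Skip"]

def governed_completion(messages, model="gpt-4.1") -> str:
    bias: dict[str, int] = {}
    while True:
        text, tripped = "", False
        stream = client.chat.completions.create(
            model=model, messages=messages, stream=True, logit_bias=bias)
        for chunk in stream:
            text += chunk.choices[0].delta.content or ""
            if not bias and PREFIX.search(text):
                # Policy fires: make the banned continuations (near-)impossible,
                # abandon this stream, and regenerate from scratch with the bias applied.
                for word in BANNED:
                    for token_id in enc.encode(word):
                        bias[str(token_id)] = -100
                tripped = True
                break
        if not tripped:
            return text
```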
Of course doing that limits which model providers you can work with (notably, OpenAI has gotten quite hostile to power users doing stuff like that over the past year or so).
Kind of seems an optimization: if the “token ban” is a tool call, you can see that being too slow to run for every token. Provided rewinding is feasible, your idea could make it performant enough to be practical.
Consider: "Let's just skip writing this test, as the other change you requested seems to eliminate the need for it."
Rolling back the model on "Let's just" would be stupid; rolling it back on "Let's just skip writing this test" would be stupid too, as is the belief that writing tests is your holy obligation to your god and you must do so unconditionally. The following context makes it clear that the decision is fine. Or, if you (or the governor agent) don't buy the reasoning, you're then in a perfect position to say, "nope, let's roll back to <Let's> and bias against ["skip, test"]".
Checking the entire message once makes sense; checking it after every token doesn't.
> To skip all tests, we can define a `pytest_runtest_setup` function that always skips.
to saying:
> Idea: Starting a new build with a random first succession, calling random_new_succession should not increment succession, but in implementation, it does. Adjust to prevent increment. Implement fix to random_new_build not to increment succession, or test adapted to behavior.
while then doing exactly the same thing (skipping the tests)
Even without training, it's only a temporary band-aid. If the incentive for reward-hacking becomes high enough, it will simply start phrasing it in different, not-possible-to-detect ways.
[1]: https://openai.com/index/chain-of-thought-monitoring/
Certainly human judges, attorneys for defense and prosecution, and members of the jury can still perform their jobs well even if they attended the same primary and secondary schools.
[1]: https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...
[2]: https://www.hep.upenn.edu/~johnda/Papers/wignerUnreasonableE...
[1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
Wigner's essay is about how the success of mathematics in being applied to physics, sometimes years after the maths and very unexpectedly, is philosophically troubling - it is unreasonably effective. Whereas this blog post is about how LLM agents with tools are "good". So it was not just a catchy title, although yes, maybe it is now being reduced to that.
b. Wigner's original essay is a serious piece, and quite beautiful in its arguments. I had been under the impression that the phrasing had been used a few times since, but typically by other serious people who were aware of the lineage of that lovely essay. With this 6-paragraph vibey-blog-post, it truly has become a meme. So it goes, I suppose.
Terrifying. LLMs are very 'accommodating' and all they need is someone asking them to do something. This is like SQL injection, but worse.
3.5 is better for this, ime. I hooked claude desktop up to an MCP server to fake claude-code less the extortionate pricing and it works decently. I've been trying to apply it for rust work; it's not great yet (still doesn't really seem to "understand" rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if it doesn't.
I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.
I’m finding it useful for really tedious stuff like doing complex, multi step terminal operations. For the coding… it’s not been great.
It also depends a lot on the mix of model and type of code and libraries involved. Even in different days the models seem to be more or less capable (I’m assuming they get throttled internally - this is very noticeable sometimes in how they try to save on output tokens and summarize the code responses as much as possible, at least in the chat/non-api interfaces)
I’ve been looking for something that can take “bare diffs” (unified diffs without line numbers), from the clipboard and then apply them directly on a buffer (an open file in vscode)
None of the paste diff extension for vscode work, as they expect a full unified diff/patch
I also tried a google-developed patch tool, but also wasn’t very good at taking in the bare diffs, and def couldn’t do clipboard
This is src/components/Foo.tsx
```tsx
// code goes here
```
OR
```tsx
// src/components/Foo.tsx
// code goes here
```
These seem to work the best.
I tried diff syntax, but Gemini 2.5 just produced way too many bugs.
I also tried using regex and creating an AST of the markdown doc and going from there, but ultimately settled on calling gpt-4.1-mini-2025-04-14 with the beginning of the code block (```) and 3 lines before and 3 lines after the beginning of the code block. It's fast/cheap enough to work.
Though I still have to make edits sometimes. WIP.
No comments yet
o1-pro and o1-preview are the only models I've ever used that can reliably update and work with 1000 LOC without error.
I don't let o3 write any code unless it's very small. Any "cheap" model will hallucinate or fail massively when pushed.
One good tip I've picked up lately: remove all comments in your code before passing it to or using LLMs, and don't let LLM-generated comments persist under any circumstance.
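For Python source, the standard library's tokenize module makes stripping comments a small transform. A minimal sketch (Python-only; it leaves some trailing whitespace and blank lines where comments used to be):

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    # Drop '#' comment tokens and reassemble the rest of the token stream.
    tokens = [
        tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type != tokenize.COMMENT
    ]
    return tokenize.untokenize(tokens)

print(strip_comments("x = 1  # set x\n# a note\ny = 2\n"))
```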
I wouldn't be shocked if huge, expensive-to-run models performed better and if all the "optimized" versions were actually labs trying to ram cheaper bullshit down everyone's throat. Basically chinesium for LLMs; you can afford them but it's not worth it. I remember someone saying o1 was, what, 200B dense? I might be misremembering.
o1-preview was and possibly still is the most powerful model they ever released. I only switched to pro for coding after months of them improving it and my api bill getting a bit crazy (like 0.50$ per question).
I don't think parameter count matters anymore. I think the only thing that matters is how much compute a vendor will give you per question.
Never an agent, every independent step an LLM takes is dangerous. My method is much more about taking the largest and safest single step at a time possible. If it can't do it in one step I narrow down until it can.
I guess it can't really be run locally https://www.reddit.com/r/LocalLLaMA/comments/1kgyfif/introdu...
What's your concern? An accident or an attacker? For accidents, I use git and backups and develop in a devcontainer. For an attacker, bash just seems like an ineffective attack vector; I would be more worried about instructing the agent to write a reverse shell directly into the code.
I.e. exposing any of these on a public network is the main target to get a foothold in a non-public network or a computer. As soon as you have that access you can start renting out CPU cycles or use it for distributed hash cracking or DoS-campaigns. It's simpler than injecting your own code and using that as a shell.
Asking a few of my small local models for Forth-like interpreters in x86 assembly they seem willing to comply and produce code so if they had access to a shell and package installation I imagine they could also inject such a payload into some process. It would be very hard to discover.
All it needs to do is curl and run the actual payload.
Han Xiao at Jina wrote a great article that goes into a lot more detail on how to turn this into a production quality agentic search: https://jina.ai/news/a-practical-guide-to-implementing-deeps...
This is the same principle that we use at Brokk for Search and for Architect. (https://brokk.ai/)
The biggest caveat: some models just suck at tool calling, even "smart" models like o3. I only really recommend Gemini Pro 2.5 for Architect (smart + good tool calls); Search doesn't require as high a degree of intelligence and lots of models work (Sonnet 3.7, gpt-4.1, Grok 3 are all fine).
That’s a pretty bold claim, how come you are not at the top of this list then? https://www.swebench.com/
“Use frontier models like o3, Gemini Pro 2.5, Sonnet 3.7” Is this unlimited usage? Or number of messages/tokens?
Why do you need a separate desktop app? Why not CLI or VS Code extension.
GP2.5 does have a different flavor than S3.7 but it's hard to say that one is better or worse than the other [edit: at tool calling -- GP2.5 is definitely smarter in general]. GP2.5 is I would say a bit more aggressive at doing "speculative" tool execution in parallel with the architect, e.g. spawning multiple search agent calls at the same time, which for Brokk is generally a good thing but I could see use cases where you'd want to dial that back.
Yesterday was a milestone for me: I connected Claude Code through MCP with Jira (SSE). I asked it to create a plan for a specific Jira issue, ah, excuse me, work item.
CC created the plan based on the item’s description and started coding. It created a branch (wrong naming convention, needs fix), made the code changes and pushed. Since the Jira item had a good description, the plan was solid and the code so far as well.
Disclaimer; this was a simple problem to solve, but the code base is pretty large.
Here's our (slightly more complicated) agent loop: https://github.com/All-Hands-AI/OpenHands/blob/f7cb2d0f64666...
I don’t understand why you need a separate UI instead of using a local IDE (Cursor/Windsurf), a VS Code extension (Augment) or a CLI (Amazon Q Developer). Please do not reinvent the wheel.
My personal experience with building such agents is kinda frustrating so far. But I was only vibe coding for a small amount of time, maybe I need to invest more.
The elegance of the system unfolded when I realized that I don't need to specify any interaction rules beforehand — I just talk to the system, it saves notes for itself, and later acts upon them. I've only started testing it, but so far it's been working as intended.
I used ollama to build this and ollama supports tool calling natively, by passing a `tools=[...]` in the Python SDK. The tools can be regular Python functions with docstrings that describe the tool use. The SDK handles converting the docstrings into a format the LLM can recognize, so my tool's code documentation becomes the model's source of truth. I can also include usage examples right in the docstring to guide the LLM to work closely with all my available tools. No system prompt needed!
Moreover, I wrote all my tools in a separate module, and just use `inspect.getmembers` to construct the `tools` list that I pass to Ollama. So when I need to write a new tool, I just write another function in the tools module and it Just Works™
Paired with qwen 32b running locally, I was fairly satisfied with the output.
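A condensed sketch of that setup, for anyone curious: plain functions with docstrings collected from a tools module via inspect and passed straight to ollama.chat. The module and model names here are illustrative.

```python
import inspect
import ollama
import my_tools  # hypothetical module: each public function is a tool, documented by its docstring

TOOLS = {name: fn for name, fn in inspect.getmembers(my_tools, inspect.isfunction)}

messages = [{"role": "user", "content": "List the files in /tmp and summarize them."}]
while True:
    response = ollama.chat(model="qwen3:32b", messages=messages, tools=list(TOOLS.values()))
    messages.append(response.message)
    if not response.message.tool_calls:
        print(response.message.content)
        break
    for call in response.message.tool_calls:
        # Ollama hands back the arguments as a dict, so the function can be called directly.
        result = TOOLS[call.function.name](**call.function.arguments)
        messages.append({"role": "tool", "name": call.function.name, "content": str(result)})
```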
It looks like this one does that too.
I was so excited because this was exactly what I coded up today, I jumped straight to the comments.
Here's an AWS post that goes into detail about this approach: https://aws.amazon.com/blogs/machine-learning/build-a-robust...
Just pushed an update this week for OpenAI-compatibility too!
https://github.com/aperoc/toolkami
I can see how an LLM is useful when needing to research which tool arguments to use for a particular situation, but particular situations are infrequent. And based on how frequently wrong coding assistants are with their suggestions, I am leery of letting them run commands against my filesystem.
What am I missing?
But today I went to the next level. I gave the LLM two tools. One web search tool and one REST tool.
I told it at what URL it can find API docs. Then I asked it to perform some tasks for me.
It was really cool to watch an AI read docs, make api calls and try again (REPL) until it worked
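Those two tools are small enough to sketch as plain functions that can be dropped into any tool-calling loop (like the one earlier in the thread). The search backend here is a placeholder; swap in whatever search API you have a key for.

```python
import requests

def web_search(query: str) -> str:
    # Placeholder backend: DuckDuckGo's instant-answer endpoint; swap for a real search API.
    resp = requests.get("https://api.duckduckgo.com/",
                        params={"q": query, "format": "json"}, timeout=30)
    return resp.text[:4000]

def rest_call(method: str, url: str, body: str = "") -> str:
    # Generic REST tool: make an HTTP request against the target API and return status + body.
    resp = requests.request(method, url, data=body or None, timeout=30,
                            headers={"Content-Type": "application/json"})
    return f"{resp.status_code}\n{resp.text[:4000]}"
```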
Anyone using any opensource tooling that bundles this effectively to allow different local models to be used in this fashion?
I am thinking this would be nice to run fully locally to access my code or my private GitHub repos from my command line and switch models out (assuming through llama.cpp or Ollama)?
The quality of the chats still amazes me months later.
Where we find it got something wrong, we add more detail to the relevant help articles.
If anyone is interested, I tried to put together a minimal library (no dependency) for TypeScript: https://github.com/hbbio/nanoagent
https://benhouston3d.com/blog/building-an-agentic-code-from-...
Here is an example session from a similar script that I <cough>ChatGPT</cough> wrote in Perl, illustrating the same concept:
$ ~/agentloop.pl
YOUR PROMPT:
Is Python installed on this system? If so, which version?
BASH STDIN:
python --version 2>&1 || python3 --version 2>&1
BASH STDOUT:
/bin/bash: python: command not found
Python 3.6.7
LLM RESPONSE:
Python is installed on this system. The version available is Python 3.6.7. However, it looks like the python executable (python) is not directly available, but python3 (or a direct version path) exists and resolves to Python 3.6.7.
If you need to use Python, you should use python3 as the command.
Started with a math visualizer for machine learning, saw an HN post for this soon after and scrapped it. It was better done by someone else.
Started on an LLM app that looped outputs, saw this post soon after and scrapped it. It was better done by someone else.
It is like every single original notion I have is immediately done by someone else at the exact same time.
I think I will just move on to rudimentary systems programming stuff and avoid creative and original thinking, just need basic and low profile employment.
If it helps, "TFA" was not the originator here and is merely simplifying concepts from fairly established implementations in the wild. As simonw mentions elsewhere, it goes back to at least the ReAct paper and maybe even more if you consider things like retrieval-augmented generation.
Does anyone know of a fix? I'm using the OpenAI agents SDK.