The Unreasonable Effectiveness of an LLM Agent Loop with Tool Use

35 points by crawshaw | 5/15/2025, 7:33:44 PM | sketch.dev

Comments (16)

kgeist · 16m ago
Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

theropost · 35s ago
150 lines? I find it can quickly scale to around 1,500 lines; past that, I start working with more precision on the specific classes and functions I'm looking to modify.
abiraja · 58s ago
GPT-4o and 4.1 are definitely not the best models to use here. Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work really well for small files.
visarga · 6m ago
You should try Cursor or Windsurf, with a Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until the tests pass.

Normal programming is like walking, deliberate and sure. Vibe coding is like surfing, you can't control everything, just hit yes on auto. Trust the process, let it make mistakes and recover on its own.

85392_school · 10m ago
Agents definitely fix this. When you can run commands and edit files, the agent can test its code by itself and fix any issues.
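Concretely, "run commands and edit files" just means a couple of tool schemas plus a dispatcher on your side. An illustrative sketch in Python (Anthropic-style JSON schemas; the tool names are made up):

```python
import pathlib
import subprocess

# Illustrative tool definitions in the JSON-schema style that
# tool-calling APIs expect; "run_command" and "edit_file" are made-up names.
TOOLS = [
    {
        "name": "run_command",
        "description": "Run a shell command and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "edit_file",
        "description": "Overwrite a file with new contents.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "contents": {"type": "string"},
            },
            "required": ["path", "contents"],
        },
    },
]

def dispatch(name: str, args: dict) -> str:
    # Execute the tool the model asked for and return its output as text.
    if name == "run_command":
        p = subprocess.run(args["command"], shell=True,
                           capture_output=True, text=True, timeout=120)
        return p.stdout + p.stderr
    if name == "edit_file":
        pathlib.Path(args["path"]).write_text(args["contents"])
        return "ok"
    return f"unknown tool: {name}"
```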
vFunct · 13m ago
Use Claude Sonnet with an IDE.
libraryofbabel · 8m ago
Strongly recommend this blog post too, which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from scratch: https://ampcode.com/how-to-build-an-agent?utm_source=substac...

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.
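If you want a sense of how little code it takes, here is roughly the whole trick as a sketch, assuming the Anthropic Python SDK; the model name, prompt, and single bash tool are placeholders:

```python
import subprocess
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the env

# One tool is enough to demo the loop: let the model run shell commands.
TOOLS = [{
    "name": "bash",
    "description": "Run a shell command and return its combined output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "bash":
        p = subprocess.run(args["command"], shell=True,
                           capture_output=True, text=True)
        return p.stdout + p.stderr
    return f"unknown tool: {name}"

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Run the tests and fix any failures."}]

while True:
    resp = client.messages.create(
        model="claude-3-7-sonnet-latest",  # any tool-calling model works
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break  # no tool calls left; the agent considers itself done
    # Execute each requested tool call and feed the results back in.
    messages.append({
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ],
    })

print("".join(b.text for b in resp.content if b.type == "text"))
```

That's it: a while loop, one API call, and a dispatcher. Everything else in the commercial tools is prompt engineering, UI, and safety rails.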

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ

jbellis · 17m ago
Yes!

Han Xiao at Jina wrote a great article that goes into a lot more detail on how to turn this into a production quality agentic search: https://jina.ai/news/a-practical-guide-to-implementing-deeps...

This is the same principle that we use at Brokk for Search and for Architect. (https://brokk.ai/)

The biggest caveat: some models just suck at tool calling, even "smart" models like o3. I only really recommend Gemini Pro 2.5 for Architect (smart + good tool calls); Search doesn't require as high a degree of intelligence and lots of models work (Sonnet 3.7, gpt-4.1, Grok 3 are all fine).

crawshaw · 8m ago
I'm curious about your experiences with Gemini Pro 2.5 tool calling. I have tried using it in agent loops (in fact, sketch has some rudimentary support I added), and compared with the Anthropic models I've had to actively re-prompt Gemini regularly to make tool calls. Do you consider it equivalent to Sonnet 3.7? Has it required some prompt engineering?
jbellis · 2m ago
Confession time: litellm still doesn't support parallel tool calls with Gemini models [https://github.com/BerriAI/litellm/issues/9686] so we wrote our own "parallel tool calls" on top of Structured Output. It did take a few iterations on the prompt design but all of it was "yeah I can see why that was ambiguous" kinds of things, no real complaints.
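For anyone curious, the workaround looks roughly like this (illustrative only, not our actual code): prompt the model via Structured Output to return JSON matching a schema that lists the calls it wants, then fan them out yourself. `dispatch` here stands in for whatever executes one tool call:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Schema for the model's structured output: a list of tool calls
# it wants executed "in parallel".
CALLS_SCHEMA = {
    "type": "object",
    "properties": {
        "calls": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "tool": {"type": "string"},
                    "args": {"type": "object"},
                },
                "required": ["tool", "args"],
            },
        }
    },
    "required": ["calls"],
}

def run_parallel(model_json: str, dispatch) -> list[str]:
    # Parse the structured output, then execute every requested
    # call concurrently and collect the results in order.
    calls = json.loads(model_json)["calls"]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(dispatch, c["tool"], c["args"]) for c in calls]
        return [f.result() for f in futures]
```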

GP2.5 does have a different flavor than S3.7, but it's hard to say that one is better or worse than the other. GP2.5 is, I would say, a bit more aggressive about doing "speculative" tool execution in parallel with the architect, e.g. spawning multiple search agent calls at the same time, which for Brokk is generally a good thing, but I could see use cases where you'd want to dial that back.

_bin_ · 44m ago
I've found sonnet-3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off-track and run off and do weird things.

3.5 is better for this, ime. I hooked Claude Desktop up to an MCP server to fake claude-code minus the extortionate pricing, and it works decently. I've been trying to apply it to Rust work; it's not great yet (still doesn't really seem to "understand" Rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if the check fails.
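The check itself is trivial to wire in as a gate, something like this (sketch only; the agent loop around it is assumed):

```python
import subprocess
import sys

def gate() -> bool:
    # Run `cargo check` after each model edit; veto the change on failure
    # so the agent never keeps editing on top of code that no longer compiles.
    p = subprocess.run(["cargo", "check"], capture_output=True, text=True)
    if p.returncode != 0:
        print(p.stderr, file=sys.stderr)  # or feed this back to the model
    return p.returncode == 0
```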

I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.

agilebyte · 38m ago
I am avoiding the cost of API access by using the chat UI instead, in my case Google Gemini 2.5 Pro with the large context window. I Repomix a whole repo, paste it in with a standard prompt saying "return full source" (it tends to not follow this instruction after a few back-and-forths), and then apply the result back on top of the repo (I vibe-coded https://github.com/radekstepan/apply-llm-changes to help me with that). Otherwise, yeah: $5 spent on Cline with Claude 3.7 and, instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.
nico · 1m ago
Cool tool. What format does it expect from the model?

I’ve been looking for something that can take "bare diffs" (unified diffs without line numbers) from the clipboard and apply them directly to a buffer (an open file in VS Code)

None of the paste-diff extensions for VS Code work, as they expect a full unified diff/patch

I also tried a Google-developed patch tool, but it also wasn't very good at taking in bare diffs, and definitely couldn't read from the clipboard
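Roughly, what I'm after is something that anchors on the context lines instead of line numbers, like this sketch (hypothetical helper; assumes each hunk matches exactly one spot in the file):

```python
def apply_bare_hunk(source: str, hunk: str) -> str:
    # Apply one "bare" hunk (no @@ line numbers) by matching context lines.
    src = source.splitlines()
    old, new = [], []
    for line in hunk.splitlines():
        if line.startswith("@@"):  # header carries no numbers; nothing to use
            continue
        if line.startswith("+"):
            new.append(line[1:])
        elif line.startswith("-"):
            old.append(line[1:])
        else:  # context line, present on both sides
            text = line[1:] if line.startswith(" ") else line
            old.append(text)
            new.append(text)
    # The old block must match exactly one location in the file.
    hits = [i for i in range(len(src) - len(old) + 1)
            if src[i:i + len(old)] == old]
    if len(hits) != 1:
        raise ValueError(f"hunk matched {len(hits)} locations, expected 1")
    i = hits[0]
    return "\n".join(src[:i] + new + src[i + len(old):]) + "\n"
```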

actsasbuffoon · 7m ago
I decided to experiment with Claude Code this month. The other day it decided the best way to fix the spec was to add a conditional to the test that causes it to return true before getting to the thing that was actually supposed to be tested.

I’m finding it useful for really tedious stuff like doing complex, multi step terminal operations. For the coding… it’s not been great.

harvey9 · 12m ago
Guess it was trained by scraping thedailywtf.com
layoric · 13m ago
I've been using Mistral Medium 3 for the last couple of days, and I'm honestly surprised at how good it is. Highly recommend giving it a try if you haven't, especially if you are trying to reduce costs. I've basically switched from Claude to Mistral and honestly prefer it even if costs were equal.