The unreasonable effectiveness of an LLM agent loop with tool use

170 points by crawshaw on 5/15/2025, 7:33:44 PM | sketch.dev ↗ | 96 comments

Comments (96)

libraryofbabel · 3h ago
Strongly recommend this blog post too which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from zero: https://ampcode.com/how-to-build-an-agent

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
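
To make that "simple core" concrete, here is a minimal sketch of such a loop (my own illustration, not the article's code), assuming the OpenAI Python SDK and a single hypothetical run_bash tool; real agents add sandboxing, error handling, and context management on top:

```python
# Minimal agent loop: ask the model, execute any tool calls it requests,
# feed the results back, and repeat until it answers in plain text.
import json, subprocess
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_bash",  # hypothetical tool for illustration
        "description": "Run a bash command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_bash(command: str) -> str:
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

def agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:  # no more tool calls: the model is done
            return msg.content
        for call in msg.tool_calls:  # run each requested tool and feed the result back
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_bash(**args),
            })
```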

wepple · 1h ago
Ah, it’s Thorsten Ball!

I thoroughly enjoyed his “writing an interpreter”. I guess I’m going to build an agent now.

meander_water · 1h ago
There's also this one which uses pocketflow, a graph abstraction library to create something similar [0]. I've been using it myself and love the simplicity of it.

[0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...

kcorbitt · 2h ago
For "that last 10% of reliability" RL is actually working pretty well right now too! https://openpipe.ai/blog/art-e-mail-agent
aibrother · 1h ago
thanks for the rec. and yeah agreed with the observations as well
sesm · 2h ago
Should we change the link above to use `?utm_source=hn&utm_medium=browser` before opening it?
libraryofbabel · 2h ago
fixed :)
kgeist · 3h ago
Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

voidspark · 6m ago
The default chat interface is the wrong tool for the job.

The LLM needs context.

https://github.com/marv1nnnnn/llm-min.txt

The LLM is a problem solver but not a repository of documentation. Neural networks are not designed for that. They model at a conceptual level. It still needs to look up specific API documentation, just like human developers do.

You could use o3 and ask it to search the web for documentation and read that first, but it's not efficient. The professional LLM coding assistant tools manage the context properly.

simonw · 2h ago
"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

kgeist · 2h ago
>That's because models have training cut-off dates

When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

Thanks for the tip!

jmcpheron · 2h ago
>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

fragmede · 2h ago
There's still skill involved with using the LLM in coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.
thorum · 2h ago
GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

ebiester · 2h ago
I get that it's frustrating to be told "skill issue," but using an LLM is absolutely a skill and there's a combination of understanding the strengths of various tools, experimenting with them to understand the techniques, and just pure practice.

I think if I were giving access to bash, though, it would definitely be in a docker container for me as well.

wtetzner · 27m ago
Sure, you can probably get better at it, but is it really worth the effort over just getting better at programming?
kgeist · 3m ago
When I have to explain what I want, and review it, and correct it, in as many words as it would take me to write the code myself (+ good autocomplete), then it's the same programming as before but with extra steps. I can see why it can be good for generating boilerplate, but boilerplate can also be solved with good abstractions and libraries (code reuse), so you don't have to write everything over and over again. So far I've found that LLMs are good as a starting point, when you have no idea where to even start (not only for coding, but in general). Everything else... feels less productive so far.
cyral · 18m ago
You can do both
fsndz · 2h ago
It can be frustrating at times, but my experience is the more you try, the better you become at knowing what to ask and what to expect. But I guess you understand now why some people say vibe coding is a bit overrated: https://www.lycee.ai/blog/why-vibe-coding-is-overrated
the_af · 1h ago
"Overrated" is one way to call it.

Giving sharp knives to monkeys would be another.

danbmil99 · 1h ago
As others have noted, you sound about 3 months behind the leading edge. What you describe is like my experience from February.

Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool, cutting & pasting from the chat window is so last week.

abiraja · 3h ago
GPT4o and 4.1 are definitely not the best models to use here. Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work really well for small files.
codethief · 2h ago
The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After pointing out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks –, but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

nico · 3h ago
4o and 4.1 are not very good at coding

My best results are usually with o4-mini-high; o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output on the chat

And a lot of times I say: provide full code for this file, or provide drop-in replacement (when I don’t want to deal with all the diffs). But usually at around 300-400 lines of code, it starts getting bad and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file)

manmal · 2h ago
o3 is shockingly good actually. I can’t use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well researched and sensible overview. And gave me a new idea that I‘ll try.
kenjackson · 1h ago
I use o3 for anything math or coding related. 4o is good for things like, "my knee hurts when I do this and that -- what might it be?"
hnhn34 · 1h ago
Just in case you didn't know, they raised the rate limit from ~50/week to ~50/day a while ago
johnsmith1840 · 2h ago
Drop-in replacement files per update should be done with the heavy test-time-compute models.

o1-pro, o1-preview can generate updated full file responses into the 1k LOC range.

It's something about their internal verification methods that makes it an actually viable development method.

nico · 1h ago
True. Also, the APIs don't care too much about restricting output length; they might actually be more verbose to charge more

It's interesting how the same model being served through different interfaces (chat vs api), can behave differently based on the economic incentives of the providers

Jarwain · 3h ago
Aider's benchmarks show 4.1 (and 4o) work better in its architect mode, for planning the changes, and o3 for making the actual edits
SparkyMcUnicorn · 42m ago
You have that backwards. The leaderboard results have the thinking model as the architect.

In this case, o3 is the architect and 4.1 is the editor.

visarga · 3h ago
You should try Cursor or Windsurf, with Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until tests pass.

Normal programming is like walking, deliberate and sure. Vibe coding is like surfing, you can't control everything, just hit yes on auto. Trust the process, let it make mistakes and recover on its own.

prisenco · 2h ago
Given that analogy, surely you could understand why someone would much rather walk than surf to their destination? Especially people who are experienced marathon runners.
fragmede · 2h ago
If I tried standing up on the waves without a surfboard, and complain about how it's not working, would you blame the water or surfing for the issue, or the person trying to defy physics, complaining that it's not working? It doesn't matter how much I want to run or if I'm Kelvin Kiptum, I'm gonna have a bad time.
prisenco · 1h ago
That only makes sense when surfing is the only way to get to the destination and that's not the case.
fragmede · 55m ago
Say there are two ways to get to your destination. You still need to use the appropriate vehicle/surfboard for the route you've chosen to use. Even if there is a bridge you can run/walk across, if you try and surf across the water without a surfboard, and try to walk it, you're gonna have a bad time.
prisenco · 36m ago
Analogy feels a bit tortured at this point.
tqwhite · 1h ago
I find that writing a thorough design spec is really worth it. Also, asking for its reaction ("What's missing?", "Should I do X or Y?") does good things for its thought process, like engaging a younger programmer in the process.

Definitely, I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it as samples of code that I like with instructions for what is good about it.

Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.

The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.

This is a great time to be a programmer.

smcleod · 3h ago
GPT 4o and 4.1 are both pretty terrible for coding to be honest, try Sonnet 3.7 in Cline (VSCode extension).

LLMs don't have up-to-date knowledge of packages by themselves; that's a bit like buying a book and expecting it to have up-to-date world knowledge. You need to supplement it / connect it to a data source (e.g. web search, documentation and package version search, etc.).

85392_school · 3h ago
Agents definitely fix this. When you can run commands and edit files, the agent can test its code by itself and fix any issues.
LewisVerstappen · 2h ago
skill issue.

The fact that you're using 4o and 4.1 rather than claude is already a huge mistake in itself.

> Because as it stands, the experience feels completely broken

Broken for you. Not for everyone else.

theropost · 3h ago
150 lines? I find it can quickly scale to around 1500 lines, and then I start being more precise about the classes and functions I am looking to modify
jokethrowaway · 2h ago
It's completely broken for me over 400 lines (Claude 3.7, paid Cursor)

The worst is when I ask something complex, the model generates 300 lines of good code and then times out or crashes. If I ask it to continue, it will mess up the code for good, e.g. it starts generating duplicated code or functions which don't match the rest of the code.

tqwhite · 1h ago
Definitely a new skill to learn. Everyone I know that is having problems is just telling it what to do, not coaching it. It is not an automaton: instructions in, code out. Treat it like a team member that will do the work if you teach it right, and you will have much more success.

But it is definitely a learning process.

johnsmith1840 · 2h ago
It's a new skill that takes time to learn. When I started on gpt3.5 it took me easily 6 months of daily use before I was making real progress with it.

I regularly generate and run in the 600-1000LOC range.

Not sure you would call it "vibe coding" though as the details and info you provide it and how you provide it is not simple.

I'd say realistically it speeds me up 10x on fresh greenfield projects and maybe 2x on mature systems.

You should be reading the code coming out. The real way to prevent errors is to read the reasoning and logic. The moment you see a misstep, go back and try the prompt again. If that fails, try a new session entirely.

Test time compute models like o1-pro or the older o1-preview are massively better at not putting errors in your code.

Not sure about the new claude method but true, slow test time models are MASSIVELY better at coding.

koakuma-chan · 2h ago
Sounds like a Cursor issue
fragmede · 2h ago
what language?
koakuma-chan · 3h ago
You gotta use a reasoning model.
vFunct · 3h ago
Use Claude Sonnet with an IDE.
hollownobody · 3h ago
Try o3 please. Via UI.
fragmede · 2h ago
In this case, sorry to say but it sounds like there's a tooling issue, possibly also a skill issue. Of course you can just use the raw ChatGPT web interface but unless you seriously tune its system/user prompt, it's not going to match what good tooling (which sets custom prompts) will get you. Which is kind of counter-intuitive. A paragraph or three fed in as the system prompt is enough to influence behavior/performance so significantly? It turns out with LLMs the answer is yes.
benoau · 47m ago
This morning I used cursor to extract a few complex parts of my game prototype's "main loop", and then generate a suite of tests for those parts. In total I have 341 tests written by Cursor covering all the core math and other components.

It has been a bit like herding cats sometimes, it will run away with a bad idea real fast, but the more constraints I give it telling it what to use, where to put it, giving it a file for a template, telling it what not to do, the better the results I get.

In total it's given me 3500 lines of test code that I didn't need to write, don't need to fix, and can delete and regenerate if underlying assumptions change. It's also helped tune difficulty curves, generate mission variations and more.

tqwhite · 2h ago
I've been using Claude Code, ie, a terminal interface to Sonnet 3.7 since the day it came out in mid March. I have done substantial CLI apps, full stack web systems and a ton of utility crap. I am much more ambitious because of it, much as I was in the past when I was running a programming team.

I'm sure it is much the same as this under the hood though Anthropic has added many insanely useful features.

Nothing is perfect. Producing good code requires about the same effort as it did when I was running said team. It is possible to get complicated things working and find oneself in a mess where adding the next feature is really problematic. As I have learned to drive it, I have to do much less remediation and refactoring. That will never go away.

I cannot imagine what happened to poor kgeist. I have had Claude make choices I wouldn't and do some stupid stuff, but never enough that I would even think about giving up on it. Almost always, it does a decent job and, for most stuff, the amount of work it takes off of my brain is IMMENSE.

And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.

And, just for fun, I opened it in a non-code archive directory. It was a junk drawer that I've been filling for thirty years. "What's in this directory?" "Read the old resumes and write a new one." "What are my children's names?" Also amazing.

And this is still early days. I am so happy.

simonw · 2h ago
I'm very excited about tool use for LLMs at the moment.

The trick isn't new - I first encountered it with the ReAct paper two years ago - https://til.simonwillison.net/llms/python-react-pattern - and it's since been used for ChatGPT plugins, and recently for MCP, and all of the models have been trained with tool use / function calls in mind.

What's interesting today is how GOOD the models have got at it. o3/o4-mini's amazing search performance is all down to tool calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my Mac) can do tool calling reasonably well now.

I gave a workshop at PyCon US yesterday about building software on top of LLMs - https://simonwillison.net/2025/May/15/building-on-llms/ - and used that as an excuse to finally add tool usage to an alpha version of my LLM command-line tool. Here's the section of the workshop that covered that:

https://building-with-llms-pycon-2025.readthedocs.io/en/late...

My LLM package can now reliably count the Rs in strawberry as a shell one-liner:

  llm --functions '
  def count_char_in_string(char: str, string: str) -> int:
      """Count the number of times a character appears in a string."""
      return string.lower().count(char.lower())
  ' 'Count the number of Rs in the word strawberry' --td
DarmokJalad1701 · 2h ago
Was the workshop recorded?
simonw · 1h ago
No video or audio, just my handouts.
andrewmcwatters · 2h ago
I love the odd combination of silliness and power in this.
cadamsdotcom · 1h ago
> "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.

Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is given instructions to keep the main LLM behaving according to instructions.

The companion LLM could - in real time - ban the coding LLM from emitting "let's just skip it": on seeing the tokens "let's just", it biases the output so that the word "skip" becomes impossible to emit.

Banning the word "skip" from following "let's just" forces the LLM down a new path, away from the undesired behavior.

It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.

If the idea works, you could prompt the companion LLM to do more advanced stuff - eg. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will do, would go into the companion LLM's prompt instead.

Wonder what the Outlines folks think of this!

JoshuaDavid · 55m ago
Along these lines, if the main LLM goes down a bad path, you could _rewind_ the model to before it started going down the bad path -- the watcher LLM doesn't necessarily have to guess that "skip" is a bad token after the words "let's just", it could instead see "let's just skip the test" and go "nope, rolling back to the token "just " and rerolling with logit_bias={"skip":-10,"omit":-10,"hack":-10}".

Of course doing that limits which model providers you can work with (notably, OpenAI has gotten quite hostile to power users doing stuff like that over the past year or so).
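
A rough sketch of the detect-and-reroll idea (my illustration, not JoshuaDavid's code): since the standard chat API offers no true mid-generation rewind, this version simply retries the whole completion with the offending tokens down-weighted via logit_bias; token ids come from tiktoken, and the banned word list is a toy example:

```python
# Detect an undesired phrase in the output, then re-roll the completion with
# the offending tokens strongly suppressed via logit_bias.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")

BAD_PHRASE = "let's just skip"
BANNED_WORDS = ["skip", " skip", "omit", " omit", "hack", " hack"]

def ban(words):
    # Map every token of each banned word to a strong negative bias (-100 ~ banned).
    return {str(tok): -100 for w in words for tok in enc.encode(w)}

def guarded_completion(messages, max_retries=3):
    bias = {}
    text = ""
    for _ in range(max_retries):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, logit_bias=bias)
        text = resp.choices[0].message.content
        if BAD_PHRASE not in text.lower():
            return text
        bias = ban(BANNED_WORDS)  # re-roll with the undesired tokens suppressed
    return text
```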

cadamsdotcom · 10m ago
That’s a really neat idea.

Kind of seems an optimization: if the “token ban” is a tool call, you can see that being too slow to run for every token. Provided rewinding is feasible, your idea could make it performant enough to be practical.

panarky · 45m ago
If it works to run a second LLM to check the first LLM, then why couldn't a "mixture of experts" LLM dedicate one of its experts to checking the results of the others? Or why couldn't a test-time compute "thinking" model run a separate thinking thread that verifies its own output? And if that gets you 60% of the way there, then there could be yet another thinking thread that verifies the verifier, etc.
magicalhippo · 52m ago
Assuming the title is a play on the paper "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"[1][2] by Eugene Wigner.

[1]: https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

[2]: https://www.hep.upenn.edu/~johnda/Papers/wignerUnreasonableE...

gavmor · 25m ago
That may be its primogenitor, but it's long since become a meme: https://scholar.google.com/scholar?q=unreasonable+effectiven...
dsubburam · 25m ago
I didn't know of that paper, and thought the title was a riff on Karpathy's Unreasonable Effectiveness of RNNs in 2015[1]. Even if my thinking is correct, as it very well might be given the connection RNNs->LLMs, Karpathy might have himself made his title a play on Wigner's (though he doesn't say so).

[1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

throwaway314155 · 1m ago
Unreasonable effectiveness of [blah] has been a thing for decades if not centuries. It's not new.
hbbio · 18m ago
Yes, agent loops are simple, except, as the article says, a bit of "pump and circumstance"!

If anyone is interested, I tried to put together a minimal library (no dependency) for TypeScript: https://github.com/hbbio/nanoagent

outworlder · 1h ago
> If you don't have some tool installed, it'll install it.

Terrifying. LLMs are very 'accommodating' and all they need is someone asking them to do something. This is like SQL injection, but worse.

kuahyeow · 1h ago
What protection do people use when enabling an LLM to run `bash` on your machine ? Do you run it in a Docker container / LXC boundary ? `chroot` ?
CGamesPlay · 38m ago
The blog post in question is on the site for Sketch, which appears to use Docker containers. That said, I use Claude Code, which just uses unsandboxed commands with manual approval.

What's your concern? An accident or an attacker? For accidents, I use git and backups and develop in a devcontainer. For an attacker, bash just seems like an ineffective attack vector; I would be more worried about instructing the agent to write a reverse shell directly into the code.
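
For the accident case, a common pattern is to run the agent's bash tool inside a throwaway container. A minimal sketch (assuming Docker is installed; the agent-sandbox image name is a placeholder, and resource/network limits are up to you):

```python
# Run an agent-issued shell command inside a disposable Docker container
# instead of directly on the host.
import subprocess

def run_bash_sandboxed(command: str, workdir: str = ".") -> str:
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",           # no network access from inside the sandbox
         "--memory", "512m", "--cpus", "1",
         "-v", f"{workdir}:/work", "-w", "/work",
         "agent-sandbox",               # placeholder image with your toolchain installed
         "bash", "-lc", command],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout + result.stderr
```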

rbren · 2h ago
If you're interested in hacking on agent loops, come join us in the OpenHands community!

Here's our (slightly more complicated) agent loop: https://github.com/All-Hands-AI/OpenHands/blob/f7cb2d0f64666...

_bin_ · 4h ago
I've found sonnet-3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off-track and run off and do weird things.

3.5 is better for this, ime. I hooked claude desktop up to an MCP server to fake claude-code less the extortionate pricing and it works decently. I've been trying to apply it for rust work; it's not great yet (still doesn't really seem to "understand" rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if it doesn't.

I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.

agilebyte · 3h ago
I am avoiding the cost of API access by using the chat/ui instead, in my case Google Gemini 2.5 Pro with the high token window. Repomix a whole repo. Paste it in with a standard prompt saying "return full source" (it tends to not follow this instruction after a few back and forths) and then apply the result back on top of the repo (vibe coded https://github.com/radekstepan/apply-llm-changes to help me with that). Else yeah, $5 spent on Cline with Claude 3.7 and instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.
actsasbuffoon · 3h ago
I decided to experiment with Claude Code this month. The other day it decided the best way to fix the spec was to add a conditional to the test that causes it to return true before getting to the thing that was actually supposed to be tested.

I’m finding it useful for really tedious stuff like doing complex, multi step terminal operations. For the coding… it’s not been great.

nico · 3h ago
I’ve had this in different ways many times. Like instead of resolving the underlying issue for an exception, it just suggests catching the exception and keep going

It also depends a lot on the mix of model and type of code and libraries involved. Even in different days the models seem to be more or less capable (I’m assuming they get throttled internally - this is very noticeable sometimes in how they try to save on output tokens and summarize the code responses as much as possible, at least in the chat/non-api interfaces)

christophilus · 2h ago
Well, that’s proof that it used my GitHub projects in its training data.
nico · 3h ago
Cool tool. What format does it expect from the model?

I’ve been looking for something that can take “bare diffs” (unified diffs without line numbers), from the clipboard and then apply them directly on a buffer (an open file in vscode)

None of the paste diff extension for vscode work, as they expect a full unified diff/patch

I also tried a google-developed patch tool, but also wasn’t very good at taking in the bare diffs, and def couldn’t do clipboard

agilebyte · 3h ago
Markdown format with a comment saying what the file path is. So:

This is src/components/Foo.tsx

```tsx
// code goes here
```

OR

```tsx
// src/components/Foo.tsx
// code goes here
```

These seem to work the best.

I tried diff syntax, but Gemini 2.5 just produced way too many bugs.

I also tried using regex and creating an AST of the markdown doc and going from there, but ultimately settled on calling gpt-4.1-mini-2025-04-14 with the beginning of the code block (```) and 3 lines before and 3 lines after the beginning of the code block. It's fast/cheap enough to work.

Though I still have to make edits sometimes. WIP.


harvey9 · 3h ago
Guess it was trained by scraping thedailywtf.com
layoric · 3h ago
I've been using Mistral Medium 3 last couple of days, and I'm honestly surprised at how good it is. Highly recommend giving it a try if you haven't, especially if you are trying to reduce costs. I've basically switched from Claude to Mistral and honestly prefer it even if costs were equal.
nico · 3h ago
How are you running the model? Mistral’s api or some local version through ollama, or something else?
layoric · 13m ago
Through OpenRouter, medium 3 isn't open weights.
kyleee · 2h ago
Is mistral on open router?
johnsmith1840 · 2h ago
I seem to be alone in this but the only methods truly good at coding are slow heavy test time compute models.

o1-pro and o1-preview are the only models I've ever used that can reliably update and work with 1000 LOC without error.

I don't let o3 write any code unless it's very small. Any "cheap" model will hallucinate or fail massively when pushed.

One good tip I've done lately. Remove all comments in your code before passing or using LLMs, don't let LLM generated comments persist under any circumstance.

_bin_ · 1h ago
Interesting. I've never tested o1-pro because it's insanely expensive but preview seemed to do okay.

I wouldn't be shocked if huge, expensive-to-run models performed better and if all the "optimized" versions were actually labs trying to ram cheaper bullshit down everyone's throat. Basically chinesium for LLMs; you can afford them but it's not worth it. I remember someone saying o1 was, what, 200B dense? I might be misremembering.

johnsmith1840 · 1h ago
I'm positive they are pushing users to cheaper models due to cost. o1-pro is now in a sub menu for pro users and labeled legacy. The big inference methods must be stupidly expensive.

o1-preview was and possibly still is the most powerful model they ever released. I only switched to pro for coding after months of them improving it and my api bill getting a bit crazy (like 0.50$ per question).

I don't think parameter count matters anymore. I think the only thing that matters is how much compute a vendor will give you per question.

mukesh610 · 2h ago
I built this very same thing today! The only difference is that I pushed the tool call outputs into the conversation history and resent it back to the LLM for it to summarize, or perform further tool calls, if necessary, automagically.

I used ollama to build this and ollama supports tool calling natively, by passing a `tools=[...]` in the Python SDK. The tools can be regular Python functions with docstrings that describe the tool use. The SDK handles converting the docstrings into a format the LLM can recognize, so my tool's code documentation becomes the model's source of truth. I can also include usage examples right in the docstring to guide the LLM to work closely with all my available tools. No system prompt needed!

Moreover, I wrote all my tools in a separate module, and just use `inspect.getmembers` to construct the `tools` list that I pass to Ollama. So when I need to write a new tool, I just write another function in the tools module and it Just Works™

Paired with qwen 32b running locally, I was fairly satisfied with the output.
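
A minimal sketch of that setup (assuming ollama-python 0.4+, a hypothetical tools module whose public functions are the agent's tools, and whichever local model tag you are running):

```python
# Gather every function in the tools module, hand them to Ollama as tools,
# and loop: execute whatever the model calls, feed results back, repeat.
import inspect
import ollama
import tools  # hypothetical module: each public function is a tool with a docstring

TOOLS = [fn for _, fn in inspect.getmembers(tools, inspect.isfunction)]
REGISTRY = {fn.__name__: fn for fn in TOOLS}

messages = [{"role": "user", "content": "Summarize today's error logs."}]
while True:
    response = ollama.chat(model="qwen2.5:32b", messages=messages, tools=TOOLS)
    messages.append(response.message)
    if not response.message.tool_calls:
        break
    for call in response.message.tool_calls:
        result = REGISTRY[call.function.name](**call.function.arguments)
        messages.append({"role": "tool", "name": call.function.name, "content": str(result)})

print(response.message.content)
```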

degamad · 2h ago
> The only difference is that i pushed the tool call outputs into the conversation history and resent it back to the LLM for it to summarize, or perform further tool calls, if necessary, automagically.

It looks like this one does that too.

     msg = [ handle_tool_call(tc) for tc in tool_calls ]
mukesh610 · 2h ago
Ah, failed to notice that.

I was so excited because this was exactly what I coded up today, I jumped straight to the comments.

jawns · 2h ago
Not only can this be an effective strategy for coding tasks, but it can also be used for data querying. Picture a text-to-SQL agent that can query database schemas, construct queries, run explain plans, inspect the error outputs, and then loop several times to refine. That's the basic architecture behind a tool I built, and I have been amazed at how well it works. There have been multiple times when I've thought, "Surely it couldn't handle THIS prompt," but it does!

Here's an AWS post that goes into detail about this approach: https://aws.amazon.com/blogs/machine-learning/build-a-robust...
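
A stripped-down sketch of that refine-on-error loop (my illustration, assuming a SQLite database and the OpenAI SDK; the tool described above also inspects schemas and runs explain plans, which is omitted here):

```python
# Ask the model for a SQL query; if it fails, feed the error back and retry.
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("app.db")  # placeholder database path

def ask_llm(messages) -> str:
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages)
    return resp.choices[0].message.content.strip().strip("`")

def text_to_sql(question: str, max_tries: int = 5):
    schema = "\n".join(r[0] for r in db.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
    messages = [
        {"role": "system", "content": "Return a single SQLite query, nothing else."},
        {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
    ]
    for _ in range(max_tries):
        query = ask_llm(messages)
        try:
            return db.execute(query).fetchall()
        except sqlite3.Error as e:
            # Feed the error back so the model can refine its query.
            messages += [{"role": "assistant", "content": query},
                         {"role": "user", "content": f"That failed with: {e}. Fix the query."}]
    raise RuntimeError("Could not produce a working query.")
```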

BrandiATMuhkuh · 2h ago
That's really cool. One week ago I implemented an SQL tool and it works really well. But sometimes it still just makes up table/column names. Luckily it can read the error and correct itself.

But today I went to the next level. I gave the LLM two tools. One web search tool and one REST tool.

I told it at what URL it can find API docs. Then I asked it to perform some tasks for me.

It was really cool to watch an AI read docs, make api calls and try again (REPL) until it worked

jbellis · 3h ago
Yes!

Han Xiao at Jina wrote a great article that goes into a lot more detail on how to turn this into a production quality agentic search: https://jina.ai/news/a-practical-guide-to-implementing-deeps...

This is the same principle that we use at Brokk for Search and for Architect. (https://brokk.ai/)

The biggest caveat: some models just suck at tool calling, even "smart" models like o3. I only really recommend Gemini Pro 2.5 for Architect (smart + good tool calls); Search doesn't require as high a degree of intelligence and lots of models work (Sonnet 3.7, gpt-4.1, Grok 3 are all fine).

crawshaw · 3h ago
I'm curious about your experiences with Gemini Pro 2.5 tool calling. I have tried using it in agent loops (in fact, sketch has some rudimentary support I added), and compared with the Anthropic models I have had to actively reprompt Gemini regularly to make tool calls. Do you consider it equivalent to Sonnet 3.7? Has it required some prompt engineering?
jbellis · 3h ago
Confession time: litellm still doesn't support parallel tool calls with Gemini models [https://github.com/BerriAI/litellm/issues/9686] so we wrote our own "parallel tool calls" on top of Structured Output. It did take a few iterations on the prompt design but all of it was "yeah I can see why that was ambiguous" kinds of things, no real complaints.

GP2.5 does have a different flavor than S3.7 but it's hard to say that one is better or worse than the other [edit: at tool calling -- GP2.5 is definitely smarter in general]. GP2.5 is I would say a bit more aggressive at doing "speculative" tool execution in parallel with the architect, e.g. spawning multiple search agent calls at the same time, which for Brokk is generally a good thing but I could see use cases where you'd want to dial that back.

bhouston · 3h ago
I found this out too - it is quite easy and effective:

https://benhouston3d.com/blog/building-an-agentic-code-from-...

amelius · 2h ago
Huh, isn't this already built-in, in most chat UIs?