Tools: Code Is All You Need

275 Bogdanp 200 7/3/2025, 10:51:18 AM lucumr.pocoo.org ↗

Comments (200)

pclowes · 13h ago

Directionally I think this is right. Most LLM usage at scale tends to be filling the gaps between two hardened interfaces. The reliability comes not from the LLM inference and generation but the interfaces themselves only allowing certain configuration to work with them.

LLM output is often coerced back into something more deterministic such as types, or DB primary keys. The value of the LLM is determined by how well your existing code and tools model the data, logic, and actions of your domain.

In some ways I view LLMs today a bit like 3D printers, both in terms of hype and in terms of utility. They excel at quickly connecting parts similar to rapid prototyping with 3d printing parts. For reliability and scale you want either the LLM or an engineer to replace the printed/inferred connector with something durable and deterministic (metal/code) that is cheap and fast to run at scale.

Additionally, there was a minute during the 3D printer Gardner hype cycle where there were notions that we would all just print substantial amounts of consumer goods when the reality is the high utility use case are much more narrow. There is a corollary here to LLM usage. While LLMs are extremely useful we cannot rely on LLMs to generate or infer our entire operational reality or even engage meaningfully with it without some sort of pre-existing digital modeling as an anchor.

foobarbecue · 12h ago

Hype cycle for drones and VR was similar -- at the peak, you have people claiming drones will take over package delivery and everyone will spend their day in VR. Reality is that the applicability is more narrow.

soulofmischief · 11h ago

That's the claim for AR, not VR, and you're just noticing how research and development cycles play out, you can draw comparisons to literally any technology cycle.

65 · 7h ago

That is in fact the claim for VR. Remember the Metaverse? Oculus headsets are VR headsets. The Apple Vision Pro is a VR headset.

outworlder · 4h ago

> The Apple Vision Pro is a VR headset.

For some use cases it is indeed used for VR. But it has AR applications and all the necessary hardware and software.

mumbisChungo · 7h ago

The metaverse is and was a guess at how the children of today might interact as they age into active market participants. Like all these other examples, speculative mania preceded genuine demand and it remains to be seen whether it plays out over the coming 10-15 years.

sizzle · 7h ago

Ahh yes let’s get the next generation addicted to literal screens strapped to their eyeballs for maximum monetization, humanity be damned. Glad it’s a failing bet. Now sex bots might be onto something…

mumbisChungo · 6h ago

It may or may not be a failing bet. Maybe smartphones are the ultimate form of human-data interface and we'll simply never do better.

jrm4 · 6h ago

I'll take your argument a bit further. The thing is -- "human-data" interfaces are not particularly important. Human-Human ones are. This is probably why it's going to be difficult, if not impossible, to beat the smartphone; VR or whatever doesn't fundamentally "bring people closer together" in a way the smartphone nearly absolutely did.

mumbisChungo · 6h ago

VR may not, but social interaction with AR might be more palatable and better UX than social interaction while constantly looking down at at a computer we still call a "phone" for some reason.

freetinker · 3h ago

Drones are delivering. Ukraine?

ivape · 9h ago

You checked out drone warfare? It’s all the rage in every conflict at the moment. The hype around drones is not fake, and I’d compare it more to autonomous cars because regulation is the only reason you don’t see a million private drones flying around.

dazed_confused · 9h ago

Yes, to an extent, but I would say that is an extension of artillery and long-range fire capabilities.

jmj · 8h ago

As is well known, AI is whatever hasn't been done yet.

golergka · 7h ago

People claimed that we would spend most of our day on the internet in the mid-90s, and then the dotcom bubble burst. And then people claimed that by 2015 robo-taxis would be around all the major cities of the planet.

You can be right but too early. There was a hype wave for drones and VR (more than one for the latter one), but I wouldn't be so sure that it's peak of their real world usage yet.

TeMPOraL · 4h ago

Which is why I think there are two distinct kinds of perspective, and for one of them, AI hype is just about at the right levels - and being too early is not a problem, unless it delays things indefinitely.

I wrote about it recently here: https://news.ycombinator.com/item?id=44208831. Quoting myself (sorry):

> For me, one of the Beneficiaries, the hype seems totally warranted. The capability is there, the possibilities are enormous, pace of advancement is staggering, and achieving them is realistic. If it takes a few years longer than the Investor group thinks - that's fine with us; it's only a problem for them.

skeeter2020 · 6h ago

>> You can be right but too early.

Unless opportunity cost is zero this is a varation on being wrong.

jangxx · 11h ago

I mean both of these things are actually happening (drone deliveries and people spending a lot of time in VR), just at a much much smaller scale than it was hyped up to be.

giovannibonetti · 11h ago

Drones and VR require significant upfront hardware investment, which curbs adoption. On the other hand, adopting LLM-as-a-service has none of these costs, so no wonder so many companies are getting involved with it so quickly.

nativeit · 10h ago

Right, but abstract costs are still costs to someone, so how far does that go before mass adoption turns into a mass liability for whomever is ultimately on the hook? It seems like there is this extremely risky wager that everyone is playing--that LLM's will find their "killer app" before the real costs of maintaining them becomes too much to bear. I don't think these kinds of bets often pay off. The opposite actually, I think every truly revolutionary technological advance in the contemporary timeframe has arisen out of its very obvious killer app(s), they were in a sense inevitable. Speculative tech--the blockchain being one of the more salient and frequently tapped examples--tends to work in pretty clear bubbles, in my estimation. I've not yet been convinced this one is any different, aside from the absurd scale at which it has been cynically sold as the biggest thing since Gutenberg, but while that makes it somewhat distinct, it's still a rather poor argument against it being a bubble.

pxc · 11h ago

A parallel outcome for LLMs sounds realistic to me.

deadbabe · 11h ago

If it’s not happening at the scale it was pitched, then it’s not happening.

falcor84 · 10h ago

Considering what we've been seeing in the Russia-Ukraine and Iran-Israel wars, drones are definitely happening at scale. For better or for worse, I expect worldwide production of drones to greatly expand over the coming years.

jangxx · 11h ago

This makes no sense, just because something didn't become as big as the hypemen said it would doesn't make the inventions or users of those inventions disappear.

deadbabe · 9h ago

For something to be considered “happening” you can’t just have a handful of localized examples. It has to be happening at a large noticeable scale that even people unfamiliar with the tech are noticing. Then you can say it’s “happening”. Otherwise, it’s just smaller groups of people doing stuff.

threatofrain · 10h ago

Good drones are very Chinese atm, as is casual consumer drone delivery. Americans might be more than a decade away even with concerted bipartisan war-like effort to boost domestic drone competency.

The reality is Chinese.

sarchertech · 8h ago

Aren’t people building DIY drones that are close to and in some cases superior to off the shelf Chinese drones?

threatofrain · 8h ago

Off the shelf Chinese drones is somewhat vague, we can just say DJI. Their full drone and dock system for the previous generation goes for around $20k. DJI iterates on this space on a yearly cadence and have just come out with the Dock 3.

54 minute flight time (47 min hover) for fully unmanned operations.

If you're talking about fpv racing where tiny drones fly around 140+ mph, then yeah DJI isn't in that space.

sarchertech · 7h ago

That hardly seems like it would take the US 10 years to replicate on a war footing aside from the price.

I mean if we’re talking dollar to dollar comparison, the US will likely never be able to produce something as cheaply as China (unless China drastically increases their average standard of living).

tonyarkles · 4h ago

There’s a really weird phenomenon too with drones. I’ve used Chinese (non-drone) software for work a bunch in the past and it’s been almost universally awful. On the drone side, especially DJI, they’ve flipped this script completely. Every non-DJI drone I’ve flown has had miserable UX in comparison to DJI. Mission Planner (open source, as seen in the Ukraine attack videos) is super powerful but also looks like ass and functions similarly. QGC is a bit better, especially the vendor-customized versions (BSD licensed) but the vendors almost always neuter great features that are otherwise available in the open source version and at the same time modify things so that you can’t talk to the aircraft using the OSS version. The commercial offerings I’ve used are no better.

Sure, we need to be working on being able to build the hardware components in North America, and I’ve seen a bunch of people jump on that in the last year. But wow is the software ever bad and I haven’t really seen anyone working to improve that.

whiplash451 · 12h ago

Interesting take but too bearish on LLMs in my opinion.

LLMs have already found large-scale usage (deep research, translation) which makes them more ubiquitous today than 3D printers ever will or could have been.

benreesman · 12h ago

What we call an LLM today (by which almost everyone means an autogressive language model from the Generative Pretrained Transformer family tree, and BERTs are still doing important eork, believe that) is actually an offshoot of neural machine translation.

This isn't (intentionally at least) mere HN pedantry: they really do act like translation tools in a bunch of observable ways.

And while they have recently crossed the threshold into "yeah, I'm always going to have a gptel buffer open now" territory at the extreme high end, their utility outside of the really specific, totally non-generalizing code lookup gizmo usecase remains a claim unsupported by robust profits.

There is a hole in the ground where something between 100 billion and a trillion dollars in the ground that so far has about 20B in revenue (not profit) going into it annually.

AI is going to be big (it was big ten years ago).

LLMs? Look more and more like the Metaverse every day as concerns the economics.

rapind · 11h ago

> There is a hole in the ground where something between 100 billion and a trillion dollars in the ground that so far has about 20B in revenue (not profit) going into it annually.

This is a concern for me. I'm using claude-code daily and find it very useful, but I'm expecting the price to continue getting jacked up. I do want to support Anthropic, but they might eventually need to cross a price threshold where I bail. We'll see.

I expect at some point the more open models and tools will catch up when the expensive models like ChatGPT plateau (assuming they do plateau). Then we'll find out if these valuations measure up to reality.

Note to the Hypelords: It's not perfect. I need to read every change and intervene often enough. "Vibe coding" is nonsense as expected. It is definitely good though.

juped · 10h ago

I'm just taking advantage and burning VCs' money on useful but not world-changing tools while I still can. We'll come out of it with consumer-level okay tools even if they don't reach the levels of Claude today, though.

strgcmc · 9h ago

As a thought-exercise -- assume models continue to improve, whereas "using claude-code daily" is something you choose to do because it's useful, but is not yet at the level of "absolute necessity, can't imagine work without it". What if it does become, that level of absolute necessity?

- Is your demand inelastic at that point, if having claude-code becomes effectively required, to sustain your livelihood? Does pricing continue to increase, until it's 1%/5%/20%/50% of your salary (because hey, what's the alternative? if you don't pay, then you won't keep up with other engineers and will just lose your job completely)?

- But if tools like claude-code become such a necessity, wouldn't enterprises be the ones paying? Maybe, but maybe like health-insurance in America (a uniquely dystopian thing), your employer may pay some portion of the premiums, but they'll also pass some costs to you as the employee... Tech salaries have been cushy for a while now, but we might be entering a "K-shaped" inflection point --> if you are an OpenAI elite researcher, then you might get a $100M+ offer from Meta; but if you are an average dev doing average enterprise CRUD, maybe your wages will be suppressed because the small cabal of LLM providers can raise prices and your company HAS to pay, which means you HAVE to bear the cost (or else what? you can quit and look for another job, but who's hiring?)

This is a pessimistic take of course (and vastly oversimplified / too cynical). A more positive outcome might be, that increasing quality of AI/LLM options leads to a democratization of talent, or a blossoming of "solo unicorns"... personally I have toyed with calling this, something like a "techno-Amish utopia", in the sense that Amish people believe in self-sufficiency and are not wholly-resistant to technology (it's actually quite clever, what sorts of technology they allow for themselves or not), so what if we could take that further?

If there was a version of that Amish-mentality of loosely-federated self-sufficient communities (they have newsletters! they travel to each other! but they largely feed themselves, build their own tools, fix their own fences, etc.!), where engineers + their chosen LLM partner could launch companies from home, manage their home automation / security tech, run a high-tech small farm, live off-grid from cheap solar, use excess electricity to Bitcoin mine if they choose to, etc.... maybe there is actually a libertarian world that can arise, where we are no longer as dependent on large institutions to marshal resources, deploy capital, scale production, etc., if some of those things are more in-reach for regular people in smaller communities, assisted by AI. This of course assumes that, the cabal of LLM model creators can be broken, that you don't need to pay for Claude if the cheaper open-source-ish Llama-like alternative is good enough

rapind · 8h ago

Well my business doesn't rely on AI as a competitive advantage, at least not yet anyways. So as it stands, if claude got 100x as effective, but cost 100x more, I'm not sure I could justify the cost because my market might just not be large enough. Which means I can either ditch it (for an alternative if one exists) or expand into other markets... which is appealing but a huge change from what I'm currently doing.

As usual, the answer is "it depends". I guarantee though that I'll at least start looking at alternatives when there's a huge price hike.

Also I suspect that a 100x improvement (if even possible) wouldn't just cost 100 times as much, but probably 100,000+ times as much. I also suspect than an improvement of 100x will be hyped as an improvement of 1,000x at least :)

Regardless, AI is really looking like a commodity to me. While I'm thankful for all the investment that got us here, I doubt anyone investing this late in the game at these inflated numbers are going to see a long term return (other than ponzi selling).

benreesman · 11h ago

Vibe coding is nonsense, and its really kind of uncomfortable to realize that a bunch of people you had tons of respect for are either ignorant or dishonest/bought enough to say otherwise. There's a cold wind blowing and the bunker-building crowd, well let's just say I won't shed a tear.

You don't stock antibiotics and bullets in a survival compound because you think that's going to keep out a paperclip optimizer gone awry. You do that in the forlorn hope that when the guillotines come out that you'll be able to ride it out until the Nouveau Regime is in a negotiating mood. But they never are.

sebzim4500 · 10h ago

>LLMs? Look more and more like the Metaverse every day as concerns the economics.

ChatGPT has 800M+ weekly active users how is that comparable to the Metaverse in any way?

benreesman · 10h ago

I said as concerns the economics. It's clearly more popular than the Oculus or whatever, but it's still a money bonfire and shows no signs of changing on that front.

threetonesun · 8h ago

LLMs as we know them via ChatGPT were a way to disrupt the search monopoly Google had for so many years. And my guess is the reason Google was in no rush to jump into that market was because they knew the economics of it sucked.

benreesman · 5h ago

Right, and inb4 ads on ChatGPT to stop the bleeding. That's the default outcome at this point: quantize it down gradually to the point where it can be ad supported.

You can just see the scene from the Sorkin film where Fidji is saying to Altman: "Its time to monetize the site."

"We don't even know what it is yet, we know that it is cool."

skeeter2020 · 6h ago

Th author is not bearish on LLMs at all; this post is about using LLMs and code vs. LLMs with autonomous tools via MCP. An example from your set would be translation. The author says you'll get better results if you do something like ask an LLM to translate documents, review the proposed approach, ask it to review it's work and maybe ask another LLM to validate the results than if you say "you've got 10K documents in English, and these tools - I speak French"

kibwen · 11h ago

No, 3D printers are the backbone of modern physical prototyping. They're far more important to today's global economy than LLMs are, even if you don't have the vantage point to see it from your sector. That might change in the future, but snapping your fingers to wink LLMs out of existence would change essentially nothing about how the world works today; it would be a non-traumatic non-event. There just hasn't been time to integrate them into any essential processes.

whiplash451 · 10h ago

> snapping your fingers to wink LLMs out of existence would change essentially nothing about how the world works today

One could have said the same thing about Google in 2006

kibwen · 10h ago

No, not even close. By 2006 all sorts of load-bearing infrastructure was relying on Google (e.g. Gmail). Today LLMs are still on the edge of important systems, rather than underlying those systems.

johnsmith1840 · 9h ago

Things like BERT are a load bearing structure in data science pipelines.

I assume there are massive number of LLM analysis pipelines out there.

I suppose it depends if you consider non determinist DS/ML pipelines "loadbearing" or not. Most are not using LLMs though.

3D parts regularly are used beyond prototyping though as tooling for a small company can be higher than just metal 3D parts. So I do somewhat agree but the loss of productivity in software prototyping would be a massive hit if LLMs vanished.

datameta · 11h ago

Without trying to take away from your assertion, I think it is worthwhile to mention that part of this phenomenon is the unavoidable matter of meatspace being expensive and dataspace being intangibly present everywhere.

nativeit · 10h ago

[citation needed]

deadbabe · 11h ago

large scale usage in niche domains is still small scale overall.

dingnuts · 12h ago

And yet you didn't provide a single reference link! Every case of LLM usage that I've seen claimed about those things has been largely a lie -- guess you won't take the opportunity to be the first to present a real example. Just another rumor.

whiplash451 · 12h ago

My reference is the daily usage of chatgpt around me (outside of tech circles).

I don’t want to sound like a hard-core LLM believer. I get your point and it’s fair.

I just wanted to point out that the current usage of chatgpt is a lot broader than that of 3D printers even at the peak hype of it.

dingnuts · 12h ago

Outside of tech circles it looks like NFTs: people following hype using tech they don't understand which will be popular until the downsides we're aware of that they are ignorant to have consequences, and then the market will reflect the shift in opinion.

basch · 11h ago

No way.

Everybody under a certain age is using ChatGPT, where they were once using search and friendship/expertises. It’s the number 1 app in the App Store. Copilot use in the enterprise is so seamless, you just talk to PowerPoint or outlook and it formulated what you were supposed to make or write.

It’s not a fad, it is a paradigm change.

People don’t need to understand how it works for it to work.

lotsoweiners · 7h ago

> It’s the number 1 app in the App Store.

When I checked the iOS App Store just now, something called Love Island USA is the #1 free app. Kinda makes you think….

dingnuts · 6h ago

I know it's popular; that doesn't mean it's not a fad. Consequences take time. It's easy to use but once you get burned in a serious way by the bot that's still wrong 20% of the time, you'll become more reluctant to put your coin in the slot machine.

Maybe if the AI companies start offering refunds for wrong answers, then the price per token might not be such a scam.

jrm4 · 10h ago

Not even remotely in the same universe; the difference is ChatGPT is actually having an impact, people are incorporating it day-to-day in a way that NFTs never stood much of a chance.

whiplash451 · 11h ago

I see it differently: people are switching to chatgpt like they switched to google back in 2005 (from whatever alternative existed back then)

And I mean random people, not tech circles

It’s very different from NFTs in that respect

retsibsi · 9h ago

Even if the most bearish predictions turn out to be correct, the comparison of LLMs to NFTs is a galaxy-spanning stretch.

NFTs are about as close to literally useless as it gets, and that was always obvious; 99% of the serious attention paid to them came from hustlers and speculators.

LLMs, for all their limitations, are already good at some things and useful in some ways. Even in the areas where they are (so far) too unreliable for serious use, they're not pure hype and bullshit; they're doing things that would have seemed like magic 10 years ago.

hk1337 · 11h ago

> Directionally I think this is right.

We have a term at work we use called, "directionally accurate", when it's not entirely accurate but headed in the right direction.

abdulhaq · 13h ago

this is a really good take

simonw · 12h ago

Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.

My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...

vunderba · 10h ago

> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.

Your test case seems like a quintessential example where you're missing that last step.

Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means for verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot series?"

Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.

simonw · 8h ago

That's exactly why I like using Mandelbrot as a demo: it's perfect for "superficial visual inspection".

With a bunch more work I could likely have got a vision LLM to do that visual inspection for me in the assembly example, but having a human in the loop for that was much more productive.

shepherdjerred · 6h ago

Are fractals or x86 assembly representative of most dev work?

nartho · 6h ago

I think it's irrelevant. The point they are trying to make is anytime you ask a LLM for something that's outside of your area of expertise you have very little to no way to insure it is correct.

diggan · 5h ago

> anytime you ask a LLM for something that's outside of your area of expertise you have very little to no way to insure it is correct.

I regularly use LLMs to code specific functions I don't necessarily understand the internals of. Most of the time I do that, it's something math-heavy for a game. Just like any function, I put it under automated and manual tests. Still, I review and try to gain some intuition about what is happening, but it is still very far of my area of expertise, yet I can be sure it works as I expect it to.

chamomeal · 11h ago

That’s super cool, I’m glad you shared this!

I’ve been thinking about using LLMs for brute forcing problems too.

Like LLMs kinda suck at typescript generics. They’re surprisingly bad at them. Probably because it’s easy to write generics that look correct, but are then screwy in many scenarios. Which is also why generics are hard for humans.

If you could have any LLM actually use TSC, it could run tests, make sure things are inferring correctly, etc. it could just keep trying until it works. I’m not sure this is a way to produce understandable or maintainable generics, but it would be pretty neat.

Also while typing this is realized that cursor can see typescript errors. All I need are some utility testing types, and I could have cursor write the tests and then brute force the problem!

If I ever actually do this I’ll update this comment lol

chrisweekly · 9h ago

Giving LLMs the right context -- eg in the form of predefined "cognitive tools", as explored with a ton of rigor here^1 -- seems like the way forward, at least to this casual observer.

1. https://github.com/davidkimai/Context-Engineering/blob/main/...

(the repo is a WIP book, I've only scratched the surface but it seems pretty brilliant to me)

skeeter2020 · 6h ago

One of my biggest, ongoing challenges has been to get the LLM to use the tool(s) that are appropriate for the job. It feels like teach your kids to say, do laundry and you want to just tell them to step aside and let you do it.

nico · 11h ago

> LLM in a sandbox using tools in a loop, you can brute force that problem

Does this require using big models through their APIs and spending a lot of tokens?

Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?

I saw the Mandelbrot experiment, it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production

simonw · 11h ago

The local models aren't quite good enough for this yet in my experience - the big hosted models (o3, Gemini 2.5, Claude 4) only just crossed the capability threshold for this to start working well.

I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.

e12e · 7h ago

I wonder if common lisp with repl and debugger could provide a better tool than your example with nasm wrapped via apt in Docker...

Essentially just giving LLMs more state of the art systems made for incremental development?

Ed: looks like that sort of exists: https://github.com/bhauman/clojure-mcp

(Would also be interesting if one could have a few LLMs working together on red/green TDD approach - have an orchestrator that parse requirements, and dispatch a red goblin to write a failing test; a green goblin that writes code until the test pass; and then some kind of hobgoblin to refactor code, keeping test(s) green - working with the orchestrator to "accept" a given feature as done and move on to the next...

With any luck the resulting code might be a bit more transparent (stricter form) than other LLM code)?

nico · 10h ago

> it needs good tool calling, not an encyclopedic knowledge of the world

I wonder if there are any groups/companies out there building something like this

Would love to have models that only know 1 or 2 languages (eg. python + js), but are great at them and at tool calling. Definitely don't need my coding agent to know all of Wikipedia and translating between 10 different languages

johnsmith1840 · 9h ago

Given 2 datasets:

1. A special code dataset 2. A bunch of "unrelated" books

My understanding is that the model trained on just the first will never beat the model trained on both. Bloomberg model is my favorite example of this.

If you can squirell away special data then that special data plus everything else will beat the any other models. But that's basically what openai, google, and anthropic are all currently doing.

never_inline · 10h ago

Wasn't there a tool calling benchmark by docker guys which concluded qwen models are nearly as good as GPT? What is your experience about it?

Personally I am convinced JSON is a bad format for LLMs and code orchestration in python-ish DSL is the future. But local models are pretty bad at code gen too.

pxc · 11h ago

There's a fine-tune of Qwen3 4B called "Jan Nano" that I started playing with yesterday, which is basically just fine-tuned to be more inclined to look things up via web searches than to answer them "off the dome". It's not good-good, but it does seem to have a much lower effective hallucination rate than other models of its size.

It seems like maybe similar approaches could be used for coding tasks, especially with tool calls for reading man pages, info pages, running `tldr`, specifically consulting Stack Overflow, etc. Some of the recent small MoE models from Chinese companies are significantly smarter than models like Qwen 4B, but run about as quickly, so maybe on systems with high RAM or high unified memory, even with middling GPUs, they could be genuinely useful for coding if they are made to be avoid doing anything without tool use.

rasengan · 12h ago

Makes sense.

I treat an LLM the same way I'd treat myself as it relates to context and goals when working with code.

"If I need to do __________ what do I need to know/see?"

I find that traditional tools, as per the OP, have become ever more powerful and useful in the age of LLMs (especially grep).

Furthermore, LLMs are quite good at working with shell tools and functionalities (heredoc, grep, sed, etc.).

dist-epoch · 12h ago

I've been using a VM for a sandbox, just to make sure it won't delete my files if it goes insane.

With some host data directories mounted read only inside the VM.

This creates some friction though. Feels like a tool which runs the AI agent in a VM, but then copies it's output to the host machine after some checks would help, so that it would feel that you are running it natively on the host.

jitl · 12h ago

This is very easy to do with Docker. Not sure it you want the vm layer as an extra security boundary, but even so you can just specify the VM’s docker api endpoint to spawn processes and copy files in/out from shell scripts.

simonw · 11h ago

Have you tried giving the model a fresh checkout in a read-write volume?

dist-epoch · 11h ago

Hmm, excellent idea, somehow I assumed that it would be able to do damage in a writable volume, but it wouldn't be able to exit it, it would be self-contained to that directory.

antirez · 12h ago

I have the feeling that's not really MCP specifically VS other ways, but it is pretty simply: at the current state of AI, to have a human in the loop is much better. LLMs are great at certain tasks but they often get trapped into local minima, if you do the back and forth via the web interface of LLMs, ask it to write a program, look at it, provide hints to improve it, test it, ..., you get much better results and you don't cut yourself out to find a 10k lines of code mess that could be 400 lines of clear code. That's the current state of affairs, but of course many will try very hard to replace programmers that is currently not possible. What it is possible is to accelerate the work of a programmer several times (but they must be good both at programming and LLM usage), or take a smart person that has a relatively low skill in some technology, and thanks to LLM make this person productive in this field without the long training otherwise needed. And many other things. But "agentic coding" right now does not work well. This will change, but right now the real gain is to use the LLM as a colleague.

It is not MCP: it is autonomous agents that don't get feedbacks from smart humans.

rapind · 11h ago

So I run my own business (product), I code everything, and I use claude-code. I also wear all the other hats and so I'd be happy to let Claude handle all of the coding if / when it can. I can confirm we're certainly not there yet.

It's definitely useful, but you have to read everything. I'm working in a type-safe functional compiled language too. I'd be scared to try this flow in a less "correctness enforced" language.

That being said, I do find that it works well. It's not living up to the hype, but most of that hype was obvious nonsense. It continues to surprise me with it's grasp on concepts and is definitely saving me some time, and more importantly making some larger tasks more approachable since I can split my time better.

mritchie712 · 14h ago

> try completing a GitHub task with the GitHub MCP, then repeat it with the gh CLI tool. You'll almost certainly find the latter uses context far more efficiently and you get to your intended results quicker.

This is spot on. I have a "devops" folder with a CLAUDE.md with bash commands for common tasks (e.g. find prod / staging logs with this integration ID).

When I complete a novel task (e.g. count all the rows that were synced from stripe to duckdb) I tell Claude to update CLAUDE.md with the example. The next time I ask a similar question, Claude one-shots it.

This is the first few lines of the CLAUDE.md

    This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

    ## Purpose
    This devops folder is dedicated to Google Cloud Platform (GCP) operations, focusing on:
    - Google Cloud Composer (Airflow) DAG management and monitoring
    - Google Cloud Logging queries and analysis
    - Kubernetes cluster management (GKE)
    - Cloud Run service debugging

    ## Common DevOps Commands

    ### Google Cloud Composer
    ```bash
    # View Composer environment details
    gcloud composer environments describe meltano --location us-central1 --project definite-some-id

    # List DAGs in the environment
    gcloud composer environments storage dags list --environment meltano --location us-central1 --project definite-some-id

    # View DAG runs
    gcloud composer environments run meltano --location us-central1 dags list

    # Check Airflow logs
    gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="meltano"' --project definite-some-id --limit 50

jayd16 · 12h ago

I feel like I'm taking crazy pills sometimes. You have a file with a set of snippets and you prefer to ask the AI to hopefully run them instead of just running it yourself?

lreeves · 12h ago

The commands aren't the special sauce, it's the analytical capabilities of the LLM to view the outputs of all those commands and correlate data or whatever. You could accomplish the same by prefilling a gigantic context window with all the logs but when the commands are presented ahead of time the LLM can "decide" which one to run based on what it needs to do.

mritchie712 · 12h ago

the snippets are examples. You can ask hundreds of variations of similar, but different, complex questions and the LLM can adjust the example for that need.

I don't have a snippet for, "find all 500's for the meltano service for duckdb syntax errors", but it'd easily nail that given the existing examples.

dingnuts · 11h ago

but if I know enough about the service to write examples, most of the time I will know the command I want, which is less typing, faster, costs less, and doesn't waste a ton of electricity.

In the other cases I see what the computer outputs, LEARN, and then the functionality of finding what I need just isn't useful next time. Next time I just type the command.

I don't get it.

loudmax · 10h ago

LLMs are really good at processing vague descriptions of problems and offering a solution that's reasonably close to the mark. They can be a great guide for unfamiliar tools.

For example, I have a pretty good grasp of regular expressions because I'm an old Perl programmer, but I find processing json using `jq` utterly baffling. LLMs are great at coming up with useful examples, and sometimes they'll even get it perfect the first time. I've learned more about properly using `jq` with the help of LLMs than I ever did on my own. Same goes for `ffmpeg`.

LLMs are not a substitute for learning. When used properly, they're an enhancement to learning.

Likewise, never mind the idiot CEOs of failing companies looking forward to laying off half their workforce and replacing them with AI. When properly used, AI is a tool to help people become more productive, not replace human understanding.

qazxcvbnmlp · 8h ago

You dont ask the ai to run the commands. you say "build and test this feature" and then the AI correctly iterates back and forth between the build and test commands until the thing works.

light_hue_1 · 12h ago

Yes. I'm not the poster but I do something similar.

Because now the capabilities of the model grow over time. And I can ask questions that involve a handful of those snippets. When we get to something new that requires some doing, it becomes another snippet.

I can offload everything I used to know about an API and never have to think about it again.

lsaferite · 14h ago

Just as a related aside, you could literally make that bottom section into a super simple stdio MCP Server and attach that to Claude Code. Each of your operations could be a tool and have a well-defined schema for parameters. Then you are giving the LLM a more structured and defined way to access your custom commands. I'm pretty positive there are even pre-made MCP Servers that are designed for just this activity.

Edit: First result when looking for such an MCP Server: https://github.com/inercia/MCPShell

gbrindisi · 13h ago

wouldn't this defeat the point? Claude Code already has access to the terminal, adding specific instruction in the context is enough

lsaferite · 12h ago

No. You are giving textual instructions to Claude in the hopes that it correctly generates a shell command for you vs giving it a tool definition with a clearly defined schema for parameters and your MCP Server is, presumably, enforcing adherence to those parameters BEFORE it hits your shell. You would be helping Claude in this case as you're giving a clearer set of constraints on operation.

wrs · 11h ago

Well, with MCP you’re giving textual instructions to Claude in hopes that it correctly generates a tool call for you. It’s not like tool calls have access to some secret deterministic mode of the LLM; it’s still just text.

To an LLM there’s not much difference between the list of sample commands above and the list of tool commands it would get from an MCP server. JSON and GNU-style args are very similar in structure. And presumably the command is enforcing constraints even better than the MCP server would.

lsaferite · 5h ago

Not strictly true. The LLM provider should be running a constrained token selection based off of the json schema of the tool call. That alone makes a massive difference as you're already discarding non-valid tokens during the completion at a low level. Now, if they had a BNF Grammer for each cli tool and enforced token selection based on that, you'd be much better off than unrestrained token selection.

fassssst · 11h ago

Either way it is text instructions used to call a function (via a JSON object for MCP or a shell command for scripts). What works better depends on how the model you’re using was post trained and where in the prompt that info gets injected.

chriswarbo · 11h ago

I use a similar file, but just for myself (I've never used an LLM "agent"). I live in Emacs, but this is the only thing I use org-mode for: it lets me fold/unfold the sections, and I can press C-c C-c over any of the code snippets to execute it. Some of them are shell code, some of them are Emacs Lisp code which generates shell code, etc.

stpedgwdgfhgdd · 10h ago

I do something similar, but the problem is that claude.md keeps on growing.

To tackle this, I converted a custom prompt into an application, but there is an interesting trade-off. The application is deterministic. It cannot deal with unknown situations. In contrast to CC, which is way slower, but can try alternative ways of dealing with an unknown situation.

I ended up with adding an instruction to the custom command to run the application and fix the application code (TDD) if there is a problem. Self healing software… who ever thought

e12e · 7h ago

You're letting the LLM execute privileged API calls against your production/test/staging environment, just hoping it won't corrupt something, like truncate logs, files, databases etc?

Or are you asking it to provide example commands that you can sanity check?

I'd be curious to see some more concrete examples.

galdre · 12h ago

My absolute favorite use of MCP so far is Bruce Hauman's clojure-mcp. In short, it gives the LLM (a) a bash tool, (b) a persistent Clojure REPL, and (c) structural editing tools.

The effect is that it's far more efficient at editing Clojure code than any purely string-diff-based approach, and if you write a good test suite it can rapidly iterate back and forth just editing files, reloading them, and then re-running the test suite at the REPL -- just like I would. It's pretty incredible to watch.

chamomeal · 9h ago

I was just going to comment about clojure-mcp!! It’s far and away the coolest use of mcp I’ve seen so far.

It can straight up debug your code, eval individual expressions, document return types of functions. It’s amazing.

It actually makes me think that languages with strong REPLs are a better for languages than those without. Seeing clojure-mcp do its thing is the most impressive AI feat I’ve seen since I saw GPT-3 in action for the first time

e12e · 7h ago

https://github.com/bhauman/clojure-mcp

victorbjorklund · 12h ago

I think the GitHub CLI example isn't entirely fair to MCP. Yes, GitHub's CLI is extensively documented online, so of course LLMs will excel at generating code for well-known tools. But MCP shines in different scenarios.

Consider internal company tools or niche APIs with minimal online documentation. Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool. More importantly, generated code for unfamiliar APIs is prone to errors so you'd need robust testing and retry mechanisms built in to the process.

With MCP, if the tools are properly designed and receive correct inputs, they work reliably. The LLM doesn't need to figure out API intricacies, authentication flows, or handle edge cases - that's already handled by the MCP server.

So I agree MCP for GitHub is probably overkill but there are many legitimate use cases where pre-built MCP tools make more sense than asking an LLM to reverse-engineer poorly documented or proprietary systems from scratch.

the_mitsuhiko · 12h ago

> Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool.

MCP works exactly that way: you dump documentation into the context. That's how the LLM knows how to call your tool. Even for custom stuff I noticed that giving the LLM things to work with that it knows (eg: python, javascript, bash) beats it using MCP tool calling, and in some ways it wastes less context.

YMMV, but I found the limit of tools available to be <15 with sonnet4. That's a super low amount. Basically the official playwright MCP alone is enough to fully exhaust your available tool space.

JyB · 10h ago

Ive never used that many. The LLM performances collapse/degrade significantly because of too much initial context? It seems like MCP implems updates could easily solve that. Like only injecting relevant servers for the given task based on initial user prompt.

the_mitsuhiko · 8h ago

> Ive never used that many.

The playwright MCP alone introduces 25 tools into the context :(

light_hue_1 · 12h ago

That's handled by the MCP server in the sense of it doesn't do authentication, etc. it provides a simplified view of the world.

If that's what you wanted you could have designed that as your poorly documented internal API differently to begin with. There's zero advantage to MCP in the scenario you describe aside from convincing people that their original API is too hard to use.

pamelafox · 10h ago

Regarding the Playwright example: I had the same experience this week attempting to build an agent first by using the Playwright MCP server, realizing it was slow, token-inefficient, and flaky, and rewriting with direct Playwright calls.

MCP servers might be fun to get an idea for what's possible, and good for one-off mashups, but API calls are generally more efficient and stable, when you know what you want.

Here's the agent I ended up writing: https://github.com/pamelafox/personal-linkedin-agent

Demo: https://www.youtube.com/live/ue8D7Hi4nGs

arkmm · 8h ago

This is cool! Also have found the Playwright MCP implementation to be overkill and think of it more as a reference to an opinionated subset of the Playwright API.

LinkedIn has this reputation of being notorious about making it hard to build automations on top of, did you run into any roadblocks when building your personal LinkedIn agent?

zahlman · 6h ago

... Ah, reading these as well as more carefully reading TFA, I understand now that there is an MCP based on Playwright, and that Playwright itself is not considered an example of something that accidentally is an MCP despite having been released all the way back in January 2020.

... But now I still feel far away from understanding what MCP really is. As in:

* What specifically do I have to implement in order to create one?

* Now that the concept exists, what are the implications as the author of, say, a traditional REST API?

* Now that the concept exists, what new problems exist to solve?

novoreorx · 1h ago

Looks familiar, it seems to share some ideas with this one: LLM function calls dont scale, code orchestration is simpler, more effective

Source: https://jngiam.bearblog.dev/mcp-large-data/ HN: https://news.ycombinator.com/item?id=44053744

jumploops · 12h ago

We’re playing an endless cat and mouse game of capabilities between old and new right now.

Claude Code shows that the models can excel at using “old” programmatic interfaces (CLIs) to do Real Work™.

MCP is a way to dynamically provide “new” programmatic interfaces to the models.

At some point this will start to converge, or at least appear to do so, as the majority of tools a model needs will be in its pre-training set.

Then we’ll argue about MPPP (model pre-training protocol pipeline), and how to reduce knowledge pollution of all the LLM-generated tools we’re passing to the model.

Eventually we’ll publish the Merrium-Webster Model Tool Dictionary (MWMTD), surfacing all of the approved tools hidden in the pre-training set.

Then the kids will come up with Model Context Slang (MCS), in an attempt to use the models to dynamically choose unapproved tools, for much fun and enjoyment.

Ad infinitum.

mindwok · 14h ago

More appropriately: the terminal is all you need.

I have used MCP daily for a few months. I'm now down to a single MCP server: terminal (iTerm2). I have OpenAPI specs on hand if I ever need to provide them, but honestly shell commands and curl get you pretty damn far.

jasonthorsness · 13h ago

I never knew how far it was possible to go in bash shell with the built-in tools until I saw the LLMs use them.

zahlman · 6h ago

Possibly because most people who could mentor you, would give up and switch to their preference of {Perl, Python, Ruby, PHP, ...} far earlier.

(Check out Dave Eddy, though. https://github.com/bahamas10 ; also occasionally streams on YouTube and then creates short educational video content there: https://www.youtube.com/@yousuckatprogramming )

skydhash · 2h ago

Spurred by macos switch to zsh, I've gone all in on zsh customization (with oh-my-zsh). But since I've moved to Linux, it's been bash daily (with a brief fish interlude). Everything under the sun works with bash.

sakesun · 1h ago

Actually, this is the way human brain work. It's what we now know as system-1 (automatic, unconscious) and system-2 (effortful, conscious) described in Thinking Fast and Slow book.

JyB · 12h ago

> It demands too much context.

This is solved trivially by having default initial prompts. All majors tools like Claude Code or Gemini CLI have ways to set them up.

> You pass all your tools to an LLM and ask it to filter it down based on the task at hand. So far, there hasn't been much better approaches proposed.

Why is a "better" approach needed? If modern LLMs can properly figure it out? It's not like LLMs don't keep getting better with larger and larger context length. I never had a problem with an LLM struggling to use the appropriate MCP function on it's own.

> But you run into three problems: cost, speed, and general reliability

- cost: They keep getting cheaper and cheaper. It's ridiculously inexpensive for what those tools provide.

- speed: That seem extremely short sighted. No one is sitting idle looking at Claude Code in their terminal. And you can have more than one working on unrelated topics. That defeats the purpose. No matter how long it takes the time spent is purely bonus. You don't have to spend time in the loop when asking well defined tasks.

- reliability: Seem very prompt correlated ATM. I guess some people don't know what to ask which is the main issue.

Having LLMS being able to complete tedious tasks involving so many external tools at once is simply amazing thanks to MCP. Anecdotal but just today it did a task flawlessly involving: Notion pages, Linear Ticket, git, GitHub PR, GitHub CI logs. Being in the loop was just submitting one review on the PR. All the while I was busy doing something else. And for what, ~1$?

the_mitsuhiko · 12h ago

> This is solved trivially by having default initial prompts. All majors tools like Claude Code or Gemini CLI have ways to set them up.

That only makes it worse. The MCP tools available all add to the initial context. The more tools, the more of the context is populated by MCP tool definitions.

JyB · 12h ago

Do you mean that some tools (MCP clients) pass all functions of all configured MCP servers in the initial prompt?

If that's the case: I understand the knee-jerk reaction but if it works? Also what theoretically prevents altering the prompt chaining logic in these tools to only expose a condensed list of MCP servers, not their whole capabilities, and only inject details based on LLM outputs? Doesn't seem like an insurmountable problem.

the_mitsuhiko · 11h ago

> Do you mean that some tools (MCP clients) pass all functions of all configured MCP servers in the initial prompt?

Not just some, all. That's just how MCP works.

> If that's the case: I understand the knee-jerk reaction but if it works?

I would not be writing about this if it worked well. The data indicates that it worse significantly worse than not using MCP because of the context rot, and the low too utilization.

JyB · 11h ago

I guess I don't see the technical limitation. Seem like a protocol update issue.

dingnuts · 12h ago

> cost: They keep getting cheaper and cheaper

no they don't[0], the cost is just still hidden from you but the freebies will end just like MoviePass and cheap Ubers

https://bsky.app/profile/edzitron.com/post/3lsw4vatg3k2b

"Cursor released a $200-a-month subscription then made their $20-a-month subscription worse (worse output, slower) - yet it seems even on Max they're rate limiting people!"

https://bsky.app/profile/edzitron.com/post/3lsw3zwgw4c2h

fkyoureadthedoc · 11h ago

The cost will stay hidden from me because my job will pay it, just like the cost of my laptop, o365 license, and every other tool I use at work.

nativeit · 10h ago

Until they use your salary to pay for another dozen licenses.

JyB · 9h ago

Fair. I'm using Claude Code which is pay as you go. The Market will probably do its things. (The company pays anyway obviously)

elyase · 13h ago

This is similar to the tool call (fixed code & dynamic params) vs code generation (dynamic code & dynamic params) discussion: tools offer contrains and save tokens, code gives you flexibility. Some papers suggest that generating code is often superior and this will likely become even more true as language models improve

[1] https://huggingface.co/papers/2402.01030

[2] https://huggingface.co/papers/2401.00812

[3] https://huggingface.co/papers/2411.01747

I am working on a model that goes a step beyond and even makes the distinction between thinking and code execution unnecessary (it is all computation in the end), unfortunately no link to share yet

SatvikBeri · 6h ago

I use Julia at work, which benefits from long-running sessions, because it compiles functions the first time they run. So I wrote a very simple MCP that lets Claude Code send code to a persistent Julia kernel using Jupyter.

It had a much bigger impact than I expected – not only does test code run much faster (and not time out), but Claude seems to be much more willing to just run functions from our codebase rather than do a bunch of bespoke bash stuff to try and make something work. It's anecdotal, but CCUsage says my token usage has dropped nearly 50% since I wrote the server.

Of course, it didn't have to be MCP – I could have used some other method to get Claude to run code from my codebase more frequently. The broader point is that it's much easier to just add a useful function to my codebase than it is to write something bespoke for Claude.

macleginn · 5h ago

"Claude seems to be much more willing to just run functions from our codebase rather than do a bunch of bespoke bash stuff to try and make something work" -- simply because it knows that there is a kernel it can send code to?

jrm4 · 10h ago

Yup, I can't help but think that a lot of the bad thinking comes from trying to avoid the following fact: LLMs are only good where your output does not need to be precise and/or verifiably "perfect," which is kind of the opposite of how code has worked, or has tried to work, in the past.

Right now I got it for: DRAFTS of prose things -- and the only real killer in my opinion, autotagging thousands of old bookmarks. But again, that's just to have cool stuff to go back and peruse, not something that must be correc.t

never_inline · 10h ago

The problem I see with MCP is very simple. It's using JSON as the format and that's nowhere as expressive as a programming language.

Consider a python function signature

list_containers(show_stopped: bool = False, name_pattern: Optional[str] = None, sort: Literal["size", "name", "started_at"] = "name"). It doesn't even need docs

Now convert this to JSON schema which is 4x larger input already.

And when generating output, the LLM will generate almost 2x more tokens too, because JSON. Easier to get confused.

And consider that the flow of calling python functions and using their output to call other tools etc... is seen 1000x more times in their fine tuning data, whereas JSON tool calling flows are rare and practically only exist in instruction tuning phase. Then I am sure instruction tuning also contains even more complex code examples where model has to execute complex logic.

Then theres the whole issue of composition. To my knowledge there's no way LLM can do this in one response.

    vehicle = call_func_1()
    if vehicle.type == "car":
      details = lookup_car(vehicle.reg_no)
    else if vehicle.type == "motorcycle":
      details = lookup_motorcycle(vehicle.reg_ni)

How is JSON tool calling going to solve this?

8note · 9h ago

the reason to use the llm is that you dont know ahead of time that the vehicle type is only a car or motorcycle, and the llm will also figure out a way to detail bycicles and boats and airplanes, and to consider both left and right shoes separately.

the llm cant just be given this function because its specialized to just the two options.

you could have it do a feedback loop of rewriting the python script after running it, but whats the savings at tha point? youre wasting tokens talking about cars in python when you already know is a ski, and the llm could ask directly for the ski details without writing a script to do it in between

chrisweekly · 9h ago

Great point.

But "the" problem with MCP? IMVHO (Very humble, non-expert) the half-baked or missing security aspects are more fundamental. I'd love to hear updates about that from ppl who know what they're talking about.

CharlieDigital · 10h ago

    > So maybe we need to look at ways to find a better abstraction for what MCP is great at, and code generation. For that that we might need to build better sandboxes and maybe start looking at how we can expose APIs in ways that allow an agent to do some sort of fan out / fan in for inference. Effectively we want to do as much in generated code as we can, but then use the magic of LLMs after bulk code execution to judge what we did.

Por que no los dos? I ended up writing an OSS MCP server that securely executes LLM generated JavaScript using a C# JS interpreter (Jint) and handing it a `fetch` analogue as well as `jsonpath-plus`. Also gave it a built-in secrets manager.

Give it an objective and the LLM writes its own code and uses the tool iteratively to accomplish the task (as long as you can interact with it via a REST API).

For well known APIs, it does a fine job generating REST API calls.

You can pretty much do anything with this.

https://github.com/CharlieDigital/runjs

graerg · 13h ago

> This is a significant advantage that an MCP (Multi-Component Pipeline) typically cannot offer

Oh god please no, we must stop this initialism. We've gone too far.

the_mitsuhiko · 13h ago

It's the wrong acronym. I wrote this blog post on the bike and used an LLM to fix up the dictation that I did. While I did edit it heavily and rewrote a lot of things, I did not end up noticing that my LLM expanded MCP incorrectly. It's Model Context Protocol.

apgwoz · 12h ago

And you shipped it to production. Just like real agentic coding! Nice!

the_mitsuhiko · 12h ago

Which I don't feel great about because I do not like to use LLMs for writing blog posts. I just really wanted to explore if I can write a blog post on my bike commute :)

bitwize · 12h ago

We're all in line to get de-rezzed by the MCP, one way or another.

luckystarr · 11h ago

I always dreamed of a tool which would know the intent, semantic and constraints of all inputs and outputs of any piece of code and thus could combine these code pieces automatically. It was always a fuzzy idea in my head, but this piece now made it a bit more clear. While LLMs could generate those adapters between distinct pieces automatically, it's a expensive (latency, tokens) process. Having a system with which not only to type the variables, but also to type the types (intents, semantic meaning, etc.) would be helpful but likely not sufficient. There has been so much work on ontologies, semantic networks, logical inference, etc. but all of it is spread all over the place. I'd like to have something like this integrated into a programming language and see what it feels like.

khalic · 11h ago

Honestly, I’m getting tired of these sweeping statements about what developers are supposed to be, how it’s “the right way to use AI”. We are in uncharted territories that are changing by the day. Maybe we have to drop the self-assurance and opinionated view points and tackle this like a scientific problem.

pizzathyme · 11h ago

100% agreed - he mentions 3 barriers to using MCP over code: "cost, speed, and general reliability". But all 3 of these could change by 10-100x within a few years, if not months. Just recently OpenAI dropped the price of using o3 by 80%

This is not an environment where you can establish a durable manifesto

tristanz · 11h ago

You can combine MCPs within composable LLM generated code if you put in a little work. At Continual (https://continual.ai), we have many workflows that require bulk actions, e.g. iterating over all issues, files, customers, etc. We inject MCP tools into a sandboxed code interpreter and have the agent generate both direct MCP tool calls and composable scripts that leverage MCP tools depending on the task complexity. After a bunch of work it actually works quite well. We are also experimenting with continual learning via a Voyager like approach where the LLM can save tool scripts for future use, allowing lifelong learning for repeated workflows.

JyB · 10h ago

That autocompounding aspect of constantly refining initial prompts with more and more knowledge is so interesting. Gut feeling says it’s something that will be “standardized” in some way, exactly like what MCP did.

tristanz · 10h ago

Yes, I think you could get quite far with a few tools like memory/todo list + code interpreter + script save/load. You could probably get a lot farther though if you RLVRed this similar to how o3 uses web search so effectively during it's thinking process.

pramodbiligiri · 10h ago

Wouldn't the sweet spot for MCP be where the LLM is able to do most of the heavy lifting on its own (outputting some kind of structured or unstructured output), but needs a bit of external/dynamic data that it can't do without? The list of MCP servers/tools it can use should nail that external lookup in a (mostly) deterministic way.

This would work best if a human is the end consumer of this output, or will receive manual vetting eventually. I'm not sure I'd leave such a system running unsupervised in production ("the Automation at Scale" part mentioned by the OP).

ramoz · 10h ago

You don't solve the problem of being able to rely on the agent to call the MCP.

Hooks into the agent's execution lifecycle seem more reliable for deterministic behavior and supervision.

pramodbiligiri · 9h ago

I agree. In any large backend software running on a server, it's the LLM invocation which would be a call out to an external system, and with proper validation around the results. At which point, calling an "MCP Server" is also just your backend software invoking one more library/service based on inspecting some part of the response from the LLM.

This doesn't take away from the utility of MCP when it comes Claude Desktop and the likes!

jasonthorsness · 13h ago

Tools are constraints and time/token savers. Code is expensive in terms of tokens and harder to constrain in environments that can’t be fully locked-down because network access for example is needed by the task. You need code AND tools.

blahgeek · 13h ago

> Code is expensive in terms of tokens and harder to constrain in environments

It's also true for human. But then we invented functions / libraries / modules

LudwigNagasena · 7h ago

I hit the same roadblock with MCP. If you work with data, LLM becomes a very expensive pipe with an added risk of hallucinations. It’s better to simply connect it to a Python environment enriched with integrations you need.

wrs · 11h ago

MCP is literally the same as giving an LLM a set of man page summaries and a very limited shell over HTTP. It’s just in a different syntax (JSON instead of man macros and CLI args).

It would be better for MCP to deliver function definitions and let the LLM write little scripts in a simple language.

aussieguy1234 · 1h ago

If it can be done on the command line , I'll just give the Agent permission to run commands.

But if I want the Agent to do something that can't be done with commands i.e. go into Google Docs and organize all my documents, then an MVP server would make sense.

vasusen · 12h ago

I think the Playwright MCP is a really good example of the overall problem that the author brings up.

However, I couldn’t really understand if he’s saying that the Playwright MCP is good to use for your own app or whether he means for your own app just tell the LLM directly to export Playwright code.

shelajev · 12h ago

it's the latter: "you can actually start telling it to write a Playwright Python script instead and run that".

and while running the code might faster, it's unclear whether that approach scales well. Sending an MCP tool command to click the button that says "X", is something a small local LLM can do. Writing complex code after parsing significant amount of HTML (for correct selectors for example) probably needs a managed model.

briandw · 10h ago

Anyone else switch their LLM subscription every month? I'm back on ChatGPT for O3 use, but expect that Grok4 will be next.

recursivedoubts · 12h ago

I would like to see MPC integrate the notion of hypermedia controls.

Seems like that would be a potential way to get self-organizing integrations.

vidarh · 13h ago

I frankly use tools mostly as an auth layer for things were raw access is too big a footgun without a permissions step. So I give the agent the choice of asking for permission to do things via the shell, or going nuts without user-interaction via a tool that enforces reasonable limitations.

Otherwise you can e.g just give it a folder of preapproved scripts to run and explain usage in a prompt.

manaskarekar · 11h ago

Off topic: That font/layout/contrast on the page is very pleasing and inviting.

prairieroadent · 9h ago

makes sense and if realized then deno is in excellent position to be one of the leading if not the main sandbox runtime for agents

FrustratedMonky · 12h ago

I wonder if having 2 LLM's communicate will eventually be more like humans talking. With all the same problems.

CuriouslyC · 11h ago

I already have agents managing different repositories ask each other questions and make requests. It works pretty well for the most part.

empath75 · 13h ago

The problem with this is that you have to give your LLM basically unbounded access to everything you have access to, which is a recipe for pain.

the_mitsuhiko · 13h ago

Not necessarily. I have a small little POC agentic tool on my side which is fully sandboxed, an it's inherently "non prompt injectable" by the data that it processes since it only ever passes that data through generated code.

Disclaimer: it does not work well enough. But I think it shows great promise.

webdevver · 13h ago

what does "MCP" stand for?

apgwoz · 12h ago

Mashup Context Protocol, of course! There was a post the other day comparing MCP tools to the mashups of web 2.0. It’s a much better acronym expansion.

aidenn0 · 11h ago

I didn't know either, but in the very first sentence, the author provides the expansion and a link to the Wikipedia page for it.

empath75 · 13h ago

If you don't know what it is and can't be bothered to google it, then you probably aren't the audience for this.

dangus · 13h ago

I was about to say the same thing.

It’s bad writing practice to do this, even if you are assuming your followers are following you.

Especially for a site like Twitter that has a login wall.

jasonlotito · 12h ago

I'm confused. It links to the definition in the first sentence, and I'm not sure what you mean by Twitter in this context.

komali2 · 13h ago

Microsoft Certified Professional, a very common certification.

Oh wait... hm ;) perhaps the writing nerds had it right when they recommend always writing the full acronym out the first time it's used in an article, no matter how common one presumes it to be

alganet · 6h ago

It's finally happening. The acceleration of the full AI disillusionment:

- LLMs will do everything.

- Shit, they won't. I'll do some traditional programming to put it on a leash.

- More traditional programming.

- Wait, this traditional programming thing is quite good.

- I barely use LLMs now.

- Man, that LLM stuff was a bad trip.

See you all on the other side!

meowface · 30m ago

You very much misunderstood the article (or, more likely, didn't even open it) because it fit your increasingly fringe skeptic narrative.

keybored · 9h ago

tl;dr of one of today’s AI posts: all you need is code generation

It’s 2025 and this is the epitome of progress.

On the positive side code generation can be solid if you also have/can generate easy-to-read validation or tests for the generated code. I mean that you can read, of course.

forrestthewoods · 12h ago

Unpopular Opinion: I hate Bash. Hate it. And hate the ecosystem of Unix CLIs that are from the 80s and have the most obtuse, inscrutable APIs ever designed. Also this ecosystem doesn’t work on Windows — which, as a game dev, is my primary environment. And no, WSL does not count.

I don’t think the world needs yet another shell scripting language. They’re all pretty mediocre at best. But maybe this is an opportunity to do something interesting.

Python environment is a clusterfuck. Which UV is rapidly bringing into something somewhat sane. Python isn’t the ultimate language. But I’d definitely be more interested in “replace yourself with a UV Python script” over “replace yourself with a shell script”. Would be nice to see use this as an opportunity to do better than Bash.

I realize this is unpopular. But unpopular doesn’t mean wrong.

zahlman · 5h ago

> Python environment is a clusterfuck. Which UV is rapidly bringing into something somewhat sane.

Uv is able to do what it does mainly because of a) being a greenfield project b) in an environment of new standards that the community has been working on since the first days that people complained about said clusterfuck.

But that's assuming you actually need to set up an environment. People really underestimate what can be done easily with just the standard library. And when they do grab the most popular dependencies, they end up exercising a tiny fraction of that code.

> But I’d definitely be more interested in “replace yourself with a UV Python script” over “replace yourself with a shell script”.

There is no such thing as "a UV Python script". Uv doesn't create a new language. It doesn't even have a monopoly on what I guess you're referring to, i.e. the system it uses for specifying dependencies inline in a script. That comes from an ecosystem-wide standard, https://peps.python.org/pep-0723/. Pipx also implements creating environments for such code and running it, as do Hatch and PDM; and other tools offer appropriate support - e.g. editors may be able to syntax-highlight the declaration etc.

Regardless, what you describe is not at all opposed to what the author has in mind here. The term "shell script" is often used quite loosely.

forrestthewoods · 4h ago

Ok?

lsaferite · 12h ago

Python CAN be a "shell script" in this case though...

Tool composition over stdio will get you very very far. That's what an interface "from the 80s" does for you 45 years later. That same stdio composability is easily pipe into/through any number of cli tools written in any number of languages, compiled and interpreted.

forrestthewoods · 11h ago

Composing via stdio is so bloody terrible. Layers and layers of bullshit string parsing and encoding and decoding. Soooo many bugs. And utterly undebuggable. A truly miserable experience.

zahlman · 5h ago

And now you also understand many of the limitations LLMs have.

hollerith · 12h ago

Me, too. Also, Unix as a whole is overrated. One reason it won was an agreement mediated by a Federal judge presiding over an anti-trust trial that AT&T would not enter the computer market while IBM would not enter the telecommunications market, so Unix was distributed at zero cost rather than sold.

Want to get me talking reverentially about the pioneers of our industry? Talk to me about Doug Engelbart, Xerox PARC and the Macintosh team at Apple. There was some brilliant work!

nativeit · 10h ago

> Also, Unix as a whole is overrated. One reason it won was an agreement mediated by a Federal judge presiding over an anti-trust trial that AT&T would not enter the computer market while IBM would not enter the telecommunications market, so Unix was distributed at zero cost rather than sold.

What did Unix win?

hollerith · 10h ago

Mind share of the basic design. Unix's design decisions are important parts of MacOS and Linux.

Multics would be an example of a more innovative OS than Unix, but its influence on the OSes we use today has been a lot less.

nativeit · 7h ago

I suppose the deeper question I'd have would be, how would its no-cost distribution prevent better alternatives from being developed/promoted/adopted along the way? I guess I don't follow your line of logic. To be fair, I'm not experienced enough with either OS development nor any notable alternatives to Unix to agree/disagree with your conclusions. My intuition wants to disagree, only because I like Linux, and even sort of like Bash scripts--but I have nothing but my own subjective preferences to base that position on, and I'm actually quite open to being better-informed into submission. ;-)

I'm a pretty old hat with Debian at this point, so I've got plenty of opinions for its contemporary implementations, but I always sort of assumed most of the fundamental architectural/systems choices had more or less been settled as the "best choices" via the usual natural selection, along with the OSS community's abiding love for reasoned debate. I can generally understand the issues folks have with some of these defaults, but my favorite aspect of OS's like Debian are that they generally defer to the sysadmin's desires for all things where we're likely to have strong opinions. It's "default position" of providing no default positions. Certainly now that there are containers and orchestration like Nix, the layer that is Unix is even less visible, and infrastructure-as-code mean a lot of developers can just kind of forget about the OS layer altogether, at least beyond the OS('s) they choose for their own daily driver(s).

Getting this back to the OG point--I can understand why people don't like the Bash scripting language. But it seems trivial these days to get to a point where one could use Python, Lua, Forth, et al to automate and control any system running a nix/BSD OS, and nix OS's do several key things rather well (in my opinion), such as service bootstrapping, lifecycle management, networking/comms, and maintaining a small footprint.

For whatever it's worth, one could start with nothing but a Debian ISO and some preseed files, and get to a point where they could orchestrate/launch anything they could imagine using their own language/application of choice, without ever touching having touched a shell prompt or writing a line of Bash. Not for nothing, that's almost certainly how many Linux-based customized distributions (and even full-blown custom/bespoke OS's) are created, but it doesn't have to be so complicated if one just wants to get to where Python scripts are able to run (for example).

hollerith · 6h ago

Most OSes no longer have any users or squeak by with less than 1000 users on their best day ever: Plan 9, OS/2, Beos, AmigaOS, Symbian, PalmOS, the OS for the Apple II, CP/M, VMS, TOPS-10, Multics, Compatible Time-Sharing System, Burroughs Master Control Program, Univac's Exec 8, Dartmouth Time-Sharing System, etc.

Some of the events that help Unix survive longer than most are the decision of DARPA (in 1979 or the early 1980s IIRC) to fund the addition of a TCP/IP networking stack to Unix and the decision in 1983 of Richard Stallman to copy the Unix design for his GNU project. The reason DARPA and Stallman settled on Unix was that they knew about it and were somewhat familiar with it because it was given away for free (mostly to universities and research labs). Success tends to beget success in "spaces" with strong "network externalities" such as the OS space.

>Getting this back to the OG point

I agree that it is easy to avoid writing shell scripts. The problem is that other people write them, e.g., as the recommended way to install some package I want. The recommended way to install a Rust toolchain for example is to run a shell script (rustup). I trust the Rust maintainers not to intentionally put an attack in the script, but I don't trust them not to have inadvertently included a vulnerability in the script that some third party might be able to exploit (particularly since it is quite difficult to write an attack-resistant shell script).

hollerith · 6h ago

OK, consider the browser market: are there any browsers that cost money? If so, I've not heard of it. From the beginning, Netscape Corporation, Microsoft, Opera and Apple gave away their browsers for free. That is because by the early 1990s it was well understood (at least by Silicon Valley execs) that what is important is grabbing mind share, and charging any amount of money would severely curtain the ability to do that.

In the 1970s when Unix started being distributed outside of Bell Labs, tech company execs did not yet understand that. The owners of Unix adopted a superior strategy to ensure survival of Unix by accident (namely, by being sued -- IIRC in the 1950s -- by the US Justice Department on anti-trust grounds).

osigurdson · 12h ago

Nobody likes coding in bash but everyone does it (a little) because it is everywhere.

forrestthewoods · 11h ago

> because it is everywhere

Except for the fact that actually it is not everywhere.

nativeit · 10h ago

I see your point, but bear with me here--it kind of is.

I suppose if one wanted to be pedantically literal, then you are indeed correct. In every other meaningful consideration, the parent comment is. Maybe not Bash specifically, but #!/bin/sh is broadly available on nearly every connected device on the planet, in some capacity. From the perspective of how we could automate nearly anything, you'd be hard-pressed to find something more universal than a shell script.

forrestthewoods · 10h ago

> you'd be hard-pressed to find something more universal than a shell script.

99.9% of my 20-year career has been spent on Windows. So bash scripts are entirely worthless and dead to me.

nativeit · 6h ago

What do you suppose the proportion is of computers actively running Windows in the world right now, versus those running some kind of *nix/BSD-based OS? This includes everything a person or machine could reasonably interface with, and that's Turing complete (in other words, a traffic light is limited to its own fixed logic, so it doesn't count; but most contemporary wifi routers contain general-purpose memory and processors, many even run some kind of *nix kernel, so they very much do count).

That's my case for Bash being more or less everywhere, but I think this debate is entirely semantic. Literally just talking about different things.

EDIT: escaped *

forrestthewoods · 6h ago

I think if someone were, for example, to release an open source C++ library and it only compiles for Linux or only comes with Bash scripts then I would not consider that library to be crossplatform nor would I consider it to run everywhere.

I don’t think it’s “just semantics”. I think it’s a meaningful distinction.

Game dev is a perhaps a small niche of computer programming. I mean these days the majority of programming is webdev JavaScript, blech. But game dev is also overwhelmingly Windows based. So I dispute any claim that Unix is “everywhere”. And I’m regularly annoyed by people who falsely pretend it is.

outworlder · 4h ago

Unix is everywhere. Except for Windows Desktops and a few places that like to run Windows Server enviroments. Those are increasingly rare.

forrestthewoods · 3h ago

So Unix is everywhere. Except for the places it’s not.

And “increasingly rare” does not mean “rare”. And even if it were “quite rare”, which it isn’t, that doesn’t imply “not supported”.

And just to add salt to the wound Linux is a profoundly worse dev environment. It’s quite the tragedy.

osigurdson · 8h ago

If you use git on Windows, bash is normally available. Agree, that this isn't widely used though.

forrestthewoods · 7h ago

Yeah I’ve never see anyone rely on users to use GitBash to run shell scripts.

Amusingly although I certainly use GitHub for hobby projects I’ve never actually used it for work. And have never come across a Windows project that mandated its use. Well, maybe one or two over the years.

osigurdson · 8m ago

Lot's of Microsoft shops (majority even) use ADO which is basically github + project management. I'd say most devs targeting Windows use git in some form. If not, what are you using? If git is installed, so is git bash. So bash is basically everywhere, even on Windows. The difference is, I'd say 1/50 devs using Windows actually know this.

baal80spam · 13h ago

Isn't it a bit like saying: saw is all you need (for carpenters)?

I mean, you _probably_ could make most furniture with only a saw, but why?

nativeit · 10h ago

In this analogy, do you have to design, construct, and learn from first principles to operate literally every other tool you'd like to use in addition to the saw?

Converge (YC S23) well-capitalized New York startup seeks product developers (runconverge.com)

Kyber (YC W23) Is Hiring Enterprise BDRs (ycombinator.com)

MindsDB (YC W20) is hiring an AI solutions engineer (job-boards.greenhouse.io)

Recurse Center (YC S10) Is Hiring a Career Facilitator (recurse.notion.site)

Cua (YC X25) is hiring an engineer (ycombinator.com)

Noloco (YC S21) is hiring a founder's associate in Barcelona (ycombinator.com)

14.ai (YC W24) hiring founding engineers in SF to build a Zendesk alternative (14.ai)

Lago (Open-Source Usage Based Billing) is hiring for ten roles (ycombinator.com)

Spark AI (YC W24) is hiring a full-stack engineer in SF (founding team) (ycombinator.com)

Bitmovin (YC S15) Is Hiring a Junior Solutions Engineer in Denver (bitmovin.com)

SigNoz (YC W21, Open Source Datadog) Is Hiring DevRel Engineers (Remote)(US) (ycombinator.com)

AccessOwl (YC S22) is hiring an Elixir Engineer to connect 100s of SaaS (ycombinator.com)

FurtherAI (YC W24) Is Hiring for Software and AI Roles (ycombinator.com)

Yarn (YC W24) is hiring engineers in NYC (ycombinator.com)

Expand.ai (YC S24) is hiring a founding engineer

Optifye.ai (YC W25) is hiring a back end engineer

Kastle (S24) is hiring an engineer (ycombinator.com)

Weave (YC W25) is hiring a founding AI engineer (ycombinator.com)

Qfex (YC X25) – Back End Engineer for a 24/7 Stock Exchange (ycombinator.com)

Attimet (YC F24) – Quant Trading Research Lab – Is Hiring Founding Engineer (ycombinator.com)

Jiga (YC W21) Is Hiring Software Engs to Make Life of Mech Engs Easier (workatastartup.com)

Foundry (YC F24) Hiring Early Engineer to Build Web Agent Infrastructure (ycombinator.com)

Blaze (YC S24) Is Hiring (ycombinator.com)

Infracost (YC W21) is hiring software engineers (GMT+2 to GMT-6) (infracost.io)

Solidroad (YC W25) Is Hiring (solidroad.com)

Kyber (YC W23) Is Hiring a Technical Account Manager (ycombinator.com)

Roundtable (YC S23) Is Hiring a President / CRO (ycombinator.com)

Roame (YC S23) Is Hiring (ycombinator.com)

GauntletAI (YC S17): All expenses paid AI training and guaranteed $200k+ job (gauntletai.com)

SchemeFlow (YC S24) Is Hiring an Engineer (London) to Speed Up Construction (ycombinator.com)

Tools: Code Is All You Need

Comments (200)