All these new tools are so exciting, but running untrusted code that auto-updates itself is what's keeping me from trying them.
I wish for a vetting tool: have an LLM examine the code, then write a spec of what it reads and writes, and you can examine that before running it. If something in the list is suspect... you'll know before you're hosed, not after :)
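A minimal sketch of what such a vetting pass could look like, assuming the anthropic Python SDK; the file name, model id, and prompt wording are placeholders for illustration, not an existing tool:

    # Hedged sketch: ask a model to list what an untrusted script reads and
    # writes, so a human can review that list before running the script.
    import pathlib
    from anthropic import Anthropic

    client = Anthropic()  # picks up ANTHROPIC_API_KEY from the environment
    code = pathlib.Path("install.sh").read_text()  # hypothetical file to vet

    prompt = (
        "List every file path, network endpoint, and environment variable "
        "this script reads or writes, one item per line. Flag anything that "
        "touches credentials or sends data off the machine.\n\n" + code
    )

    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; substitute your own
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    print(msg.content[0].text)  # review this output before executing install.sh

As the reply below points out, a malicious script can also try to prompt-inject the reviewer model, so this is at best a screening aid, not a guarantee.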
nothrabannosir · 3h ago
Throwing more LLMs at a prompt escaper is like throwing more regexes at an HTML parser.
If the first LLM wasn't enough, the second won't be either. You're in the wrong layer.
Not a professional developer (though Guillermo certainly is) so take this with a huge grain of salt, but I like the idea of an AI "trained" on security vulnerabilities as a second, third and fourth set of eyes!
ffsm8 · 1h ago
I'm not sure how to take that seriously given the current reality, where almost all security findings from LLM tools are false positives.
While I suspect it would work well enough on synthetic examples to trick naive and uninformed people into trusting it... at the very least, current LLMs are unable to provide enough stability for this to be useful.
It might become viable with future models, but there is little value in discussing this approach right now, at least until someone actually builds a PoC that works roughly as designed, without a 50-100% false-positive rate.
You can tolerate some false positives, but the rate has to be low enough that people still listen to it, which currently isn't the case.
adastra22 · 13m ago
Put it in a docker instance with a mounted git worktree?
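A rough sketch of that setup (the worktree path, base image, and command are placeholders, and it assumes Docker and a dedicated git worktree already exist):

    # Rough sketch: run an agent CLI inside a throwaway container so it can
    # only touch a dedicated git worktree, never the rest of the host.
    import pathlib
    import subprocess

    worktree = pathlib.Path.home() / "agent-worktree"  # e.g. from `git worktree add`

    subprocess.run(
        [
            "docker", "run", "--rm", "-it",
            "--network", "none",             # drop network access if the agent can work offline
            "-v", f"{worktree}:/workspace",  # only this directory is visible inside
            "-w", "/workspace",
            "ubuntu:24.04",                  # placeholder image with the agent preinstalled
            "bash",                          # or the agent CLI itself
        ],
        check=True,
    )

When the agent is done, its changes land in the worktree as ordinary commits/diffs you can review before merging.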
troupo · 13m ago
> All these new tools are so exciting,
Most of these tools are not that exciting. They are similar-looking TUIs around third-party models/LLM calls.
What is the difference between this, and https://opencode.ai? Or any of the half a dozen tools that appeared on HN in the past few weeks?
lionkor · 4h ago
That's cool and all, until you get malicious code that includes prompt injections, and code that never runs but looks super legit.
LLMs are NOT THOROUGH. Not even remotely. I don't understand how anyone can use LLMs and not see this instantly. I have yet to see an LLM achieve a failure rate better than around 50% in the real world, with real-world expectations.
Especially with code review, LLMs catch some things, miss a lot of things, and get a lot of things completely and utterly wrong. It takes someone wholly incompetent at code review to look at an LLM review and go "perfect!".
Edit: Feel free to write a comment if you disagree
stpedgwdgfhgdd · 3h ago
My suggestion is to try CC, use a language like Go, and read their blog posts on how they use it internally. They are transparent about what works and what does not.
resonious · 4h ago
If you go in knowing that LLMs are not thorough, you can get your failure rates way lower than 50%. Of course, if you just paste a product spec into an LLM, it will do a bad job.
If you build an intuition for what kinds of asks an LLM (agent, really) can do well, you can choose to only give it those tasks, and that's where the huge speedups come from.
Don't know what to do about prompt injection, really. But "untrusted code" in the broader sense has always been a risk. If I download and use a library, the author already has free rein over my computer - they don't even need to think about messing with my LLM assistant.
nxobject · 1h ago
Unfortunately, I haven’t been able to use this with many of the recent open weight code/instruct models - CC tool use doesn’t work with Qwen3 and Kimi K2 for me.
crocowhile · 4h ago
This is what got me started with Claude Code. I gave it a try using the OpenRouter API and got a bill of $40 for 2-3 hours of work. At that point, subscribing to the Anthropic plan became a no-brainer.
blitzar · 2h ago
What is the secret sauce of Claude Code that makes it, somewhat irrespective of the backend LLM, better than the competition?
Is it just better prompting? Better tooling?
CuriouslyC · 1m ago
The agentic instructions just seem to be better. It does things by default (such as working up a plan of action) that other agents need to be prompted for, and it seems to get stuck in failure sinks less often. The actual Claude model is decent, but Claude Code is probably the best agentic tool out there right now.
ethan_smith · 23m ago
Claude's edge comes from its superior context handling (up to 200K tokens), better tool use capabilities, and constitutional AI training that reduces hallucinations in code generation.
EnPissant · 6h ago
Claude Code with a plan is so much cheaper than any API.
sylware · 2h ago
It is a bit off-topic here, but has anybody tried to use such LLMs for code porting, from C++ (and similar languages) to plain C99+?
With LLMs, we may finally have a much more efficient way to deal with devs doing lock-in via ultra-complex-syntax languages.
I already have some ideas for target C++ code to port to C99+.
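As a crude starting point, a sketch along these lines could produce a first-pass translation to review (again assuming the anthropic SDK; the file names and model id are placeholders, and the output still has to be compiled and checked):

    # Hedged sketch: ask a model for a first-pass C99 port of one C++ file.
    import pathlib
    from anthropic import Anthropic

    client = Anthropic()
    cpp_source = pathlib.Path("widget.cpp").read_text()  # hypothetical input

    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": "Port this C++ to plain C99. Keep behavior identical, "
                       "replace classes with structs and free functions, and "
                       "note anything you could not translate:\n\n" + cpp_source,
        }],
    )
    pathlib.Path("widget.c").write_text(msg.content[0].text)

Doing this file by file, with the compiler and tests in the loop, is probably more realistic than a one-shot translation.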
1: https://aider.chat/
Ofc some might prefer the pure CLI experience, but mentioning that because it also supports a lot of providers.