Why do LLMs still not run code before giving it to you?
1 highfrequency 3 8/3/2025, 7:58:37 PM
The leading models all advertise tool use, including code execution. So why is it still common to receive a short Python script containing a logical bug that would be discovered immediately by running it in a Python interpreter for 0.1 seconds? Is it a safety concern, or the difficulty of sandboxing in a VM? Surely it's not a resource consumption issue, given the price of a single CPU core vs. a GPU.
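For concreteness, here is a minimal sketch of the kind of check being asked about: run the generated script in a fresh interpreter process with a short timeout and inspect the exit code. The helper name `run_snippet` and the buggy sample are made up for illustration, and a subprocess with a timeout is not a sandbox; it does nothing to isolate the filesystem or network.

```python
import subprocess
import sys


def run_snippet(code: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run `code` in a fresh interpreter; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    # A non-zero exit code means the script raised or called sys.exit(n != 0).
    return proc.returncode == 0, proc.stdout + proc.stderr


# Hypothetical model output with an off-by-one indexing bug.
buggy = "xs = [1, 2, 3]\nprint(xs[len(xs)])\n"
ok, output = run_snippet(buggy)
```

Here `ok` comes back false and `output` carries the traceback, which could be fed back to the model before anything is shown to the user. Note that this only catches bugs that crash or hang; a script that runs cleanly but prints the wrong answer still needs tests.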
If you're doing TDD-style work with an AI, it's not uncommon to one-shot a function and then throw it against your battery of tests.
It's also pretty doable if you're writing smallish scripts or following functional coding paradigms; with functional code it's often easy to isolate individual modules and test each one against its criteria.
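A rough sketch of that workflow, assuming the tests are written first and the function body is whatever the model returns. Everything here (`TESTS`, `run_battery`, `reverse_words`) is a hypothetical stand-in, not any particular tool's API:

```python
from typing import Any, Callable

# Hypothetical test battery, written before any code is generated:
# each entry is (positional args, expected return value).
TESTS: list[tuple[tuple, Any]] = [
    (("",), ""),
    (("a",), "a"),
    (("hello world",), "world hello"),
    (("  two   spaces ",), "spaces two"),
]


def run_battery(fn: Callable[..., Any], tests: list[tuple[tuple, Any]]) -> list[str]:
    """Call fn on every case and return a list of failure descriptions."""
    failures = []
    for args, expected in tests:
        try:
            got = fn(*args)
        except Exception as exc:  # a crash counts as a failed case
            failures.append(f"{args!r}: raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{args!r}: expected {expected!r}, got {got!r}")
    return failures


def reverse_words(s: str) -> str:
    """Stand-in for a one-shot, model-generated function."""
    return " ".join(reversed(s.split()))


def reverse_words_buggy(s: str) -> str:
    """Stand-in for a plausible but wrong attempt: mishandles repeated spaces."""
    return " ".join(reversed(s.split(" ")))


failures_good = run_battery(reverse_words, TESTS)
failures_bad = run_battery(reverse_words_buggy, TESTS)
```

Because the function is pure (no I/O, no shared state), the battery can call it directly with no setup, which is what makes the functional style easy to check this way.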