Why do LLMs still not run code before giving it to you?

1 point · highfrequency · 2 comments · 8/3/2025, 7:58:37 PM

The leading models all advertise tool use, including code execution. So why is it still common to receive a short Python script with a logic bug that running it through a Python interpreter for 0.1 seconds would reveal immediately? Is it a safety concern, or the difficulty of sandboxing in a VM? Surely it is not a resource-consumption issue, given the price of a single CPU core vs. a GPU.
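To make the "just run it first" idea concrete, here is a minimal sketch of what such a check might look like. It is not how any vendor's tool use actually works. The helper name `run_snippet` and the 5-second default timeout are my own assumptions, and a plain subprocess is not a sandbox, so this only illustrates the cheap execution step, not the safety side.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_snippet(source: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run `source` in a fresh Python process (hypothetical helper).

    Returns (ok, output): ok is True only if the process exits with
    status 0; output is stdout followed by stderr.
    NOTE: this is NOT a sandbox; the code runs with the caller's privileges.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s} s"
    return proc.returncode == 0, proc.stdout + proc.stderr


if __name__ == "__main__":
    # A short script with a logic bug: divides by len(xs) - 1 instead of len(xs).
    buggy = (
        "def mean(xs):\n"
        "    return sum(xs) / (len(xs) - 1)\n"
        "\n"
        "assert mean([2, 4]) == 3\n"
    )
    ok, output = run_snippet(buggy)
    print("passed" if ok else "failed")
    print(output)
```

Note that the bug only surfaces because the snippet carries its own assertion. Running code catches crashes for free, but a logic bug needs a test that states the expected result.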

Comments (2)

tlb · 50m ago

Is it a common use case to produce a standalone program that could be tested in isolation? Usually I'm asking for a function (or just a few lines of change) that depends on the rest of my code & environment, so it's not trivial to test.

chasing0entropy · 1h ago

Sounds like an opportunity for you to make the world better by designing the process and implementing it.
