Show HN: TrainCheck – Catch ML Training Bugs Before It's Too Late

3 justmattyou 1 8/11/2025, 6:34:04 AM github.com ↗
Silent training errors are bugs in ML training jobs that do not crash, raise exceptions, or obviously break metrics. They waste GPU hours, quietly degrade model quality, and often go unnoticed until it is too late.

An early example is a silent error from BLOOM-176B training, where model parameters silently diverged across GPUs and produced conflicting model checkpoints: https://github.com/bigscience-workshop/bigscience/blob/maste...

TrainCheck is a runtime checking tool that catches these issues early. It automatically infers invariants (semantic correctness rules) from known-good training runs, such as the official examples maintained by framework developers, and then enforces them on new runs to spot subtle violations in flight; many errors are caught within a single iteration. TrainCheck also tries to make invariants transferable across runs, setups, and even code structures, so users can benefit immediately without having to infer invariants from their own highly specialized setups and environments.
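To make the idea concrete, here is a minimal, hypothetical sketch of a relational invariant being replayed against trace events from a new run. The event format and the invariant itself are illustrative only, not TrainCheck's actual trace schema or invariant language:

    # Hypothetical invariant inferred from a known-good run: every
    # loss.backward() in an iteration is eventually followed by
    # optimizer.step() in the same iteration. Replaying it against a new
    # run's trace flags violations within a single iteration.
    def check_follows(events, lead="loss.backward", follow="optimizer.step"):
        pending = set()  # iterations still waiting for `follow`
        for ev in events:
            if ev["api"] == lead:
                pending.add(ev["iteration"])
            elif ev["api"] == follow:
                pending.discard(ev["iteration"])
        return sorted(pending)  # iterations that violate the invariant

    trace = [
        {"iteration": 0, "api": "loss.backward"},
        {"iteration": 0, "api": "optimizer.step"},
        {"iteration": 1, "api": "loss.backward"},  # step() never happens
    ]
    print(check_follows(trace))  # -> [1]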

We have used TrainCheck to uncover 18 real-world silent errors in popular PyTorch, DeepSpeed, and HuggingFace pipelines, including one during BLOOM-176B pretraining that standard metric monitoring missed. We also found 6 new bugs in DeepSpeed and Transformers.

- A 5-minute TrainCheck experience for you to try out: https://github.com/OrderLab/TrainCheck/blob/main/docs/5-min-...

- Link to the paper and slides: https://www.usenix.org/conference/osdi25/presentation/jiang

For anyone training large or long-running ML models, what silent bugs have you run into, and how do you catch them today?

Comments (1)

justmattyou · 23h ago
We have also spent a huge amount of effort making TrainCheck work for real ML engineering workflows while dealing with the quirks and limitations of Python itself. Here are a few of the hardest challenges we are still working through; we would love to hear how others have approached similar problems:

1. Instrumentation that just works

TrainCheck monkey patches APIs in PyTorch and other frameworks so users can start monitoring without building custom binaries. This is convenient, but it can break tools like TorchDynamo and sometimes triggers low-level issues. We are exploring ways to keep the plug-and-play experience while reducing overhead by instrumenting only the APIs that matter.
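For readers unfamiliar with the approach, the gist of API-level monkey patching looks roughly like this. The helper and the patched functions are chosen for brevity and are not TrainCheck's actual instrumentation code:

    import functools, time
    import torch

    trace_log = []

    def instrument(module, name):
        # Replace module.name in place with a wrapper that records each call.
        original = getattr(module, name)

        @functools.wraps(original)
        def wrapper(*args, **kwargs):
            trace_log.append({"api": f"{module.__name__}.{name}",
                              "t": time.time()})
            return original(*args, **kwargs)

        setattr(module, name, wrapper)

    # Any training script that imports torch now emits trace events for
    # these calls, with no custom build. Wrappers like this are also what
    # can confuse TorchDynamo, which expects to trace the original function.
    instrument(torch.nn.functional, "cross_entropy")
    instrument(torch.autograd, "backward")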

2. Single-threaded inference

Our traces are large, schema-less, and heterogeneous (Python dictionaries with things like tensors of different shapes). This rules out simple vectorization and keeps us stuck with Pandas/NumPy in a single thread. Suggestions for scalable querying on irregular data are welcome.
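A toy illustration of why the data resists vectorization (synthetic events, not our real schema): tensor-valued and ragged fields end up as object columns in pandas, so queries fall back to Python-level loops.

    import pandas as pd
    import torch

    events = [
        {"api": "optimizer.step", "iteration": 0, "lr": 1e-3},
        {"api": "module.forward", "iteration": 0, "out": torch.zeros(8, 128)},
        {"api": "module.forward", "iteration": 1, "out": torch.zeros(8, 64)},
    ]

    df = pd.DataFrame(events)
    print(df.dtypes)  # the tensor-valued column is stored as `object`

    # No vectorized path: a shape check degrades to a Python-level apply().
    bad = df[df["out"].apply(
        lambda t: isinstance(t, torch.Tensor) and t.shape[-1] != 128)]
    print(bad[["api", "iteration"]])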

3. Overlapping CPU and GPU work

CPU-heavy steps like trace serialization and invariant inference could overlap with GPU compute, but Python’s IPC costs and limited shared memory support (assuming we have to use multiprocessing) make this problematic. We are curious about practical approaches that avoid major rewrites in C++.
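The obvious structure, and where it hurts, sketched with plain multiprocessing (illustrative, not our actual pipeline): the worker keeps serialization off the training process, but every event is pickled through the queue, and for large tensor-bearing records that IPC cost can rival the work being offloaded.

    import json
    import multiprocessing as mp

    def serializer(queue, path):
        with open(path, "w") as f:
            while True:
                event = queue.get()   # pickled/unpickled here: the IPC cost
                if event is None:     # sentinel: shut down
                    break
                f.write(json.dumps(event) + "\n")

    if __name__ == "__main__":
        q = mp.Queue(maxsize=1024)
        worker = mp.Process(target=serializer, args=(q, "trace.jsonl"))
        worker.start()

        for step in range(100):       # stand-in for the training loop
            # ... GPU compute would run here, overlapped with serialization ...
            q.put({"iteration": step, "api": "optimizer.step"})

        q.put(None)
        worker.join()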