Show HN: TrainCheck – Catch ML Training Bugs Before It's Too Late

3 justmattyou 1 8/11/2025, 6:34:04 AM github.com ↗
Silent training errors are bugs in ML training jobs that do not crash, raise exceptions, or obviously break metrics. They waste GPU hours, quietly degrade model quality, and often go unnoticed until it is too late.

An early example is a silent error from BLOOM-176B training, where model parameters silently diverged across GPUs and produced conflicting model checkpoints: https://github.com/bigscience-workshop/bigscience/blob/maste...

TrainCheck is a runtime checking tool that catches these issues early. It automatically infers invariants (semantic correctness rules) from known-good training runs, such as the official examples maintained by framework developers, and then enforces them on new runs to spot subtle violations in flight; many errors are caught within a single iteration. TrainCheck also tries to make invariants transferable across runs, setups, and even code structures, so users can benefit immediately without having to infer invariants from their own highly specialized setups and environments.
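To make the idea concrete, here is a minimal, hypothetical sketch of a relational invariant being replayed against trace events from a new run. The event format and the invariant itself are illustrative only, not TrainCheck's actual trace schema or invariant language:

    # Hypothetical invariant inferred from a known-good run: every
    # loss.backward() in an iteration is eventually followed by
    # optimizer.step() in the same iteration. Replaying it against a new
    # run's trace flags violations within a single iteration.
    def check_follows(events, lead="loss.backward", follow="optimizer.step"):
        pending = set()  # iterations still waiting for `follow`
        for ev in events:
            if ev["api"] == lead:
                pending.add(ev["iteration"])
            elif ev["api"] == follow:
                pending.discard(ev["iteration"])
        return sorted(pending)  # iterations that violate the invariant

    trace = [
        {"iteration": 0, "api": "loss.backward"},
        {"iteration": 0, "api": "optimizer.step"},
        {"iteration": 1, "api": "loss.backward"},  # step() never happens
    ]
    print(check_follows(trace))  # -> [1]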

We have used TrainCheck to uncover 18 real-world silent errors in popular PyTorch, DeepSpeed, and HuggingFace pipelines, including one during BLOOM-176B pretraining that standard metric monitoring missed. We also found 6 new bugs in DeepSpeed and Transformers.

- A 5-minute TrainCheck experience for you to try out: https://github.com/OrderLab/TrainCheck/blob/main/docs/5-min-...

- Link to the paper and slides: https://www.usenix.org/conference/osdi25/presentation/jiang

For anyone training large or long-running ML models, what silent bugs have you run into, and how do you catch them today?

Comments (1)

justmattyou · 23h ago
We have also spent a huge amount of effort making TrainCheck work for real ML engineering workflows while dealing with the quirks and limitations of Python itself. Here are a few of the hardest challenges we are still working through; we would love to hear how others have approached similar problems:

1. Instrumentation that just works

TrainCheck monkey patches APIs in PyTorch and other frameworks so users can start monitoring without building custom binaries. This is convenient, but it can break tools like TorchDynamo and sometimes triggers low-level issues. We are exploring ways to keep the plug-and-play experience while reducing overhead by instrumenting only the APIs that matter.
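For readers unfamiliar with the approach, the gist of API-level monkey patching looks roughly like this. The helper and the patched functions are chosen for brevity and are not TrainCheck's actual instrumentation code:

    import functools, time
    import torch

    trace_log = []

    def instrument(module, name):
        # Replace module.name in place with a wrapper that records each call.
        original = getattr(module, name)

        @functools.wraps(original)
        def wrapper(*args, **kwargs):
            trace_log.append({"api": f"{module.__name__}.{name}",
                              "t": time.time()})
            return original(*args, **kwargs)

        setattr(module, name, wrapper)

    # Any training script that imports torch now emits trace events for
    # these calls, with no custom build. Wrappers like this are also what
    # can confuse TorchDynamo, which expects to trace the original function.
    instrument(torch.nn.functional, "cross_entropy")
    instrument(torch.autograd, "backward")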

2. Single-threaded inference

Our traces are large, schema-less, and heterogeneous (Python dictionaries with things like tensors of different shapes). This rules out simple vectorization and keeps us stuck with Pandas/NumPy in a single thread. Suggestions for scalable querying on irregular data are welcome.
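A toy illustration of why the data resists vectorization (synthetic events, not our real schema): tensor-valued and ragged fields end up as object columns in pandas, so queries fall back to Python-level loops.

    import pandas as pd
    import torch

    events = [
        {"api": "optimizer.step", "iteration": 0, "lr": 1e-3},
        {"api": "module.forward", "iteration": 0, "out": torch.zeros(8, 128)},
        {"api": "module.forward", "iteration": 1, "out": torch.zeros(8, 64)},
    ]

    df = pd.DataFrame(events)
    print(df.dtypes)  # the tensor-valued column is stored as `object`

    # No vectorized path: a shape check degrades to a Python-level apply().
    bad = df[df["out"].apply(
        lambda t: isinstance(t, torch.Tensor) and t.shape[-1] != 128)]
    print(bad[["api", "iteration"]])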

3. Overlapping CPU and GPU work

CPU-heavy steps like trace serialization and invariant inference could overlap with GPU compute, but Python’s IPC costs and limited shared memory support (assuming we have to use multiprocessing) make this problematic. We are curious about practical approaches that avoid major rewrites in C++.
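The obvious structure, and where it hurts, sketched with plain multiprocessing (illustrative, not our actual pipeline): the worker keeps serialization off the training process, but every event is pickled through the queue, and for large tensor-bearing records that IPC cost can rival the work being offloaded.

    import json
    import multiprocessing as mp

    def serializer(queue, path):
        with open(path, "w") as f:
            while True:
                event = queue.get()   # pickled/unpickled here: the IPC cost
                if event is None:     # sentinel: shut down
                    break
                f.write(json.dumps(event) + "\n")

    if __name__ == "__main__":
        q = mp.Queue(maxsize=1024)
        worker = mp.Process(target=serializer, args=(q, "trace.jsonl"))
        worker.start()

        for step in range(100):       # stand-in for the training loop
            # ... GPU compute would run here, overlapped with serialization ...
            q.put({"iteration": step, "api": "optimizer.step"})

        q.put(None)
        worker.join()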