Show HN: TrainCheck – Catch ML Training Bugs Before It's Too Late
An early example of a silent error from BLOOM-176B training, where model parameters silently diverged across GPUs and produced conflicting model checkpoints: https://github.com/bigscience-workshop/bigscience/blob/maste...
TrainCheck is a runtime checking tool that catches these issues early. It automatically infers invariants (semantic correctness rules) from known-good training runs, such as the official examples maintained by framework developers, and enforces them on new runs to spot subtle violations in flight. Many errors are caught within a single iteration. TrainCheck also tries to make invariants transferable across runs, setups, and even code structures, so users can benefit immediately without having to infer invariants from their own specialized setups and environments.
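To make the idea concrete, here is a minimal sketch of the kind of invariant such a tool could infer and enforce, using the BLOOM failure mode above: after an optimizer step, all data-parallel replicas must hold identical parameters. The function names and the hashing scheme are ours for illustration only, not TrainCheck's actual API.

    import hashlib
    import torch
    import torch.distributed as dist

    def param_digest(model: torch.nn.Module) -> str:
        # Hash all parameters into one digest for a cheap cross-rank comparison.
        # Assumes NumPy-convertible dtypes (fp32/fp16); bf16 would need a cast.
        h = hashlib.sha256()
        for p in model.parameters():
            h.update(p.detach().cpu().numpy().tobytes())
        return h.hexdigest()

    def check_replicas_consistent(model: torch.nn.Module) -> None:
        # Invariant: after optimizer.step(), all data-parallel ranks must hold
        # identical parameters, the property silently violated in the BLOOM run.
        digest = param_digest(model)
        digests = [None] * dist.get_world_size()
        dist.all_gather_object(digests, digest)
        if len(set(digests)) != 1:
            raise RuntimeError(f"parameter divergence across ranks: {digests}")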
We have used TrainCheck to uncover 18 real-world silent errors in popular PyTorch, DeepSpeed, and HuggingFace pipelines, including one during BLOOM-176B pretraining that standard metric monitoring missed. We also found 6 new bugs in DeepSpeed and Transformers.
- A 5-minute TrainCheck experience for you to try out: https://github.com/OrderLab/TrainCheck/blob/main/docs/5-min-...
- Link to the paper and slides: https://www.usenix.org/conference/osdi25/presentation/jiang
For anyone training large or long-running ML models, what silent bugs have you run into, and how do you catch them today?
A few open challenges we'd love input on:

1. Instrumentation that just works
TrainCheck monkey-patches APIs in PyTorch and other frameworks so users can start monitoring without building custom binaries. This is convenient, but it can break tools like TorchDynamo and occasionally triggers low-level issues. We are exploring ways to keep the plug-and-play experience while cutting overhead by instrumenting only the APIs that matter. A stripped-down sketch of the patching approach follows.
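For the curious, this is roughly what the technique looks like (our illustration, not TrainCheck's actual instrumentation code): wrap an optimizer's step at the class level so every call emits a trace record, with zero changes to user code.

    import functools
    import json
    import time
    import torch

    _orig_step = torch.optim.SGD.step  # patch a concrete optimizer class

    @functools.wraps(_orig_step)
    def traced_step(self, *args, **kwargs):
        out = _orig_step(self, *args, **kwargs)
        record = {
            "ts": time.time(),
            "api": "SGD.step",
            "lr": [g["lr"] for g in self.param_groups],
        }
        print(json.dumps(record))  # a real tool would append to a trace file
        return out

    torch.optim.SGD.step = traced_step  # every SGD user is now traced

Python-level wrappers like this are exactly what TorchDynamo can choke on when it traces through step(), which is part of why we want to narrow the patched surface.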
2. Single-threaded invariant inference
Our traces are large, schema-less, and heterogeneous (Python dictionaries holding things like tensors of different shapes). That rules out simple vectorization and keeps us stuck with pandas/NumPy on a single thread. Suggestions for scalable querying over irregular data are welcome; the toy example below shows the problem.
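As a concrete illustration (toy records, not our real trace schema): fields vary per event, so pandas stores them as object columns and any per-record predicate degenerates into an interpreted per-row Python loop.

    import pandas as pd

    trace = [
        {"event": "Optimizer.step", "lr": 1e-4},
        {"event": "forward", "input_shape": (32, 128), "dtype": "float16"},
        {"event": "all_reduce", "tensor_shapes": [(1024,), (4096, 4096)]},
    ]
    df = pd.DataFrame(trace)  # ragged rows -> NaNs and object-dtype columns

    # Any per-record predicate runs as a Python loop on one thread:
    has_shape = df.apply(lambda r: isinstance(r.get("input_shape"), tuple), axis=1)
    print(df.dtypes)
    print(has_shape.tolist())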
3. Overlapping CPU and GPU work
CPU-heavy steps like trace serialization and invariant inference could overlap with GPU compute, but Python's IPC costs and limited shared-memory support (assuming we have to use multiprocessing) make this hard. We are curious about practical approaches that avoid a major rewrite in C++.
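One obvious pattern, shown below as a sketch under the assumption that we stay with multiprocessing (names and the JSONL format are ours): hand records to a writer process through a bounded queue so serialization and disk I/O come off the training thread. The pickling on every put() is precisely the IPC cost that bites at high event rates.

    import json
    import multiprocessing as mp

    def writer(queue, path):
        # Drain records until the None sentinel arrives, writing JSONL.
        with open(path, "w") as f:
            for record in iter(queue.get, None):
                f.write(json.dumps(record) + "\n")

    if __name__ == "__main__":
        q = mp.Queue(maxsize=10_000)  # bounded, so a slow writer backpressures
        proc = mp.Process(target=writer, args=(q, "trace.jsonl"), daemon=True)
        proc.start()
        for step in range(100):
            q.put({"step": step, "api": "Optimizer.step"})  # pickled -> IPC cost
        q.put(None)  # shutdown sentinel
        proc.join()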