Necessary tool? Async LoRA for distributed systems
• Aggregator/worker setup where the aggregator hands out small “leases” of work, sized in tokens rather than time slices (rough sketch after this list)
• Async checkpointing so progress is saved continuously without pausing training (second sketch below).
• Preemption handling — when a worker dies, whatever it managed to do still counts, and the remaining work just gets reassigned.
• Training-aware logic (steps, tokens, loss) instead of treating jobs like black-box containers.
• Out-of-the-box hooks for PyTorch/DeepSpeed so you don’t have to glue it all together yourself.

My goal is to make sketchy clusters behave more like reliable ones.
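To make the lease idea concrete, here’s a rough sketch of what the aggregator side could look like. It’s deliberately simplified, and every name in it (Lease, Aggregator, grant, complete, reap_expired) is illustrative, not the actual API:

```python
import itertools
import time
from dataclasses import dataclass, field


@dataclass
class Lease:
    lease_id: int
    shard: str           # which data shard the worker reads from
    start_token: int     # offset into the shard, in tokens
    token_budget: int    # lease size, measured in tokens, not wall-clock time
    issued_at: float = field(default_factory=time.time)


class Aggregator:
    def __init__(self, shards, token_budget=50_000, lease_timeout_s=300):
        self._ids = itertools.count()
        self._next_offset = {s: 0 for s in shards}  # next fresh token per shard
        self.outstanding = {}                       # lease_id -> Lease
        self.requeue = []                           # unfinished tails to hand out again
        self.token_budget = token_budget
        self.lease_timeout_s = lease_timeout_s

    def grant(self, shard):
        """Hand a worker its next slice: requeued tails first, else fresh tokens."""
        if self.requeue:
            lease = self.requeue.pop()   # any requeued tail, to keep the sketch simple
            lease.issued_at = time.time()
        else:
            lease = Lease(next(self._ids), shard,
                          self._next_offset[shard], self.token_budget)
            self._next_offset[shard] += self.token_budget
        self.outstanding[lease.lease_id] = lease
        return lease

    def complete(self, lease_id, tokens_done):
        """Credit whatever the worker finished; requeue the unfinished tail."""
        lease = self.outstanding.pop(lease_id)
        leftover = lease.token_budget - tokens_done
        if leftover > 0:
            self.requeue.append(Lease(next(self._ids), lease.shard,
                                      lease.start_token + tokens_done, leftover))

    def reap_expired(self):
        """Reassign leases whose workers went silent (e.g. spot preemption)."""
        now = time.time()
        for lease_id, lease in list(self.outstanding.items()):
            if now - lease.issued_at > self.lease_timeout_s:
                # tokens_done=0 is the pessimistic fallback; with async
                # checkpointing the worker's last checkpoint supplies the real count.
                self.complete(lease_id, tokens_done=0)
```

Because leases are counted in tokens, a preempted worker never forfeits the tokens it already got through: only the remainder goes back into the queue.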
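And the async-checkpoint side, per worker, could be as small as a background writer thread plus a SIGTERM handler, assuming the LoRA adapter state is cheap to clone. Again, names like AsyncCheckpointer are illustrative, not the real hook API; you’d call ckpt.save(adapter.state_dict(), tokens_done) from the training loop:

```python
import signal
import threading

import torch


class AsyncCheckpointer:
    def __init__(self, path):
        self.path = path
        self._pending = None
        self._lock = threading.Lock()
        self._wake = threading.Event()
        self._stop = False
        threading.Thread(target=self._writer, daemon=True).start()

    def save(self, adapter_state, tokens_done):
        """Called from the training loop; returns immediately."""
        # Clone to CPU on the caller's thread so later optimizer steps
        # can't mutate the tensors while the writer is serializing them.
        snapshot = {k: v.detach().cpu().clone() for k, v in adapter_state.items()}
        with self._lock:
            self._pending = {"adapter": snapshot, "tokens_done": tokens_done}
        self._wake.set()

    def _writer(self):
        # Background thread: write the most recent snapshot, drop stale ones.
        while not self._stop:
            self._wake.wait()
            self._wake.clear()
            with self._lock:
                payload, self._pending = self._pending, None
            if payload is not None:
                torch.save(payload, self.path)  # atomic rename / upload omitted

    def flush_and_stop(self):
        """Write whatever is pending synchronously, then shut the writer down."""
        with self._lock:
            payload, self._pending = self._pending, None
        if payload is not None:
            torch.save(payload, self.path)
        self._stop = True
        self._wake.set()


# Spot/preemptible VMs usually send SIGTERM shortly before shutdown;
# flushing here is what lets a preempted worker's partial lease still count.
ckpt = AsyncCheckpointer("adapter.ckpt")
signal.signal(signal.SIGTERM, lambda *_: ckpt.flush_and_stop())
```

In this sketch the checkpoint would also want to carry the lease_id, so the aggregator’s complete() knows exactly which lease to credit.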
I’d love feedback from people here:
• If you run training on spot/preemptible GPUs, how do you usually handle checkpoints/failures?
• What would make this easier to drop into an existing pipeline (Airflow, K8s, Ray, etc.)?
• For monitoring, would you rather see native training metrics (loss, tokens, staleness), or have the tool just surface logs/events so you can plug into your own stack?