Necessary tool? Async LoRA for distributed systems

I’ve been building something I call Async LoRA to scratch an itch I kept running into: training on cheap GPUs (Salad, RunPod, spot instances, etc.) is a nightmare for long jobs. One random node dies and suddenly hours of training are gone. Most schedulers just restart the whole container, which doesn’t really help. What I’ve put together so far:

• Aggregator/worker setup where the aggregator hands out small “leases” of work, sized in tokens rather than time slices (rough sketch of the lease bookkeeping after this list).

• Async checkpointing so progress gets saved continuously without pausing training (see the checkpointer sketch below).

• Preemption handling — when a worker dies, whatever it managed to do still counts, and the remaining work just gets reassigned.

• Training-aware logic (steps, tokens, loss) instead of treating jobs like black-box containers.

• Out-of-the-box hooks for PyTorch/DeepSpeed so you don’t have to glue it all together yourself (worker-loop sketch below).

My goal is to make sketchy clusters behave more like reliable ones.
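To make the lease idea concrete, here’s roughly how I think about the aggregator side, covering both the token-sized leases and the preemption handling. This is a simplified sketch, not the actual code: the names (Aggregator, request_lease, report_progress, the timeout-based liveness check) are illustrative assumptions. The core behavior is the point: leases are budgeted in tokens, partial progress is kept when a worker disappears, and only the unfinished remainder gets requeued.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Lease:
    lease_id: str
    start_token: int                      # first token index covered by this lease
    num_tokens: int                       # lease size in tokens, not wall-clock time
    worker_id: str | None = None
    tokens_done: int = 0                  # partial progress still counts on failure
    last_loss: float | None = None
    issued_at: float = field(default_factory=time.monotonic)


class Aggregator:
    """Hands out token-sized leases and requeues unfinished work from dead workers."""

    def __init__(self, total_tokens: int, lease_tokens: int, lease_timeout_s: float = 300.0):
        self.lease_timeout_s = lease_timeout_s
        # Pre-split the token budget into pending leases.
        self.pending = [
            Lease(str(uuid.uuid4()), start, min(lease_tokens, total_tokens - start))
            for start in range(0, total_tokens, lease_tokens)
        ]
        self.active: dict[str, Lease] = {}
        self.completed_tokens = 0

    def request_lease(self, worker_id: str) -> Lease | None:
        """Called by a worker that is ready for more work."""
        self._reclaim_expired()
        if not self.pending:
            return None
        lease = self.pending.pop(0)
        lease.worker_id = worker_id
        lease.issued_at = time.monotonic()
        self.active[lease.lease_id] = lease
        return lease

    def report_progress(self, lease_id: str, tokens_done: int, loss: float | None = None) -> None:
        """Workers report partial progress (and optionally loss); it counts even if they die later."""
        lease = self.active.get(lease_id)
        if lease:
            lease.tokens_done = tokens_done
            lease.last_loss = loss
            lease.issued_at = time.monotonic()   # any report doubles as a heartbeat

    def complete_lease(self, lease_id: str) -> None:
        lease = self.active.pop(lease_id, None)
        if lease:
            self.completed_tokens += lease.num_tokens

    def _reclaim_expired(self) -> None:
        """If a worker went silent, keep its partial progress and requeue only the remainder."""
        now = time.monotonic()
        for lease_id, lease in list(self.active.items()):
            if now - lease.issued_at > self.lease_timeout_s:
                self.completed_tokens += lease.tokens_done
                remaining = lease.num_tokens - lease.tokens_done
                if remaining > 0:
                    self.pending.append(Lease(
                        str(uuid.uuid4()),
                        lease.start_token + lease.tokens_done,
                        remaining,
                    ))
                del self.active[lease_id]
```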
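The async checkpointing is conceptually just a bounded queue plus a writer thread, so the training step never waits on disk. Again a hedged sketch: AsyncCheckpointer and its methods are made-up names, and it assumes the state being saved (LoRA adapter weights) is small enough that a per-snapshot CPU copy is cheap.

```python
import queue
import threading

import torch


class AsyncCheckpointer:
    """Writes checkpoints from a background thread so the training loop never blocks on I/O."""

    def __init__(self, path_template: str = "adapter_step{step}.pt"):
        self.path_template = path_template
        self.pending = queue.Queue(maxsize=2)   # bounds memory held by in-flight snapshots
        self.writer = threading.Thread(target=self._drain, daemon=True)
        self.writer.start()

    def save(self, model: torch.nn.Module, step: int) -> None:
        # Snapshot on the training thread (cheap when only adapter weights are saved),
        # then hand the CPU copy to the writer thread. Only blocks if two snapshots
        # are already queued behind a slow disk.
        state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        self.pending.put({"step": step, "state": state})

    def _drain(self) -> None:
        while True:
            item = self.pending.get()
            torch.save(item["state"], self.path_template.format(step=item["step"]))
            self.pending.task_done()
```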

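And the worker side, to show what “training-aware” means in practice: the loop counts tokens against the lease and reports loss/step/token progress instead of container-level status. This sketch assumes a HuggingFace-style model whose forward pass returns an object with a .loss, and it reuses the hypothetical Aggregator and AsyncCheckpointer from the sketches above; the real hooks wrap an existing PyTorch/DeepSpeed loop rather than replacing it.

```python
def run_lease(lease, model, optimizer, data_iter, aggregator, checkpointer, report_every=50):
    """Train against one lease: stop when its token budget is spent, report real training signals."""
    tokens_done, step = 0, 0
    model.train()
    while tokens_done < lease.num_tokens:
        batch = next(data_iter)                    # assumed: dict with "input_ids" and "labels" tensors
        out = model(**batch)                       # assumed: HF-style output exposing .loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        step += 1
        tokens_done += batch["input_ids"].numel()  # progress measured in tokens, not wall-clock time

        if step % report_every == 0:
            # Training-native telemetry: tokens against the lease plus current loss.
            aggregator.report_progress(lease.lease_id, min(tokens_done, lease.num_tokens),
                                       loss=out.loss.item())
            checkpointer.save(model, step)         # non-blocking; the writer thread handles the I/O

    aggregator.complete_lease(lease.lease_id)
```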
I’d love feedback from people here:

• If you run training on spot/preemptible GPUs, how do you usually handle checkpoints/failures?

• What would make this easier to drop into an existing pipeline (Airflow, K8s, Ray, etc.)?

• For monitoring, would you rather see native training metrics (loss, tokens, staleness) or just surface logs/events and let you plug into your own stack?
