GPU Memory Snapshots: fast container cold boots

8 luiscape 1 7/31/2025, 4:14:16 PM modal.com ↗

Comments (1)

luiscape · 2h ago
Modal eng here.

We have been using the new CUDA Checkpoint API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CH...) in combination with gVisor's checkpoint/restore API and our custom file system to greatly reduce container cold boot times. This is particularly impactful if you need to warm up GPUs, for example when using torch.compile: on a cold boot restored from a snapshot, you skip torch.compile entirely.
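
For readers unfamiliar with the flow, here is a rough sketch of the ordering such a snapshot implies. As I understand the CUDA driver documentation, a process moves between running, locked, and checkpointed states via lock, checkpoint, restore, and unlock calls. The toy model below only mirrors those transitions in plain Python; `ToyCudaProcess`, `snapshot`, and `resume` are hypothetical names, nothing here calls the real driver, and how Modal actually sequences this with gVisor is an assumption.

```python
from enum import Enum


class ProcessState(Enum):
    # Modeled on the process states described in the CUDA driver docs.
    RUNNING = "running"
    LOCKED = "locked"
    CHECKPOINTED = "checkpointed"


# Allowed transitions: action -> (required state, resulting state).
TRANSITIONS = {
    "lock": (ProcessState.RUNNING, ProcessState.LOCKED),
    "checkpoint": (ProcessState.LOCKED, ProcessState.CHECKPOINTED),
    "restore": (ProcessState.CHECKPOINTED, ProcessState.LOCKED),
    "unlock": (ProcessState.LOCKED, ProcessState.RUNNING),
}


class ToyCudaProcess:
    """Hypothetical stand-in for a CUDA process; it never touches a GPU."""

    def __init__(self):
        self.state = ProcessState.RUNNING
        self.log = []

    def apply(self, action):
        required, result = TRANSITIONS[action]
        if self.state is not required:
            raise RuntimeError(
                f"{action} needs state {required.value}, got {self.state.value}"
            )
        self.state = result
        self.log.append(action)


def snapshot(proc):
    # Quiesce GPU work, then checkpoint GPU state so that a CPU-side
    # checkpointer (gVisor, in the comment above) can save the process.
    proc.apply("lock")
    proc.apply("checkpoint")


def resume(proc):
    # After the CPU-side restore, put GPU state back and let work continue.
    proc.apply("restore")
    proc.apply("unlock")
```

Calling `snapshot(proc)` and later `resume(proc)` on a fresh `ToyCudaProcess` walks through lock, checkpoint, restore, unlock in that order, and any out-of-order call raises. The point is only that warm-up work such as torch.compile happens once, before `snapshot`, and is not repeated after `resume`.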