Support Slurm requeue

Nicholas Luongo requested to merge nicholas/salt:requeue into main

Support Lightning's ability to requeue a Slurm job that reaches the system walltime limit and would otherwise be killed. A new flag is added to submit_slurm.py that enables catching the signal Slurm sends shortly before it kills the job. The job then saves a checkpoint and is requeued. The requeued job loads the state from the checkpoint and resumes training.
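
For reviewers unfamiliar with the mechanism, the catch, checkpoint, requeue sequence described above can be sketched with the standard library alone. This is an illustration, not the code in this merge request and not Lightning's implementation: `build_requeue_command` and `make_requeue_handler` are hypothetical names, and the choice of `SIGUSR1` is an assumption. The sketch only relies on `scontrol requeue <job id>` and the `SLURM_JOB_ID`, `SLURM_ARRAY_JOB_ID` and `SLURM_ARRAY_TASK_ID` environment variables that Slurm sets inside a job.

```python
import os
import signal
import subprocess


def build_requeue_command(environ=None):
    """Build the scontrol command that requeues the current Slurm job."""
    environ = os.environ if environ is None else environ
    job_id = environ.get("SLURM_JOB_ID")
    if job_id is None:
        raise RuntimeError("SLURM_JOB_ID is not set; not running inside a Slurm job")
    # Array tasks are addressed as <array job id>_<task id>.
    array_job_id = environ.get("SLURM_ARRAY_JOB_ID")
    array_task_id = environ.get("SLURM_ARRAY_TASK_ID")
    if array_job_id is not None and array_task_id is not None:
        job_id = f"{array_job_id}_{array_task_id}"
    return ["scontrol", "requeue", job_id]


def make_requeue_handler(save_checkpoint, run_command=subprocess.check_call, environ=None):
    """Return a signal handler that checkpoints first and then requeues.

    save_checkpoint: zero-argument callable (hypothetical hook into training).
    run_command: callable that executes an argv list; injectable for testing.
    """
    def handler(signum, frame):
        save_checkpoint()  # persist state before Slurm kills the job
        run_command(build_requeue_command(environ))
    return handler


# Inside a job, the handler would be registered on the main thread, e.g.:
#   signal.signal(signal.SIGUSR1, make_requeue_handler(my_save_fn))
# SIGUSR1 is an assumption; it must match the signal requested from Slurm.
```

For the handler to fire, the batch script has to ask Slurm for an early warning signal, for example with sbatch's `--signal` option (such as `--signal=SIGUSR1@90` to receive `SIGUSR1` roughly 90 seconds before the time limit). The exact signal and lead time used by this merge request are not stated here.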

  • CI Passing
  • Comments addressed
  • Source branch is up to date with target
