Support Slurm requeue
Enable Lightning's ability to requeue a Slurm job when it reaches the system walltime limit and would otherwise be killed. A new flag is added to submit_slurm.py that tells it to catch the signal Slurm sends shortly before killing the job. The job then saves a checkpoint and requeues itself. The requeued job loads the state from the checkpoint and resumes training.
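
For reviewers unfamiliar with the mechanism, the catch-signal, checkpoint, requeue flow described above can be sketched in plain Python. This is only an illustration and not the code in submit_slurm.py or Lightning's implementation: `RequeueOnSignal` and `requeue_command` are hypothetical names, and the choice of `SIGUSR1` assumes the job was submitted with something like `#SBATCH --signal=SIGUSR1@90`.

```python
import os
import signal


class RequeueOnSignal:
    """Record a Slurm warning signal so the training loop can react to it.

    Hypothetical helper for illustration only.
    """

    def __init__(self, signum=signal.SIGUSR1):
        self.signum = signum
        self.triggered = False
        # Install the handler for the warning signal (assumed to be SIGUSR1).
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; checkpointing happens in the main loop,
        # where it is safe to do heavy work.
        self.triggered = True


def requeue_command(env=os.environ):
    """Build the `scontrol requeue` command for this job, or None outside Slurm."""
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return None
    return ["scontrol", "requeue", job_id]
```

A training loop would check `triggered` between steps, save a checkpoint when it is set, and then run the command returned by `requeue_command()` (for example with `subprocess.run`). Array jobs would need a different job identifier, which this sketch does not handle.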
- CI Passing
- Comments addressed
- Source branch is up to date with target