DRAFT: Enable reliable auto-resume

Daniel Thomas Murnane requested to merge dmurnane_fix_distributed_autoresume into dev

This MR aims to do one thing: fix the broken auto-resume behaviour that appears especially during distributed training. The breakage occurs because training is resumed mid-way through an epoch. We can avoid this by passing `Trainer.fit(ckpt_path="last")`, which uses the checkpoint saved at the end of the last completed epoch. It turns out that, to achieve this, we need to restructure some of the directory management. It's boring, but it works like this:
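A minimal sketch of the resume logic described above, assuming a hypothetical helper name (`resolve_resume_ckpt`) and the `last.ckpt` filename that Lightning's `ModelCheckpoint(save_last=True)` writes: prefer an explicit checkpoint path if one is given, otherwise fall back to the `last.ckpt` in the run's checkpoint directory, otherwise start fresh.

```python
from pathlib import Path
from typing import Optional


def resolve_resume_ckpt(ckpt_dir: str, explicit_path: Optional[str] = None) -> Optional[str]:
    """Pick the checkpoint to resume from (hypothetical helper, not part of this MR's code).

    Preference order:
      1. an explicitly supplied checkpoint path,
      2. the 'last.ckpt' written at the end of the most recent completed epoch,
      3. None, meaning start training from scratch.
    """
    if explicit_path is not None:
        return explicit_path
    last = Path(ckpt_dir) / "last.ckpt"
    if last.is_file():
        return str(last)
    return None


# Usage sketch: trainer.fit(model, ckpt_path=resolve_resume_ckpt(run_ckpt_dir))
```

This is only an illustration of the intended behaviour; the actual directory restructuring in the MR determines where `last.ckpt` lives for each run.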

  • Explicit checkpoint path works as expected
  • Explicit checkpoint dir works as expected
  • SLURM batch resume works as expected (1 GPU)
  • SLURM batch resume works as expected (4 GPUs)
  • Local W&B works as expected
  • No logger, no checkpoint, working as expected
  • Ensure that infer stage works with explicit checkpoint
  • Ensure that infer stage works automatically without checkpoint
  • Ensure that eval stage works with explicit checkpoint
  • Ensure that eval stage works automatically without checkpoint