DRAFT: Enable reliable auto-resume

Daniel Thomas Murnane requested to merge dmurnane_fix_distributed_autoresume into dev

This MR aims to do one thing: fix the broken auto-resume behaviour that appears especially during distributed training. The breakage occurs because training is resumed mid-way through an epoch. We can avoid this by passing `Trainer.fit(ckpt_path="last")`, which uses the checkpoint saved at the end of the last completed epoch. It turns out that, to achieve this, we need to restructure some of the directory management. It's boring, but it works like this:
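A minimal sketch of the resume logic described above, assuming a hypothetical helper name (`resolve_resume_ckpt`) and the `last.ckpt` filename that Lightning's `ModelCheckpoint(save_last=True)` writes: prefer an explicit checkpoint path if one is given, otherwise fall back to the `last.ckpt` in the run's checkpoint directory, otherwise start fresh.

```python
from pathlib import Path
from typing import Optional


def resolve_resume_ckpt(ckpt_dir: str, explicit_path: Optional[str] = None) -> Optional[str]:
    """Pick the checkpoint to resume from (hypothetical helper, not part of this MR's code).

    Preference order:
      1. an explicitly supplied checkpoint path,
      2. the 'last.ckpt' written at the end of the most recent completed epoch,
      3. None, meaning start training from scratch.
    """
    if explicit_path is not None:
        return explicit_path
    last = Path(ckpt_dir) / "last.ckpt"
    if last.is_file():
        return str(last)
    return None


# Usage sketch: trainer.fit(model, ckpt_path=resolve_resume_ckpt(run_ckpt_dir))
```

This is only an illustration of the intended behaviour; the actual directory restructuring in the MR determines where `last.ckpt` lives for each run.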

  • Explicit checkpoint path works as expected
  • Explicit checkpoint dir works as expected
  • SLURM batch resume works as expected (1 GPU)
  • SLURM batch resume works as expected (4 GPUs)
  • Local W&B works as expected
  • No logger, no checkpoint, working as expected
  • Ensure that infer stage works with explicit checkpoint
  • Ensure that infer stage works automatically without checkpoint
  • Ensure that eval stage works with explicit checkpoint
  • Ensure that eval stage works automatically without checkpoint