
Tp improve train logic

Minh Tuan Pham requested to merge tp_improve_train_logic into dev

Some improvements to the training logic. This addresses the problems encountered when resuming from a previous run. In particular:

  • When resuming from a standalone checkpoint in a batch job that does not finish training within the requested job time, each subsequent resumption restarts the trainer from the standalone checkpoint and redoes training that was already accomplished.
  • When a batch job manages to save HPC checkpoints but nevertheless fails, which happens very frequently on Perlmutter, there is no mechanism to specify the slurm job id or the default_root_dir of the previous run for easy resumption.

The goal of this merge request is to address these issues by:

  • Overhauling the mechanism of resuming from a standalone checkpoint. Upon the first initialisation there is no default_root_dir named after the slurm job id, so the trainer resumes training from the standalone checkpoint as usual, creates the default_root_dir named after the slurm job id, and saves an HPC checkpoint if it cannot finish training before the time is up. Upon re-initialisation, however, the trainer must detect the default_root_dir containing an HPC checkpoint and resume training from it instead of from the standalone checkpoint.
  • Adding an option to resume from a previously failed slurm job.
  1. gnn4itk_cf/core/train_stage.py:
  • Add a new argument called checkpoint_resume_dir that specifies a directory from which to resume the run, and pass it on to the core util functions (see the CLI sketch after this list).
  2. gnn4itk_cf/core/core_utils.py:
  • Most of the major changes occur in the get_stage_module method. The new logic is: (1) get default_root_dir, which is guaranteed to be named after the slurm job id when running a batch job; (2) if checkpoint_resume_dir is specified, check that it exists and contains HPC checkpoints, and if so set it as the default_root_dir, otherwise raise an exception; (3) if default_root_dir exists and contains checkpoints, set the latest one as checkpoint_path, regardless of what is passed to the --checkpoint_path argument; (4) initialise from checkpoint_path if it is not None, otherwise instantiate a new model. Steps (3) and (4) guarantee that the first initialisation from a checkpoint proceeds as usual, while the second initialisation starts from the HPC checkpoint, avoiding repeated training (see the resumption sketch after this list).
  • Some minor changes to the get_trainer method.
  3. gnn4itk_cf/stages/edge_classifier:
  • Addition of proportional weighting of the loss function, and separation of the loss into a positive and a negative component. All three values are logged during training and validation. The default loss balancing is changed to the proportional loss, i.e. L=\frac{1}{N_{-}+N_{+}}\sum_i(y_i-f(x_i)) (a sketch of the weighting follows this list).
  • Addition of GNNFilter class.
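
For illustration, here is a minimal sketch of how checkpoint_resume_dir could be exposed in train_stage.py, assuming a click-based entry point; the option names, the load_config helper, and the exact get_stage_module / get_trainer signatures shown here are assumptions, not the actual implementation.

```python
# Hypothetical wiring of the new flag; only --checkpoint_resume_dir is new.
import click

from gnn4itk_cf.core import core_utils


@click.command()
@click.argument("config_file")
@click.option("--checkpoint", default=None,
              help="Standalone checkpoint to resume from.")
@click.option("--checkpoint_resume_dir", default=None,
              help="default_root_dir of a previous (failed) slurm job "
                   "containing HPC checkpoints.")
def main(config_file, checkpoint, checkpoint_resume_dir):
    config = core_utils.load_config(config_file)  # hypothetical helper name
    # The new argument is simply threaded through to the core utilities,
    # which decide which checkpoint actually takes precedence.
    stage_module, config, default_root_dir = core_utils.get_stage_module(
        config,
        checkpoint_path=checkpoint,
        checkpoint_resume_dir=checkpoint_resume_dir,
    )
    trainer = core_utils.get_trainer(config, default_root_dir)
    trainer.fit(stage_module)


if __name__ == "__main__":
    main()
```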
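A self-contained sketch of the checkpoint-resolution logic in steps (1)–(4); the SLURM_JOB_ID-based directory naming, the "stage_dir" config key, and the *.ckpt glob are assumptions, so this illustrates the idea rather than reproducing the get_stage_module code.

```python
import os
from pathlib import Path


def resolve_checkpoint(config, checkpoint_path=None, checkpoint_resume_dir=None):
    """Return (checkpoint_path, default_root_dir) following steps (1)-(4)."""
    # (1) default_root_dir is tied to the slurm job id when running a batch
    #     job, so every resubmission of the same job lands in the same place.
    job_id = os.environ.get("SLURM_JOB_ID")
    default_root_dir = Path(config["stage_dir"]) / (job_id or "local_run")

    # (2) Resuming a previously failed job: the user-supplied directory must
    #     exist and already contain HPC checkpoints, otherwise raise.
    if checkpoint_resume_dir is not None:
        resume_dir = Path(checkpoint_resume_dir)
        if not resume_dir.is_dir() or not list(resume_dir.glob("**/*.ckpt")):
            raise FileNotFoundError(
                f"{resume_dir} does not exist or contains no HPC checkpoints"
            )
        default_root_dir = resume_dir

    # (3) If default_root_dir already holds checkpoints (a re-initialisation
    #     after a timed-out job), the latest one overrides --checkpoint_path.
    existing = sorted(default_root_dir.glob("**/*.ckpt"),
                      key=lambda p: p.stat().st_mtime)
    if existing:
        checkpoint_path = str(existing[-1])

    # (4) The caller initialises the stage module from checkpoint_path if it
    #     is not None, and instantiates a fresh model otherwise.
    return checkpoint_path, str(default_root_dir)
```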
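A sketch of the proportional balancing and the positive/negative loss split, assuming a binary-cross-entropy edge loss (the formula above only fixes the 1/(N_- + N_+) normalisation); the names loss_pos and loss_neg are illustrative, not the logged metric names.

```python
import torch
import torch.nn.functional as F


def edge_classification_loss(logits, truth):
    """logits, truth: 1D tensors over edges; truth holds 0/1 labels."""
    pos_mask = truth == 1
    neg_mask = ~pos_mask

    per_edge = F.binary_cross_entropy_with_logits(
        logits, truth.float(), reduction="none"
    )

    # Separate contributions of true and fake edges so both can be logged.
    loss_pos = per_edge[pos_mask].sum()
    loss_neg = per_edge[neg_mask].sum()

    # Proportional balancing: normalise by the total number of edges
    # N_+ + N_-, so each edge contributes with its natural proportion.
    loss = (loss_pos + loss_neg) / truth.numel()
    return loss, loss_pos, loss_neg
```

In a Lightning module the three returned values would then be logged with self.log in both the training and validation steps.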
