
Resuming training with full training state restored

Jay Chan requested to merge jay_fix_manual_resume_training into dev

Addressing issue #92 (closed)

First checks whether any checkpoint is associated with the Slurm job ID. If so, training resumes from the latest checkpoint associated with that job ID. If not, the checkpoint specified in the user arguments is used.
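
The lookup order described above can be sketched roughly as follows. This is only an illustration, not the code in this merge request: the function name `resolve_checkpoint`, the `job_<id>_epoch_<n>.ckpt` file naming scheme, and the use of the epoch number to decide which checkpoint is "latest" are all assumptions. The only non-hypothetical piece is `SLURM_JOB_ID`, the environment variable Slurm sets inside a running job.

```python
import os
import re
from pathlib import Path
from typing import Optional


def resolve_checkpoint(ckpt_dir: str, user_ckpt: Optional[str],
                       job_id: Optional[str] = None) -> Optional[str]:
    """Return the checkpoint path to resume from (hypothetical helper)."""
    # Slurm exports SLURM_JOB_ID inside a running job.
    job_id = job_id or os.environ.get("SLURM_JOB_ID")
    if job_id:
        # Assumed naming scheme: job_<id>_epoch_<n>.ckpt
        pattern = re.compile(rf"job_{re.escape(job_id)}_epoch_(\d+)\.ckpt")
        found = []
        for path in Path(ckpt_dir).glob("*.ckpt"):
            match = pattern.fullmatch(path.name)
            if match:
                found.append((int(match.group(1)), path))
        if found:
            # Treat the highest epoch number as the latest checkpoint.
            latest = max(found, key=lambda item: item[0])
            return str(latest[1])
    # No checkpoint for this job: fall back to the user-specified one.
    return user_ckpt
```

With this ordering, a requeued job picks up its own progress automatically, while a fresh job still honors the checkpoint given on the command line.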

Also adds an option to load only the model parameters instead of the full training state.
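
A framework-agnostic sketch of what such an option might look like is shown below. The checkpoint layout (a dict with `"model"`, `"optimizer"`, and `"epoch"` keys), the function name `select_state`, and the `model_only` flag are hypothetical and not taken from this merge request.

```python
from typing import Any, Dict


def select_state(checkpoint: Dict[str, Any],
                 model_only: bool = False) -> Dict[str, Any]:
    """Choose which parts of a checkpoint to restore (hypothetical helper)."""
    if model_only:
        # Keep only the model parameters; optimizer state, epoch counter,
        # and any other training state are left out.
        return {"model": checkpoint["model"]}
    # Full resume: return a shallow copy of everything that was saved.
    return dict(checkpoint)
```

For example, `select_state(ckpt, model_only=True)` would suit fine-tuning from pretrained weights, while the default restores the full state to continue an interrupted run.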

Edited by Jay Chan
