Resuming training with full training state restored
Addressing issue #92 (closed)
The training will first check whether there is a checkpoint associated with the SLURM job ID. If so, training resumes from the latest checkpoint associated with that job ID; if not, it uses the checkpoint specified in the user arguments.
In addition, an option was added to load only the model parameters instead of the full training state.
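For reference, the resumption logic roughly follows the minimal sketch below. The helper names and the `checkpoints/<job_id>` directory layout are illustrative, not the actual acorn code; only the `load_only_model_parameters` behaviour mirrors the option described above.

```python
import glob
import os

import torch


def resolve_checkpoint(user_checkpoint=None):
    """Pick the checkpoint to resume training from.

    Prefer the latest checkpoint written under the current SLURM job ID,
    so an automatically requeued job continues where it left off; otherwise
    fall back to the checkpoint given in the user arguments (which may be
    None, meaning start from scratch).
    """
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is not None:
        # Hypothetical layout: one directory of *.ckpt files per job ID.
        candidates = sorted(
            glob.glob(os.path.join("checkpoints", job_id, "*.ckpt")),
            key=os.path.getmtime,
        )
        if candidates:
            return candidates[-1]  # latest checkpoint for this job
    return user_checkpoint


def load_for_training(model, checkpoint_path, load_only_model_parameters=False):
    """Return the model and the ckpt_path to hand to the trainer."""
    if checkpoint_path is None or not load_only_model_parameters:
        # Full resume: pass the path to the trainer so the optimizer,
        # LR scheduler and epoch/step counters are restored as well,
        # e.g. trainer.fit(model, ckpt_path=checkpoint_path).
        return model, checkpoint_path
    # Weights-only resume: restore the parameters and start with a fresh
    # optimizer/scheduler state.
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["state_dict"])
    return model, None
```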
Activity
- Resolved by Jay Chan
Hi @cchan, thanks for the MR. Have you made sure that if you start the training from a checkpoint, it will still continue from the HPC checkpoint upon automatic resumption when the training doesn't finish within one allocation?
Could you also update the branch with dev?
Edited by Minh Tuan Pham
added 68 commits
- 2beaefcc...3aeb24c4 - 66 commits from branch dev
- 94c2b022 - Merge branch 'dev' into jay_fix_manual_resume_training
- cc2179b9 - add option to not load training state
- Resolved by Xiangyang Ju
- Resolved by Xiangyang Ju
What's the difference between entrypoint_stage_cf and entrypoint_stage? Do you plan to keep maintaining entrypoint_stage_cf? I guess we will remove the old CommonFramework at some point to avoid possible confusion.
added 5 commits
- ee48da47...ece06aa6 - 2 commits from branch dev
- e07f2628 - fix
- 257c04ed - add option to not load training state
- 8aaafd5b - merge codes
enabled an automatic merge when the pipeline for 8aaafd5b succeeds
mentioned in commit 600341aa
@cchan This is actually breaking the dev branch. Super important: Can you provide a quick MR that does the following:
- Adds load_only_model_parameters as a parameter to the train function here: https://gitlab.cern.ch/gnn4itkteam/acorn/-/blob/8aaafd5b7a70f0e81a452b5e3b0f29b444d34125/acorn/core/entrypoint_stage.py#L26
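A minimal sketch of what that could look like, assuming a click-style CLI; the actual decorator stack and `train()` signature in `entrypoint_stage.py` differ, and `run_training` is a placeholder for the existing training call:

```python
import click


def run_training(config_file, checkpoint=None, load_only_model_parameters=False):
    """Placeholder for the stage's actual training loop."""
    print(config_file, checkpoint, load_only_model_parameters)


@click.command()
@click.argument("config_file")
@click.option("--checkpoint", default=None, help="Checkpoint to resume from.")
@click.option(
    "--load_only_model_parameters",
    is_flag=True,
    default=False,
    help="Restore only the model weights, not the full training state.",
)
def train(config_file, checkpoint, load_only_model_parameters):
    # Forward the new flag so callers can opt out of restoring the
    # optimizer/scheduler/epoch state when loading a checkpoint.
    run_training(config_file, checkpoint, load_only_model_parameters)


if __name__ == "__main__":
    train()
```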
Hi @dmurnane sorry about that! I have created the MR here: !117 (merged).
mentioned in merge request !117 (merged)