Resuming training with full training state restored
Addressing issue #92 (closed)
The training will first check whether there is a checkpoint associated with the SLURM job ID. If so, training resumes from the latest checkpoint associated with that job ID; if not, it uses the checkpoint specified in the user arguments.
In addition, an option was added to load only the model parameters instead of the full training state.
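For reference, the resumption logic roughly follows the minimal sketch below. The helper names and the `checkpoints/<job_id>` directory layout are illustrative, not the actual acorn code; only the `load_only_model_parameters` behaviour mirrors the option described above.

```python
import glob
import os

import torch


def resolve_checkpoint(user_checkpoint=None):
    """Pick the checkpoint to resume training from.

    Prefer the latest checkpoint written under the current SLURM job ID,
    so an automatically requeued job continues where it left off; otherwise
    fall back to the checkpoint given in the user arguments (which may be
    None, meaning start from scratch).
    """
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is not None:
        # Hypothetical layout: one directory of *.ckpt files per job ID.
        candidates = sorted(
            glob.glob(os.path.join("checkpoints", job_id, "*.ckpt")),
            key=os.path.getmtime,
        )
        if candidates:
            return candidates[-1]  # latest checkpoint for this job
    return user_checkpoint


def load_for_training(model, checkpoint_path, load_only_model_parameters=False):
    """Return the model and the ckpt_path to hand to the trainer."""
    if checkpoint_path is None or not load_only_model_parameters:
        # Full resume: pass the path to the trainer so the optimizer,
        # LR scheduler and epoch/step counters are restored as well,
        # e.g. trainer.fit(model, ckpt_path=checkpoint_path).
        return model, checkpoint_path
    # Weights-only resume: restore the parameters and start with a fresh
    # optimizer/scheduler state.
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["state_dict"])
    return model, None
```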
Activity
- Resolved by Jay Chan
Hi @cchan, thanks for the MR. Have you made sure that if you start the training from a checkpoint, it will still continue from the HPC checkpoint upon automatic resumption when the training doesn't finish within one allocation?
Could you also update the branch with dev?
Edited by Minh Tuan Pham
added 68 commits
- 2beaefcc...3aeb24c4 - 66 commits from branch dev
- 94c2b022 - Merge branch 'dev' into jay_fix_manual_resume_training
- cc2179b9 - add option to not load training state
- Resolved by Xiangyang Ju
- Resolved by Xiangyang Ju
What's the difference between entrypoint_stage_cf and entrypoint_stage? Do you plan to keep maintaining entrypoint_stage_cf? I guess we will remove the old CommonFramework at some point to avoid possible confusion.
added 5 commits
- ee48da47...ece06aa6 - 2 commits from branch dev
- e07f2628 - fix
- 257c04ed - add option to not load training state
- 8aaafd5b - merge codes
enabled an automatic merge when the pipeline for 8aaafd5b succeeds
mentioned in commit 600341aa
@cchan This is actually breaking the dev branch. Super important: Can you provide a quick MR that does the following:
- Adds load_only_model_parameters as a parameter to the train function here: https://gitlab.cern.ch/gnn4itkteam/acorn/-/blob/8aaafd5b7a70f0e81a452b5e3b0f29b444d34125/acorn/core/entrypoint_stage.py#L26
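A minimal sketch of what that could look like, assuming a click-style CLI; the actual decorator stack and `train()` signature in `entrypoint_stage.py` differ, and `run_training` is a placeholder for the existing training call:

```python
import click


def run_training(config_file, checkpoint=None, load_only_model_parameters=False):
    """Placeholder for the stage's actual training loop."""
    print(config_file, checkpoint, load_only_model_parameters)


@click.command()
@click.argument("config_file")
@click.option("--checkpoint", default=None, help="Checkpoint to resume from.")
@click.option(
    "--load_only_model_parameters",
    is_flag=True,
    default=False,
    help="Restore only the model weights, not the full training state.",
)
def train(config_file, checkpoint, load_only_model_parameters):
    # Forward the new flag so callers can opt out of restoring the
    # optimizer/scheduler/epoch state when loading a checkpoint.
    run_training(config_file, checkpoint, load_only_model_parameters)


if __name__ == "__main__":
    train()
```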
Hi @dmurnane sorry about that! I have created the MR here: !117 (merged).
mentioned in merge request !117 (merged)