Resuming training with full training state restored

Merged Jay Chan requested to merge jay_fix_manual_resume_training into dev
1 unresolved thread

Addressing issue #92 (closed)

Will first check whether any checkpoint is associated with the Slurm job ID. If so, training resumes from the latest checkpoint associated with that job ID. If not, the checkpoint specified in the user arguments is used.

In addition, added an option to load only the model parameters instead of the full training state.
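
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: the directory layout (`<checkpoint_dir>/<job_id>/*.ckpt`), the function names, and the `state_dict` key are all assumptions made for the example. `SLURM_JOB_ID` is the environment variable Slurm sets inside a job.

```python
import os
from pathlib import Path


def find_latest_job_checkpoint(checkpoint_dir, job_id):
    """Return the newest *.ckpt under <checkpoint_dir>/<job_id>/, or None.

    The per-job subdirectory layout is a hypothetical convention.
    """
    if not job_id:
        return None
    job_dir = Path(checkpoint_dir) / str(job_id)
    candidates = list(job_dir.glob("*.ckpt"))
    if not candidates:
        return None
    # "Latest" is taken to mean most recently modified.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resolve_checkpoint(checkpoint_dir, user_checkpoint=None, job_id=None):
    """Prefer a checkpoint tied to the Slurm job; else use the user's choice."""
    if job_id is None:
        job_id = os.environ.get("SLURM_JOB_ID")
    latest = find_latest_job_checkpoint(checkpoint_dir, job_id)
    if latest is not None:
        return latest
    return Path(user_checkpoint) if user_checkpoint else None


def select_state(checkpoint, weights_only):
    """Keep only the model parameters when weights_only is set.

    Assumes the checkpoint is a dict with a "state_dict" entry.
    """
    if weights_only:
        return {"state_dict": checkpoint["state_dict"]}
    return checkpoint
```

With this ordering, a requeued job with the same job ID picks up its own latest checkpoint automatically, while a fresh job falls back to whatever checkpoint the user passed in.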

Edited by Jay Chan

Activity

    • Resolved by Xiangyang Ju

      What's the difference between entrypoint_stage_cf and entrypoint_stage? Do you plan to keep maintaining entrypoint_stage_cf? I guess we will remove the old CommonFramework at some point to avoid possible confusion.

  • Jay Chan added 1 commit

  • Xiangyang Ju approved this merge request

  • Minh Tuan Pham approved this merge request

  • Jay Chan resolved all threads

  • Jay Chan added 5 commits

  • Jay Chan enabled an automatic merge when the pipeline for 8aaafd5b succeeds

  • Jay Chan mentioned in commit 600341aa

  • merged

  • Jay Chan mentioned in merge request !117 (merged)
