Draft: Heterogeneous dev
Following are the major changes in this merge request:
Most of these changes are geared towards SLURM job submission, such as on Perlmutter.
- Changes to `core_utils.py`: By default, a new batch job uses the SLURM job id as the id and display name on Wandb, as well as a subdirectory of `os.getcwd()` in which HPC checkpoints are saved. However, if the job fails, it cannot be resumed and logged to the same Wandb id/display name, making it difficult to track on Wandb. With these changes, the user can optionally provide the SLURM job id of the run they want to resume, which is easily found on Wandb. The CTF then sets the default working dir to `f"{os.getcwd()}/{slurm_jobid}"`, resumes from the last HPC checkpoint, and logs to the Wandb id of the last run.
- Changes to `train_stage.py`: Add an option to provide a SLURM job id to resume a job.
- Changes to `infer_stage.py`: Explicitly provide a checkpoint to run inference, instead of searching for the latest checkpoint by default. (Could consider providing the checkpoint within the infer_config.)
- Changes to `eval_stage.py`: Explicitly provide a checkpoint to run evaluation, instead of searching for the latest checkpoint by default.
- Addition of a heterogeneous graph dataset and a heterogeneous model under `stages/edge_classifier`.
- Changes to `edge_classifier_stage.py`: Move all plotting functions out of the LightningModule.
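
The resume flow described for `core_utils.py` might look roughly like this minimal sketch. The function name `resolve_run_context` and the `last.ckpt` filename are illustrative assumptions, not the actual implementation:

```python
import os


def resolve_run_context(resume_slurm_jobid=None):
    """Pick the working directory, Wandb run id, and resume checkpoint.

    Hypothetical sketch: if the user passes the SLURM job id of an earlier
    run, reuse its working directory (and hence its checkpoints and Wandb
    id); otherwise start fresh under the current job's id.
    """
    # SLURM exports the current job's id in this environment variable.
    current_jobid = os.environ.get("SLURM_JOB_ID", "local")
    jobid = resume_slurm_jobid or current_jobid
    # Working dir defaults to a per-job subdirectory of the launch dir.
    workdir = os.path.join(os.getcwd(), jobid)
    os.makedirs(workdir, exist_ok=True)
    # Resume only if an earlier run left a checkpoint behind
    # ("last.ckpt" is an assumed filename).
    ckpt = os.path.join(workdir, "last.ckpt")
    resume_ckpt = ckpt if os.path.exists(ckpt) else None
    return workdir, jobid, resume_ckpt
```

The returned `jobid` doubles as the Wandb id/display name, so a resumed run continues logging to the same Wandb entry.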
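
For the `infer_stage.py` / `eval_stage.py` change, an explicit checkpoint argument could be wired up along these lines. This is a hedged sketch; the real stages' argument handling may differ:

```python
import argparse


def parse_stage_args(argv=None):
    """Illustrative CLI for the infer/eval stages: the checkpoint is a
    required, explicit argument rather than discovered automatically."""
    parser = argparse.ArgumentParser(
        description="Run inference/evaluation from an explicit checkpoint."
    )
    parser.add_argument(
        "--checkpoint",
        required=True,
        help="Path to the model checkpoint to load (no latest-checkpoint search).",
    )
    return parser.parse_args(argv)
```

Making the checkpoint explicit avoids silently picking up a stale "latest" checkpoint when several runs share a directory; as noted above, the path could alternatively live in the infer_config.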