Draft: Heterogeneous dev
Following are the major changes in this merge request:
Most of these changes are geared towards SLURM job submission, such as on Perlmutter.
- Changes to `core_utils.py`: By default, a new batch job uses the SLURM job id as the id and display name on Wandb, as well as a subdirectory of `os.getcwd()` in which HPC checkpoints are saved. However, if the job fails, it cannot be resumed and logged to the same Wandb id/display name, making it difficult to track on Wandb. With these changes, the user can optionally provide the SLURM job id of the run they want to resume, which is easily found on Wandb. The CTF then sets the default working dir to `f"{os.getcwd()}/{slurm_jobid}"`, resumes from the last HPC checkpoint, and logs to the Wandb id of the last run.
- Changes to `train_stage.py`: Add an option to provide a SLURM job id to resume a job.
- Changes to `infer_stage.py`: Explicitly provide a checkpoint to run inference, instead of searching for the latest checkpoint by default. (Could consider providing the checkpoint within the infer_config.)
- Changes to `eval_stage.py`: Explicitly provide a checkpoint to run evaluation, instead of searching for the latest checkpoint by default.
- Addition of a heterogeneous graph dataset and a heterogeneous model under `stages/edge_classifier`.
- Changes to `edge_classifier_stage.py`: Move all plotting functions out of the LightningModule.
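
The resume flow described for `core_utils.py` might look roughly like this minimal sketch. The function name `resolve_run_context` and the `last.ckpt` filename are illustrative assumptions, not the actual implementation:

```python
import os


def resolve_run_context(resume_slurm_jobid=None):
    """Pick the working directory, Wandb run id, and resume checkpoint.

    Hypothetical sketch: if the user passes the SLURM job id of an earlier
    run, reuse its working directory (and hence its checkpoints and Wandb
    id); otherwise start fresh under the current job's id.
    """
    # SLURM exports the current job's id in this environment variable.
    current_jobid = os.environ.get("SLURM_JOB_ID", "local")
    jobid = resume_slurm_jobid or current_jobid
    # Working dir defaults to a per-job subdirectory of the launch dir.
    workdir = os.path.join(os.getcwd(), jobid)
    os.makedirs(workdir, exist_ok=True)
    # Resume only if an earlier run left a checkpoint behind
    # ("last.ckpt" is an assumed filename).
    ckpt = os.path.join(workdir, "last.ckpt")
    resume_ckpt = ckpt if os.path.exists(ckpt) else None
    return workdir, jobid, resume_ckpt
```

The returned `jobid` doubles as the Wandb id/display name, so a resumed run continues logging to the same Wandb entry.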
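
For the `infer_stage.py` / `eval_stage.py` change, an explicit checkpoint argument could be wired up along these lines. This is a hedged sketch; the real stages' argument handling may differ:

```python
import argparse


def parse_stage_args(argv=None):
    """Illustrative CLI for the infer/eval stages: the checkpoint is a
    required, explicit argument rather than discovered automatically."""
    parser = argparse.ArgumentParser(
        description="Run inference/evaluation from an explicit checkpoint."
    )
    parser.add_argument(
        "--checkpoint",
        required=True,
        help="Path to the model checkpoint to load (no latest-checkpoint search).",
    )
    return parser.parse_args(argv)
```

Making the checkpoint explicit avoids silently picking up a stale "latest" checkpoint when several runs share a directory; as noted above, the path could alternatively live in the infer_config.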