Problem with PyTorch Lightning model training on GPU with SLURM batch jobs
I am getting the same error (see below) when training the PyTorch Lightning model with the latest acorn dev branch.
In the configuration, I use:
accelerator: cuda
devices: 1
nodes: 1
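As far as I understand, acorn forwards these values to the PyTorch Lightning Trainer roughly as follows (just a sketch of my understanding, assuming the values are passed straight through; num_nodes is the Trainer's name for the nodes setting):

import pytorch_lightning as pl

# Rough equivalent of the config above: one CUDA device on one node.
# Under SLURM, Lightning additionally auto-detects the cluster environment
# from the SLURM_* variables set by sbatch/srun.
trainer = pl.Trainer(
    accelerator="cuda",
    devices=1,
    num_nodes=1,
)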
The SLURM batch job script is:
#!/bin/bash
#SBATCH -A m2616_g
#SBATCH -q regular
#SBATCH -C gpu&hbm80g
#SBATCH -t 12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH -c 32
#SBATCH -o logs/%x-%j.out
#SBATCH -J Acorn-train
#SBATCH --gpu-bind=none
#SBATCH --comment=96:00:00
#SBATCH --signal=SIGUSR1@90
#SBATCH --requeue
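# SIGUSR1 sent 90 s before the time limit triggers PyTorch Lightning's SLURM
# auto-resubmit (checkpoint and requeue); --requeue allows the job to be requeued.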
# This is a generic script for submitting training jobs to Cori-GPU.
# You need to supply the config file with this script.
# Setup
mkdir -p logs
eval "$(conda shell.bash hook)"
# module load python/3.9-anaconda-2021.11
conda activate acorn
export SLURM_CPU_BIND="cores"
export WANDB__SERVICE_WAIT=300
echo -e "\nStarting training\n"
# Single GPU training
srun acorn train "$@"
Below is my acorn environment:
Some more information:
- This error only occurs when I submit a SLURM batch job on Perlmutter; there is no problem when I run the same training interactively (see the environment check sketched after this list).
- I get the same error with both 1 GPU and multiple GPUs.
- I can run the exact same job script without problems with the old acorn version (PyTorch 1).
- I can run the py module map in a SLURM batch job without problems.
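For what it's worth, a small check like the one below (just a sketch, assuming a recent pytorch_lightning that exposes SLURMEnvironment.detect()) should show what Lightning sees differently between the batch and interactive cases:

import os

import torch
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Dump the SLURM_* variables, which differ between sbatch and interactive runs;
# Lightning decides from these whether to use its SLURM cluster environment.
for key in sorted(os.environ):
    if key.startswith("SLURM_"):
        print(f"{key}={os.environ[key]}")

print("SLURMEnvironment detected:", SLURMEnvironment.detect())
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())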