Stale ROCm jobs
It seems that some HIP jobs become stale.
As a consequence, some HIP jobs fail with "out of memory" errors, which is expected given that another job is already using all the memory available on the GPU:
E.g. https://gitlab.cern.ch/lhcb/Allen/-/jobs/20886803
It would be useful to print the GPU usage in HIP jobs, similarly to how it is done for CUDA (i.e. using rocm-smi). (CC @roneil)
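A rough sketch of what such a printout step could look like, assuming it is run from a small Python helper at the start of the job. The helper name gpu_usage_report and its parameters are hypothetical and not part of Allen or its CI scripts. It only calls rocm-smi if it is found on PATH and never fails the job by itself:

```python
import shutil
import subprocess


def gpu_usage_report(tool="rocm-smi", extra_args=(), timeout=30):
    """Return the tool's output as text, or a short note if it cannot be run.

    Hypothetical helper: the name and parameters are assumptions for
    illustration only.
    """
    path = shutil.which(tool)
    if path is None:
        # Do not fail the job just because the diagnostic tool is missing.
        return f"{tool} not found on PATH; skipping GPU usage report"
    try:
        result = subprocess.run(
            [path, *extra_args],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # a non-zero exit code is reported, not raised
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{tool} could not be run: {exc}"
    return result.stdout + result.stderr


if __name__ == "__main__":
    # With no extra arguments, rocm-smi prints its default summary table.
    print(gpu_usage_report())
```

Extra arguments such as ("--showmeminfo", "vram") or ("--showpids",) could be passed to show memory usage or the processes holding the GPU, but the available options should be checked against the rocm-smi version installed on the runner.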
A proper fix, however, would require understanding why these jobs become stale and addressing the underlying cause.