MI100 failure in run_throughput CI test
The run_throughput test fails intermittently on the MI100 with one of the following two messages:
- See for example here
scripts/ci/jobs/run_throughput.sh: line 119: 6554 Aborted HSA_NO_SCRATCH_RECLAIM=1 GPU_MAX_HW_QUEUES=8 HIP_VISIBLE_DEVICES=0 numactl --cpunodebind=1 --membind=1 ./toolchain/wrapper ./Allen --mdf /scratch/allen_data/mdf_input/upgrade_mc_minbias_scifi_v5_retinacluster_000_v1.mdf --sequence hlt1_pp_default --run-from-json 1 --params ../input/PARAM/ParamFiles/ -n 2800 --events-per-slice 2800 -m 2800 -t 10 -r 100
- See for example here
Failed to run hipMalloc(devPtr, size)
hipErrorOutOfMemory (2) at ../backend/include/HIPBackend.h: 136
terminate called after throwing an instance of 'std::invalid_argument'
what(): hipCheck failed
The first failures appeared after !772 was merged.
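To count how often each of the two messages quoted above occurs across CI runs, a small log classifier may help. The following is only a sketch: the function name `classify_failure` and the labels `abort` and `hip_out_of_memory` are hypothetical, and the patterns assume the job logs contain the same text as the excerpts in this issue.

```python
import re

# Hypothetical labels. The patterns are taken from the two messages quoted
# in this issue and assume the CI logs use the same wording.
SIGNATURES = {
    # Shell report that the Allen process was aborted inside run_throughput.sh
    "abort": re.compile(r"run_throughput\.sh: line \d+:\s*\d+ Aborted"),
    # HIP allocation failure reported by the backend error check
    "hip_out_of_memory": re.compile(r"hipErrorOutOfMemory"),
}


def classify_failure(log_text):
    """Return the sorted labels of all known failure signatures found in log_text."""
    return sorted(
        label for label, pattern in SIGNATURES.items() if pattern.search(log_text)
    )
```

Running this over the saved logs of failed jobs would show whether the two messages occur independently or together, which could help narrow down what changed with !772.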
Edited by Dorothea Vom Bruch