MI100 failure in run_throughput CI test

The run_throughput test fails intermittently on the MI100 with one of the following two messages:

  1. Process abort (see for example here):
scripts/ci/jobs/run_throughput.sh: line 119:  6554 Aborted                 HSA_NO_SCRATCH_RECLAIM=1 GPU_MAX_HW_QUEUES=8 HIP_VISIBLE_DEVICES=0 numactl --cpunodebind=1 --membind=1 ./toolchain/wrapper ./Allen --mdf /scratch/allen_data/mdf_input/upgrade_mc_minbias_scifi_v5_retinacluster_000_v1.mdf --sequence hlt1_pp_default --run-from-json 1 --params ../input/PARAM/ParamFiles/ -n 2800 --events-per-slice 2800 -m 2800 -t 10 -r 100
  2. Out-of-memory error in hipMalloc (see for example here):
Failed to run hipMalloc(devPtr, size)
hipErrorOutOfMemory (2) at ../backend/include/HIPBackend.h: 136
terminate called after throwing an instance of 'std::invalid_argument'
  what():  hipCheck failed
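
To track how often each of the two messages shows up across CI runs, a small log classifier may help. The following is only a sketch: `classify_failure` is a hypothetical helper, not part of Allen or its CI scripts, and the patterns are taken solely from the two messages quoted above. The out-of-memory pattern is checked first on the assumption that it is the more specific of the two.

```python
import re

# Patterns derived from the two failure messages quoted in this issue.
# They are assumptions about what identifies each case, not an official list.
ABORT_RE = re.compile(r"run_throughput\.sh: line \d+:\s+\d+ Aborted")
OOM_RE = re.compile(r"hipErrorOutOfMemory")


def classify_failure(log_text: str) -> str:
    """Return 'oom', 'abort', or 'unknown' for the text of a CI job log.

    Hypothetical helper for triage only.
    """
    # Check the out-of-memory signature first: it is the more specific one.
    if OOM_RE.search(log_text):
        return "oom"
    if ABORT_RE.search(log_text):
        return "abort"
    return "unknown"
```

Running this over the saved logs of failed run_throughput jobs would give a rough count per failure type, which could help decide whether the two messages are independent problems.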

The first failures appeared after the merge of !772.

Edited by Dorothea Vom Bruch