ROCm 5 memory corruption issue with contracts
Since moving to ROCm 5 there are two jobs that consistently fail:
To investigate, I tested three builds: one with ROCm 4.2.0, one with ROCm 5.0.0 and one with ROCm 5.0.2. Each build and run uses the parameters from one of the tests, namely:
source /cvmfs/lhcb.cern.ch/lib/LbEnv
source /cvmfs/lhcbdev.cern.ch/tools/rocm-5.0.0/setenv.sh
export CMAKE_TOOLCHAIN_FILE=/cvmfs/lhcb.cern.ch/lib/lhcb/lcg-toolchains/LCG_101/x86_64-centos7-clang12-opt.cmake
cmake -DSTANDALONE=ON -DTARGET_DEVICE=HIP -DBUILD_TESTING=ON -DENABLE_CONTRACTS=ON -GNinja -DSEQUENCES=all .. && ninja
numactl --cpunodebind=1 --membind=1 ./toolchain/wrapper ./Allen --mdf /scratch/allen_data/mdf_input/upgrade_mc_minbias_scifi_v5_retinacluster_000.mdf --sequence hlt1_pp_default --run-from-json 1 -n 2800 --events-per-slice 2800 -m 2800 -t 10 -r 1000
- The build with ROCm 4.2.0 seems to work fine.
- The builds with either ROCm 5.0.0 or ROCm 5.0.2 fail with one of the following two errors, which one occurs varying at random between runs:
terminate called after throwing an instance of 'Allen::contract::ContractException'
what(): Contract exception in algorithm prefix_sum_ut_tracks, postcondition Allen::contract::consecutive_condition<host_prefix_sum::Parameters::host_output_buffer_t, host_prefix_sum::Parameters, Allen::contract::Postcondition, std::greater_equal<unsigned int> >: Require condition std::greater_equal<unsigned int> on consecutive elements of prefix_sum_ut_tracks__host_output_buffer_t
Aborted
:0:rocdevice.cpp :2603: 0439032652 us: 19441: [tid:0x7f988690c700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
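Judging from the exception message, the postcondition in the first error requires every element of the prefix sum output buffer to be greater than or equal to its predecessor. A minimal standalone sketch of such a check is given here; `first_violation` is a hypothetical helper written for illustration, not Allen's contract implementation. It may help locate where a dumped buffer stops being non-decreasing.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of Allen): checks that cmp(v[i + 1], v[i])
// holds for every pair of consecutive elements.
// Returns the index i of the first pair that violates the condition,
// or v.size() if no pair violates it.
template<typename T, typename Compare = std::greater_equal<T>>
std::size_t first_violation(const std::vector<T>& v, Compare cmp = Compare{})
{
  for (std::size_t i = 0; i + 1 < v.size(); ++i) {
    if (!cmp(v[i + 1], v[i])) {
      return i;
    }
  }
  return v.size();
}
```

With the default comparator, a valid prefix sum such as `{0, 3, 3, 7}` yields `v.size()`, while `{0, 3, 2, 7}` yields index 1, the position just before the element that decreases.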
Rolling back to ROCm 4.2.0 fixes this, but has the undesired side effect that one of the builds has not compiled properly since !797 (merged); see in particular !797 (comment 5404854).
As a way forward, I suggest we move to ROCm 5.x and revisit this error once the next version of HIP is released.