AMD improvements (!574) · Merge requests · LHCb / Allen

Performance improvements for the AMD MI100:

Use shfl_down intrinsics in pv_beamline_multi_fitter on AMD hardware.
Optimize is_muon for better memory efficiency: use a one-dimensional block, iterate over stations first, and cache muon_foi in shared memory.
Set launch_bounds in LFTripletSeeding.
Set default block_dim_y to 128 on UT SearchWindows.
Set default block_dim to 1024 on VeloConsolidateTracks.
Use the latest ROCm release, 4.2.0 (thanks @rschwemm and @bcouturi).
Use the following launch parameters on AMD hardware: HSA_NO_SCRATCH_RECLAIM=1 GPU_MAX_HW_QUEUES=8 HIP_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./Allen -f /scratch/allen_data/minbias_mag_down_201907 -n 5000 --events-per-slice 5000 -t 10 -r 1000 -m 3000
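
For readers unfamiliar with the shuffle-down pattern mentioned for pv_beamline_multi_fitter, here is a minimal host-side sketch of the data movement. It is not the code in this merge request: it emulates a wavefront as a plain array so that it runs without a GPU, and the names `shfl_down_emulated` and `wavefront_sum` are hypothetical. On the device, the copy step would be done by the shuffle-down intrinsic, which exchanges registers between lanes without going through shared memory. The sketch assumes a non-empty, power-of-two width (an MI100 wavefront has 64 lanes).

```cpp
#include <cstddef>
#include <vector>

// Host-side stand-in for a shuffle-down (hypothetical helper, not Allen code):
// lane i reads the value held by lane i + delta. Lanes whose source would fall
// outside the wavefront keep their own value.
std::vector<float> shfl_down_emulated(const std::vector<float>& lanes, std::size_t delta)
{
  std::vector<float> out(lanes.size());
  for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
    const std::size_t src = lane + delta;
    out[lane] = src < lanes.size() ? lanes[src] : lanes[lane];
  }
  return out;
}

// Tree reduction: halve the offset at each step. After log2(width) steps,
// lane 0 holds the sum of all lanes; the other lanes hold partial results.
// Assumption: lanes.size() is a non-empty power of two.
float wavefront_sum(std::vector<float> lanes)
{
  for (std::size_t delta = lanes.size() / 2; delta > 0; delta /= 2) {
    const std::vector<float> shifted = shfl_down_emulated(lanes, delta);
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
      lanes[lane] += shifted[lane];
    }
  }
  return lanes[0];
}
```

Calling `wavefront_sum` on 64 lanes that each hold 1.0f returns 64.0f after six shuffle steps. A 32-lane warp needs five steps, so a reduction loop written for one width should not be reused unchanged for the other.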

Many thanks to Adil Lashab (AMD) for the help in spotting some of these optimizations and for code change suggestions.
