AMD improvements
Improvements targeting the AMD MI100:
- Use `shfl_down` intrinsics in `pv_beamline_multi_fitter` on AMD hardware.
- Optimized `is_muon` to have better memory efficiency by using a one-dimensional block dimension, iterating over stations first, and caching muon_foi in shared memory.
- Set `launch_bounds` in `LFTripletSeeding`.
- Set default block_dim_y to 128 on UT SearchWindows.
- Set default block_dim to 1024 on VeloConsolidateTracks.
- Use latest ROCm release 4.2.0 (thanks @rschwemm and @bcouturi).
- Use the following launch parameters on AMD hardware:
HSA_NO_SCRATCH_RECLAIM=1 GPU_MAX_HW_QUEUES=8 HIP_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./Allen -f /scratch/allen_data/minbias_mag_down_201907 -n 5000 --events-per-slice 5000 -t 10 -r 1000 -m 3000
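
For readers unfamiliar with two of the techniques listed above (shuffle-down intrinsics and launch bounds), here is a minimal CUDA sketch of how they are typically combined. It is not the Allen code: `warp_sum`, `sum_kernel` and `kBlockDim` are hypothetical names, and the block size of 256 is an arbitrary choice for illustration. The sketch uses the CUDA spelling `__shfl_down_sync`; HIP provides `__shfl_down` without the mask argument, and wavefronts on the MI100 are 64 lanes wide, which is why the loop starts from `warpSize / 2` instead of a hard-coded 16.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size for this sketch, not a value taken from Allen.
constexpr int kBlockDim = 256;

// Tree reduction inside one warp: after the loop, lane 0 holds the sum
// of the values contributed by all lanes of the warp.
__device__ float warp_sum(float v) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    v += __shfl_down_sync(0xffffffffu, v, offset);
  }
  return v;
}

// __launch_bounds__ tells the compiler the kernel is never launched with
// more than kBlockDim threads per block, so it can budget registers for that.
__global__ void __launch_bounds__(kBlockDim)
sum_kernel(const float* in, float* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Out-of-range threads contribute 0 but still take part in the shuffle,
  // so every lane named in the full mask is active.
  float v = (i < n) ? in[i] : 0.0f;
  v = warp_sum(v);
  // One atomic per warp instead of one per thread.
  if (threadIdx.x % warpSize == 0) atomicAdd(out, v);
}

int main() {
  const int n = 1000;
  std::vector<float> host(n, 1.0f);

  float* d_in = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(d_out, 0, sizeof(float));

  const int blocks = (n + kBlockDim - 1) / kBlockDim;
  sum_kernel<<<blocks, kBlockDim>>>(d_in, d_out, n);

  float result = 0.0f;
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("sum = %.1f (expected %d)\n", result, n);

  cudaFree(d_in);
  cudaFree(d_out);
  return result == static_cast<float>(n) ? 0 : 1;
}
```

The shuffle keeps the partial sums in registers, so the reduction needs no shared memory and no block-level synchronization; the launch bound must be at least as large as the block size actually used at launch, otherwise the launch fails.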
Many thanks to Adil Lashab (AMD) for the help in spotting some of these optimizations and for code change suggestions.
Edited by Daniel Hugo Campora Perez