AMD improvements

Daniel Hugo Campora Perez requested to merge dcampora_amd_improvements into master

Improvements for the AMD MI100:

  • Use shfl_down intrinsics in pv_beamline_multi_fitter on AMD hardware.
  • Optimize is_muon for better memory efficiency: use a one-dimensional block, iterate over stations first, and cache muon_foi in shared memory.
  • Set launch_bounds in LFTripletSeeding.
  • Set the default block_dim_y to 128 in UT SearchWindows.
  • Set the default block_dim to 1024 in VeloConsolidateTracks.
  • Use the latest ROCm release, 4.2.0 (thanks @rschwemm and @bcouturi).
  • Use the following launch parameters on AMD hardware: HSA_NO_SCRATCH_RECLAIM=1 GPU_MAX_HW_QUEUES=8 HIP_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./Allen -f /scratch/allen_data/minbias_mag_down_201907 -n 5000 --events-per-slice 5000 -t 10 -r 1000 -m 3000
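
For readers unfamiliar with the techniques named in the list above, here is a minimal, self-contained sketch of three of them: a warp-level reduction with shuffle-down intrinsics, a small lookup table cached in shared memory by a one-dimensional block, and a kernel annotated with launch bounds. It is not the Allen code. The kernel name scaled_sum, the table coeffs, and the constants block_size and n_coeffs are hypothetical and chosen only for illustration. The sketch is written in CUDA; under HIP the intrinsic is spelled __shfl_down (no mask argument), and the wavefront size on the MI100 is 64 rather than 32, which the loop over warpSize accounts for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int block_size = 256; // hypothetical block size, a multiple of the warp size
constexpr int n_coeffs = 8;     // hypothetical size of the small lookup table

// Sum a value across the lanes of one warp; lane 0 ends up with the total.
__device__ float warp_sum(float v) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    v += __shfl_down_sync(0xffffffffu, v, offset);
  }
  return v;
}

// __launch_bounds__ promises the compiler that the kernel is never launched
// with more than block_size threads per block, which guides register allocation.
__global__ void __launch_bounds__(block_size)
scaled_sum(const float* in, const float* coeffs, float* out, int n) {
  // Cache the small table in shared memory once per block.
  __shared__ float cached[n_coeffs];
  if (threadIdx.x < n_coeffs) {
    cached[threadIdx.x] = coeffs[threadIdx.x];
  }
  __syncthreads();

  // Grid-stride loop over a one-dimensional index space.
  float local = 0.f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    local += in[i] * cached[i % n_coeffs];
  }

  // Reduce within the warp through registers, then one atomic per warp.
  local = warp_sum(local);
  if (threadIdx.x % warpSize == 0) {
    atomicAdd(out, local);
  }
}

int main() {
  const int n = 1 << 20;
  float *in, *coeffs, *out;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&coeffs, n_coeffs * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));

  for (int i = 0; i < n; ++i) in[i] = 1.f;
  for (int i = 0; i < n_coeffs; ++i) coeffs[i] = 1.f;
  *out = 0.f;

  scaled_sum<<<64, block_size>>>(in, coeffs, out, n);
  cudaDeviceSynchronize();

  printf("sum = %.0f (n = %d)\n", *out, n);

  cudaFree(in);
  cudaFree(coeffs);
  cudaFree(out);
  return 0;
}
```

The shuffle-based reduction keeps partial sums in registers instead of going through shared memory, and the shared-memory copy of the table avoids repeated global-memory reads of the same few values. Whether either pays off depends on the kernel and the hardware, so the block sizes here should be treated as starting points to tune, not recommendations.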

Many thanks to Adil Lashab (AMD) for the help in spotting some of these optimizations and for code change suggestions.

Edited by Daniel Hugo Campora Perez