
AMD improvements

Daniel Hugo Campora Perez requested to merge dcampora_amd_improvements into master

Improvements for the AMD MI100:

  • Use shfl_down intrinsics in pv_beamline_multi_fitter on AMD hardware.
  • Optimize is_muon for better memory efficiency: use a one-dimensional block dimension, iterate over stations first, and cache muon_foi in shared memory.
  • Set launch_bounds in LFTripletSeeding.
  • Set default block_dim_y to 128 on UT SearchWindows.
  • Set default block_dim to 1024 on VeloConsolidateTracks.
  • Use latest ROCm release 4.2.0 (thanks @rschwemm and @bcouturi).
  • Use the following launch parameters on AMD hardware: HSA_NO_SCRATCH_RECLAIM=1 GPU_MAX_HW_QUEUES=8 HIP_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./Allen -f /scratch/allen_data/minbias_mag_down_201907 -n 5000 --events-per-slice 5000 -t 10 -r 1000 -m 3000
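
For readers unfamiliar with the shfl_down pattern mentioned in the first item, here is a minimal CPU-side sketch of a shuffle-down tree reduction. It is not the code in pv_beamline_multi_fitter. The names shfl_down_emulated, wavefront_reduce_sum and kWavefrontSize are hypothetical, and the width of 64 is an assumption based on the wavefront size of AMD GCN/CDNA GPUs such as the MI100. On a GPU each lane would call the shuffle intrinsic directly; here the lanes are modelled as array elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption for this sketch: 64 lanes, as in an AMD GCN/CDNA wavefront.
constexpr std::size_t kWavefrontSize = 64;
static_assert((kWavefrontSize & (kWavefrontSize - 1)) == 0,
              "the tree reduction below assumes a power-of-two width");

using Lanes = std::array<float, kWavefrontSize>;

// Emulates one shuffle-down step: lane i reads the value held by lane i + delta.
// A lane whose source would lie outside the wavefront keeps its own value,
// which is how the CUDA shuffle-down intrinsic is documented to behave.
Lanes shfl_down_emulated(const Lanes& lanes, std::size_t delta) {
  assert(delta < kWavefrontSize);
  Lanes out{};
  for (std::size_t lane = 0; lane < kWavefrontSize; ++lane) {
    const std::size_t src = lane + delta;
    out[lane] = (src < kWavefrontSize) ? lanes[src] : lanes[lane];
  }
  return out;
}

// Tree reduction: the offset halves every step (32, 16, ..., 1).
// After log2(64) = 6 steps, lane 0 holds the sum of all lanes.
// The upper lanes end up with partial garbage, which is expected:
// only lane 0 is meaningful.
float wavefront_reduce_sum(Lanes lanes) {
  for (std::size_t delta = kWavefrontSize / 2; delta > 0; delta /= 2) {
    const Lanes shifted = shfl_down_emulated(lanes, delta);
    for (std::size_t lane = 0; lane < kWavefrontSize; ++lane) {
      lanes[lane] += shifted[lane];
    }
  }
  return lanes[0];
}
```

The appeal of this pattern on a GPU is that the partial sums move between registers of the same wavefront, so the reduction needs no shared memory and no block-level barrier.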

Many thanks to Adil Lashab (AMD) for helping to spot some of these optimizations and for suggesting code changes.

Edited by Daniel Hugo Campora Perez
