Use shfl warp primitives to speed up PV beamline multi fitter.

This MR uses warp shuffle primitives to speed up the PV beamline multi fitter.

To be clear, this is a CUDA-specific optimization that is only enabled when compiling with CUDA.

  • If compiled for HIP: the property "block_dim_y" can be used to configure how many seeds are processed in parallel.
  • If compiled for CPU: regardless of "block_dim_y", the kernel is executed in a single thread (as with any other kernel).

With CUDA:

  • Parallelization is now done at two levels in the multi fitter:
      • Seeds are processed in parallel (as before), configurable with the property "block_dim_y" (4 by default).
      • Within each seed, a warp of 32 threads iterates over the tracks in parallel. The number of threads in a warp is hard-coded.
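
The two levels could map onto a CUDA thread block roughly as in the sketch below. It is not the multi fitter code: the kernel `label_tracks`, the helper `run_label_tracks` and the `seed_offsets` layout (tracks of seed `s` occupy indices `seed_offsets[s]` to `seed_offsets[s + 1]`) are hypothetical names chosen for illustration. `threadIdx.y` selects the seed and the 32 threads along `threadIdx.x` stride over that seed's tracks.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned warp_size = 32; // hard-coded X dimension, as described above

// Hypothetical kernel: records, for every track, the index of the seed that owns it.
__global__ void label_tracks(const unsigned* seed_offsets, unsigned n_seeds, unsigned* track_seed)
{
  // Level 1: seeds are distributed over threadIdx.y (block_dim_y seeds at a time).
  for (unsigned seed = threadIdx.y; seed < n_seeds; seed += blockDim.y) {
    // Level 2: the 32 threads along X stride over the tracks of this seed.
    for (unsigned track = seed_offsets[seed] + threadIdx.x; track < seed_offsets[seed + 1];
         track += blockDim.x) {
      track_seed[track] = seed;
    }
  }
}

// Hypothetical host helper: launches a single block of 32 x block_dim_y threads.
std::vector<unsigned> run_label_tracks(const std::vector<unsigned>& seed_offsets, unsigned block_dim_y = 4)
{
  const unsigned n_seeds = static_cast<unsigned>(seed_offsets.size()) - 1;
  const unsigned n_tracks = seed_offsets.back();

  unsigned* d_offsets = nullptr;
  unsigned* d_track_seed = nullptr;
  cudaMalloc(&d_offsets, seed_offsets.size() * sizeof(unsigned));
  cudaMalloc(&d_track_seed, n_tracks * sizeof(unsigned));
  cudaMemcpy(d_offsets, seed_offsets.data(), seed_offsets.size() * sizeof(unsigned), cudaMemcpyHostToDevice);

  label_tracks<<<1, dim3(warp_size, block_dim_y)>>>(d_offsets, n_seeds, d_track_seed);

  // cudaMemcpy on the default stream waits for the kernel to finish.
  std::vector<unsigned> track_seed(n_tracks);
  cudaMemcpy(track_seed.data(), d_track_seed, n_tracks * sizeof(unsigned), cudaMemcpyDeviceToHost);

  cudaFree(d_offsets);
  cudaFree(d_track_seed);
  return track_seed;
}
```

With a block shape of 32 x block_dim_y, every row of threads sharing the same `threadIdx.y` is exactly one warp, which is what makes warp-level primitives applicable per seed.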

The number of threads is hard-coded to 32 because CUDA warp-level primitives are used to speed up the code (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). These primitives exchange data between the 32 threads of a single warp, so the code would not work with any other number of threads in block dimension X.
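
As an illustration of the kind of warp-level primitive involved, here is a minimal sketch of a per-seed sum using `__shfl_down_sync`. It is an assumption about the general pattern, not the code in this MR: `warp_sum`, `sum_per_seed`, `run_sum_per_seed` and the `seed_offsets` layout are hypothetical. Each thread accumulates a partial sum over its share of the tracks, then the 32 partial sums are combined in registers without shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned warp_size = 32; // the shuffle reduction below assumes exactly 32 lanes

// Sums `value` over the 32 lanes of a warp; lane 0 ends up holding the total.
__device__ float warp_sum(float value)
{
  for (int offset = warp_size / 2; offset > 0; offset /= 2) {
    value += __shfl_down_sync(0xffffffffu, value, offset);
  }
  return value;
}

// Hypothetical kernel: one warp (one threadIdx.y row) sums the values of one seed.
__global__ void sum_per_seed(const float* values, const unsigned* seed_offsets, unsigned n_seeds, float* sums)
{
  // The loop bound depends only on threadIdx.y, so all 32 lanes of a warp
  // take the same path and reach the shuffle together.
  for (unsigned seed = threadIdx.y; seed < n_seeds; seed += blockDim.y) {
    float partial = 0.f;
    for (unsigned i = seed_offsets[seed] + threadIdx.x; i < seed_offsets[seed + 1]; i += blockDim.x) {
      partial += values[i];
    }
    const float total = warp_sum(partial);
    if (threadIdx.x == 0) sums[seed] = total;
  }
}

// Hypothetical host helper: launches a single block of 32 x block_dim_y threads.
std::vector<float> run_sum_per_seed(
  const std::vector<float>& values,
  const std::vector<unsigned>& seed_offsets,
  unsigned block_dim_y = 4)
{
  const unsigned n_seeds = static_cast<unsigned>(seed_offsets.size()) - 1;

  float* d_values = nullptr;
  unsigned* d_offsets = nullptr;
  float* d_sums = nullptr;
  cudaMalloc(&d_values, values.size() * sizeof(float));
  cudaMalloc(&d_offsets, seed_offsets.size() * sizeof(unsigned));
  cudaMalloc(&d_sums, n_seeds * sizeof(float));
  cudaMemcpy(d_values, values.data(), values.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_offsets, seed_offsets.data(), seed_offsets.size() * sizeof(unsigned), cudaMemcpyHostToDevice);

  sum_per_seed<<<1, dim3(warp_size, block_dim_y)>>>(d_values, d_offsets, n_seeds, d_sums);

  // cudaMemcpy on the default stream waits for the kernel to finish.
  std::vector<float> sums(n_seeds);
  cudaMemcpy(sums.data(), d_sums, n_seeds * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_values);
  cudaFree(d_offsets);
  cudaFree(d_sums);
  return sums;
}
```

If block dimension X were anything other than 32, a row of threads would no longer coincide with one warp, and the shuffle would mix partial sums from different seeds or miss some lanes.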

Note that by default the kernel is now executed with 32 * 4 = 128 threads with the CUDA backend, and with 4 threads with the HIP backend.

On another note, HIP still does not support shuffle primitives; this is an open issue: https://github.com/ROCm-Developer-Tools/HIP/issues/1491. If support is added, the code could be extended to work with HIP as well.

Edited by Daniel Hugo Campora Perez
