Use shfl warp primitives to speed up PV beamline multi fitter.
This MR uses warp shuffle primitives to speed up the PV beamline multi fitter.
This is a CUDA-specific optimization and is only enabled when compiling with CUDA. The other backends behave as before:
- If compiled with HIP: The property "block_dim_y" can be used to configure how many seeds are executed in parallel.
- If compiled with CPU: Regardless of "block_dim_y", the kernel is executed in a single thread (as with any other kernel).
With CUDA:
- Parallelization on the multi fitter is now done at two levels:
  - Seeds are processed in parallel (as before), configurable with property "block_dim_y" (by default 4).
  - Within each seed, a warp of 32 threads iterates over the tracks in parallel. The number of threads in a warp is hard-coded.
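The number of threads is hard-coded to 32 because the code relies on CUDA warp-level primitives to speed it up (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). These primitives exchange values between the threads of a single warp. Since the threads of a block are grouped into warps along dimension X first, each seed maps to exactly one warp only if block dimension X is 32; the code would not work with any other value.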
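To illustrate the layout described above, here is a minimal sketch of a shuffle-based reduction with one warp per seed. It is not the multi fitter code: the kernel, the helper names and the plain sum over per-track values are hypothetical and only stand in for the per-track accumulation. The only CUDA API assumed is `__shfl_down_sync` and the usual runtime calls; error checking is left out for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical constants for this sketch.
constexpr unsigned warp_size = 32;  // block dimension X, must equal the warp size
constexpr unsigned block_dim_y = 4; // seeds per block, mirrors the default of "block_dim_y"

// Sum `value` over the 32 lanes of a warp. After the loop, lane 0 holds the total.
__device__ float warp_sum(float value)
{
  for (unsigned offset = warp_size / 2; offset > 0; offset /= 2) {
    value += __shfl_down_sync(0xffffffffu, value, offset);
  }
  return value;
}

// Block shape is (32, block_dim_y). Threads are linearized X-first, so each
// threadIdx.y row is exactly one warp, and each warp handles one seed.
__global__ void sum_tracks_per_seed_kernel(
  const float* values, unsigned n_seeds, unsigned n_tracks, float* sums)
{
  const unsigned seed = blockIdx.x * blockDim.y + threadIdx.y;
  if (seed >= n_seeds) return; // same decision for all lanes of the warp

  // Each lane accumulates a strided subset of the tracks of this seed.
  float partial = 0.f;
  for (unsigned i = threadIdx.x; i < n_tracks; i += warp_size) {
    partial += values[seed * n_tracks + i];
  }

  // Combine the 32 partial sums without shared memory.
  const float total = warp_sum(partial);
  if (threadIdx.x == 0) sums[seed] = total;
}

// Host helper (hypothetical): values is laid out as [seed][track].
std::vector<float> sum_tracks_per_seed(
  const std::vector<float>& values, unsigned n_seeds, unsigned n_tracks)
{
  assert(values.size() == static_cast<std::size_t>(n_seeds) * n_tracks);
  std::vector<float> sums(n_seeds, 0.f);
  if (n_seeds == 0 || n_tracks == 0) return sums;

  float* d_values = nullptr;
  float* d_sums = nullptr;
  cudaMalloc(&d_values, values.size() * sizeof(float));
  cudaMalloc(&d_sums, n_seeds * sizeof(float));
  cudaMemcpy(d_values, values.data(), values.size() * sizeof(float), cudaMemcpyHostToDevice);

  const dim3 block(warp_size, block_dim_y); // 32 * 4 threads per block
  const dim3 grid((n_seeds + block_dim_y - 1) / block_dim_y);
  sum_tracks_per_seed_kernel<<<grid, block>>>(d_values, n_seeds, n_tracks, d_sums);

  // cudaMemcpy on the default stream waits for the kernel to finish.
  cudaMemcpy(sums.data(), d_sums, n_seeds * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_values);
  cudaFree(d_sums);
  return sums;
}
```

Calling `sum_tracks_per_seed(values, n_seeds, n_tracks)` returns one sum per seed. If block dimension X were anything other than 32, a row of threads would no longer coincide with a warp and the shuffle would mix lanes from different seeds, which is why that dimension is fixed in this sketch.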
Note that by default the kernel is now executed with 32 * 4 = 128 threads per block with the CUDA backend, and with 4 threads per block with the HIP backend.
On another note, HIP still does not support shuffle primitives, and this is an open issue: https://github.com/ROCm-Developer-Tools/HIP/issues/1491. If it becomes supported, we could extend the code to make it work in HIP as well.