Use shfl warp primitives to speed up the PV beamline multi fitter. (!412) · Merge requests · LHCb / Allen

This MR uses warp shuffle primitives to speed up the PV beamline multi fitter.

To be clear, this is a CUDA-specific optimization: it is only enabled when compiling for the CUDA backend.

With HIP: The property "block_dim_y" configures how many seeds are processed in parallel.
With CPU: Regardless of "block_dim_y", the kernel runs in a single thread (as with any other kernel).

With CUDA:

Parallelization in the multi fitter now happens at two levels:
Seeds are processed in parallel (as before); how many is configurable with the property "block_dim_y" (4 by default).
Within each seed, a warp of 32 threads iterates over the tracks in parallel. The number of threads in a warp is hard-coded.

The number of threads is hard-coded to 32 because the code is sped up with CUDA warp-level primitives (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). These primitives exchange register values between the threads of a single warp, so the code only works if block dimension X is exactly one warp (32 threads) wide.
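The two levels described above can be illustrated with a minimal sketch. It is not the code from this MR: the kernel `sum_per_seed`, the host helper `run_sum_per_seed`, and the `weights` layout (one contiguous row of per-track values per seed) are hypothetical names chosen for illustration. The sketch assumes block dimension X is exactly 32, so that each value of `threadIdx.y` maps to one full warp, and reduces per-track values with `__shfl_down_sync`.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned warp_size = 32;           // hard-coded: block dimension X must be one warp
constexpr unsigned full_mask = 0xffffffffu;  // all 32 lanes take part in the shuffle

// Hypothetical kernel (not Allen code): threadIdx.y selects the seed,
// threadIdx.x is the lane that strides over the tracks of that seed.
__global__ void sum_per_seed(const float* weights, int n_tracks, int n_seeds, float* out)
{
  const int seed = blockIdx.x * blockDim.y + threadIdx.y;
  // The seed index is uniform within a warp, so a whole warp exits together here.
  if (seed >= n_seeds) return;

  // Level 2: the 32 lanes of the warp iterate over the tracks in parallel.
  float partial = 0.f;
  for (int t = threadIdx.x; t < n_tracks; t += warp_size) {
    partial += weights[seed * n_tracks + t];
  }

  // Tree reduction in registers: after the loop, lane 0 holds the warp-wide sum.
  for (int offset = warp_size / 2; offset > 0; offset /= 2) {
    partial += __shfl_down_sync(full_mask, partial, offset);
  }

  if (threadIdx.x == 0) out[seed] = partial;
}

// Hypothetical host helper: expects weights.size() == n_seeds * n_tracks.
std::vector<float> run_sum_per_seed(const std::vector<float>& weights, int n_seeds, int n_tracks)
{
  float* d_weights = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_weights, weights.size() * sizeof(float));
  cudaMalloc(&d_out, n_seeds * sizeof(float));
  cudaMemcpy(d_weights, weights.data(), weights.size() * sizeof(float), cudaMemcpyHostToDevice);

  // Level 1: 4 seeds per block (the assumed default of "block_dim_y"), one warp per seed.
  const dim3 block(warp_size, 4);
  const unsigned grid = (n_seeds + block.y - 1) / block.y;
  sum_per_seed<<<grid, block>>>(d_weights, n_tracks, n_seeds, d_out);

  std::vector<float> out(n_seeds);
  cudaMemcpy(out.data(), d_out, n_seeds * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_weights);
  cudaFree(d_out);
  return out;
}
```

With any other width in X, a row of threads sharing one `threadIdx.y` would no longer coincide with a hardware warp, and the shuffle would mix or miss values belonging to different seeds. That is the constraint the hard-coded 32 protects against.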

Note that, by default, the kernel is now launched with 32 * 4 = 128 threads per block with the CUDA backend, and with 4 threads per block with the HIP backend.

On another note, HIP does not yet support shuffle primitives; this is tracked in an open issue: https://github.com/ROCm-Developer-Tools/HIP/issues/1491. If support is added, we could extend the code to work with HIP as well.

Edited Jun 23, 2020 by Daniel Hugo Campora Perez
