Automatic dispatch of prefix_sum on GPU/CPU
In !1408 (merged), the GPU prefix_sum was improved and used in the SciFi decoding to compute the prefix sum over the SiPM counts (4096 × n_events values). There may be other places where the GPU prefix_sum could be faster than the CPU (host) one.
Figure 39-7 of https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda gives a hint of where that threshold might lie.
In practice, the threshold may be even lower once data transfers are taken into account.
From a configuration point of view, the GPU prefix_sum differs from the current host_prefix_sum in that it is not an algorithm but a function that can be called inside any algorithm (see https://gitlab.cern.ch/lhcb/Allen/-/blob/master/device/SciFi/preprocessing/src/SciFiCalculateClusterCount.cu#L160 for an example). In practice, the host_prefix_sum algorithm always runs right after another algorithm, which itself never runs without the prefix_sum. I therefore propose to refactor all algorithms so that the prefix_sum is always called from another algorithm, which would simplify the Python configuration.
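As an illustration, here is a minimal, self-contained sketch of that shape. The kernel, the names (`count_hits_kernel`, `calculate_cluster_count`) and the use of Thrust are all placeholders, not Allen's real Algorithm interface: the point is only that the scan becomes a plain function call at the end of the producing algorithm, so no separate host_prefix_sum algorithm has to be scheduled.

```cuda
#include <cuda_runtime.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>

// Toy stand-in for the real decoding: one hit count per SiPM.
__global__ void count_hits_kernel(const unsigned* raw_sizes, unsigned* counts, unsigned n)
{
  const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) counts[i] = raw_sizes[i] % 8;  // fake "decoding"
}

// The whole algorithm: produce counts, then turn them into offsets in place.
void calculate_cluster_count(
  const unsigned* dev_raw_sizes, unsigned* dev_offsets, unsigned n, cudaStream_t stream)
{
  count_hits_kernel<<<(n + 255) / 256, 256, 0, stream>>>(dev_raw_sizes, dev_offsets, n);
  // The scan is called directly by the algorithm that produced the counts, not
  // scheduled as a separate algorithm. The dispatching entry point sketched
  // below could be slotted in here instead of the direct Thrust call.
  thrust::exclusive_scan(thrust::cuda::par.on(stream), dev_offsets, dev_offsets + n, dev_offsets);
}
```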
I also propose a single prefix_sum function interface that dispatches dynamically to the CPU or GPU implementation based on the size of the array.
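A possible shape for that entry point, as a hedged sketch: the threshold value, the round-trip small-array path, and the use of `std::exclusive_scan` / Thrust are all placeholders for the measured threshold and Allen's own host and device implementations.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <numeric>
#include <vector>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>

// Crossover below which the host scan is expected to win; to be measured.
constexpr std::size_t prefix_sum_threshold = 1 << 14;

// Single entry point: exclusive prefix sum over n device-resident unsigneds.
void prefix_sum(unsigned* dev_data, std::size_t n, cudaStream_t stream)
{
  if (n < prefix_sum_threshold) {
    // Small arrays: round-trip to the host and scan there. These two copies
    // are exactly why the real threshold may be lower than raw scan timings
    // suggest. (Real code would use pinned memory for the staging buffer.)
    std::vector<unsigned> tmp(n);
    cudaMemcpyAsync(tmp.data(), dev_data, n * sizeof(unsigned), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    std::exclusive_scan(tmp.begin(), tmp.end(), tmp.begin(), 0u);
    cudaMemcpyAsync(dev_data, tmp.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice, stream);
  } else {
    // Large arrays: scan in place on the device.
    thrust::exclusive_scan(thrust::cuda::par.on(stream), dev_data, dev_data + n, dev_data);
  }
}
```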
Steps:
- Add a test that benchmarks both the CPU and GPU prefix_sum against the number of elements in the array (a minimal benchmark sketch follows this list).
- Deduce the threshold (it may differ depending on the hardware).
- Refactor all algorithms to use a single prefix_sum function entry point.
- Make sure the GPU version can run on arrays of arbitrary size.
- Add an automatic dispatch based on the array size and the deduced threshold.
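A minimal micro-benchmark along those lines, as a sketch: Thrust and `std::exclusive_scan` stand in for Allen's actual GPU and host implementations, and single-shot timings would need repetitions and averaging in a real test.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <numeric>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

int main()
{
  // One untimed scan to absorb CUDA context creation and warm-up effects.
  {
    thrust::device_vector<unsigned> warm(1024, 1);
    thrust::exclusive_scan(thrust::device, warm.begin(), warm.end(), warm.begin());
  }

  for (std::size_t n = 1 << 8; n <= (1 << 24); n <<= 2) {
    // CPU (host) exclusive scan.
    std::vector<unsigned> h(n, 1);
    const auto t0 = std::chrono::steady_clock::now();
    std::exclusive_scan(h.begin(), h.end(), h.begin(), 0u);
    const auto t1 = std::chrono::steady_clock::now();
    const double cpu_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    // GPU exclusive scan on device-resident data; include the transfer cost
    // here if the dispatched call is expected to pay it.
    thrust::device_vector<unsigned> d(n, 1);
    cudaEvent_t e0, e1;
    cudaEventCreate(&e0);
    cudaEventCreate(&e1);
    cudaEventRecord(e0);
    thrust::exclusive_scan(thrust::device, d.begin(), d.end(), d.begin());
    cudaEventRecord(e1);
    cudaEventSynchronize(e1);
    float gpu_ms = 0.f;
    cudaEventElapsedTime(&gpu_ms, e0, e1);

    std::printf("n = %10zu  cpu = %10.1f us  gpu = %10.1f us\n", n, cpu_us, 1000.0 * gpu_ms);
    cudaEventDestroy(e0);
    cudaEventDestroy(e1);
  }
  return 0;
}
```

The smallest n at which the GPU column beats the CPU column is the threshold to feed into the dispatch.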