Improvements to prefix sum and sorting algorithms (!1509) · Merge requests · LHCb / Allen

Introduces a new test and benchmark to compare different implementations of prefix_sum:

cpu1: default implementation of host_prefix_sum
cuda1: Blelloch's scan implementation using 1 element per thread
cuda2: Blelloch's scan implementation using 4 elements per thread
cuda3: Blelloch's scan implementation using a single kernel that slides over the array
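
For readers unfamiliar with Blelloch's scan, here is a minimal host-side C++ sketch of its two phases (up-sweep and down-sweep), checked against a plain sequential scan in the same spirit as the new test. It is an illustration only, not the code in this merge request: the function names are hypothetical, the scan is assumed to be exclusive, and the inner loops that a CUDA kernel would distribute across threads are run sequentially here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential exclusive prefix sum, used as the reference result.
// (Hypothetical stand-in; not the actual host_prefix_sum from Allen.)
std::vector<std::uint32_t> sequential_exclusive_scan(const std::vector<std::uint32_t>& input)
{
  std::vector<std::uint32_t> output(input.size());
  std::uint32_t running = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    output[i] = running;
    running += input[i];
  }
  return output;
}

// Blelloch (work-efficient) exclusive scan, written sequentially.
// On a GPU, each iteration of the inner loops would be handled by one thread,
// with a barrier between successive strides.
std::vector<std::uint32_t> blelloch_exclusive_scan(const std::vector<std::uint32_t>& input)
{
  const std::size_t n = input.size();
  if (n == 0) return {};

  // Pad to the next power of two so the implicit binary tree is complete.
  std::size_t padded = 1;
  while (padded < n) padded *= 2;
  std::vector<std::uint32_t> data(padded, 0);
  for (std::size_t i = 0; i < n; ++i) data[i] = input[i];

  // Up-sweep (reduce): build partial sums up the tree.
  for (std::size_t stride = 1; stride < padded; stride *= 2) {
    for (std::size_t i = 2 * stride - 1; i < padded; i += 2 * stride) {
      data[i] += data[i - stride];
    }
  }

  // Down-sweep: clear the root, then push prefix sums back down the tree.
  data[padded - 1] = 0;
  for (std::size_t stride = padded / 2; stride >= 1; stride /= 2) {
    for (std::size_t i = 2 * stride - 1; i < padded; i += 2 * stride) {
      const std::uint32_t left = data[i - stride];
      data[i - stride] = data[i];
      data[i] += left;
    }
  }

  data.resize(n); // Drop the padding.
  return data;
}

// Hypothetical test helper: the scan under test must match the reference.
void check_scan(const std::vector<std::uint32_t>& input)
{
  assert(blelloch_exclusive_scan(input) == sequential_exclusive_scan(input));
}
```

Calling `check_scan` on inputs whose length is not a power of two exercises the padding path, which is where an implementation of this kind is most likely to go wrong.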

Implements a new sorting algorithm.

Implements a new velo clustering algorithm.
