Improvements to prefix sum and sorting algorithms
Introduce a new test and benchmark to compare different implementations of prefix_sum:
- cpu1: default implementation of host_prefix_sum
- cuda1: Blelloch's scan implementation using 1 element per thread
- cuda2: Blelloch's scan implementation using 4 elements per thread
- cuda3: Blelloch's scan implementation using a single kernel, sliding over the array
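
For readers unfamiliar with Blelloch's scan, the following is a minimal host-side sketch of its two phases (up-sweep and down-sweep), next to a plain serial scan. It is not the code from this merge request: the function names `serial_exclusive_scan` and `blelloch_exclusive_scan` are hypothetical, the scan is assumed to be exclusive, and the inner loops run sequentially here whereas a CUDA kernel would assign each iteration to a thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical serial reference (assumed exclusive): out[i] = sum of in[0..i-1].
std::vector<uint32_t> serial_exclusive_scan(const std::vector<uint32_t>& in) {
  std::vector<uint32_t> out(in.size());
  uint32_t running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;
    running += in[i];
  }
  return out;
}

// Hypothetical host-side emulation of Blelloch's work-efficient exclusive scan.
// Each inner loop iteration is independent of the others at the same stride,
// which is what allows a GPU to run them in parallel.
std::vector<uint32_t> blelloch_exclusive_scan(const std::vector<uint32_t>& in) {
  if (in.empty()) return {};

  // Pad with zeros to the next power of two so the tree is balanced.
  std::size_t n = 1;
  while (n < in.size()) n <<= 1;
  std::vector<uint32_t> data(n, 0);
  for (std::size_t i = 0; i < in.size(); ++i) data[i] = in[i];

  // Up-sweep (reduce): build partial sums up the tree.
  for (std::size_t stride = 1; stride < n; stride <<= 1) {
    for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
      data[i] += data[i - stride];
    }
  }

  // Down-sweep: clear the root, then push prefixes back down the tree.
  data[n - 1] = 0;
  for (std::size_t stride = n >> 1; stride >= 1; stride >>= 1) {
    for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
      const uint32_t left = data[i - stride];
      data[i - stride] = data[i];
      data[i] += left;
    }
  }

  data.resize(in.size());  // Drop the padding.
  return data;
}
```

A test in the spirit of the one described above would run both functions on the same input and check that the results match element by element.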
Closes #500
Implements a new sorting algorithm.
Implements a new velo clustering algorithm.
Edited by Arthur Marius Hennequin