Draft: Improvements to prefix sum and sorting algorithms
Introduce a new test and a benchmark that compare different implementations of prefix_sum:
- cpu1: default implementation of host_prefix_sum
- cuda1: Blelloch's scan implementation using 1 element per thread
- cuda2: Blelloch's scan implementation using 4 elements per thread
- cuda3: Blelloch's scan implementation using a single kernel that slides over the array
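
For reviewers unfamiliar with Blelloch's scan, here is a minimal host-side sketch of its two phases (up-sweep and down-sweep), run sequentially in plain C++. It is not the code in this change: the names `reference_exclusive_scan`, `blelloch_exclusive_scan` and `check_scans_agree` are hypothetical, and it assumes an exclusive scan over `int`, which may differ from what host_prefix_sum actually computes. In the CUDA variants, each inner loop iteration would roughly correspond to the work of one thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference (stand-in for a host implementation).
// Assumption: exclusive scan, i.e. out[i] = in[0] + ... + in[i-1].
std::vector<int> reference_exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Blelloch exclusive scan, executed sequentially on the host.
// The input is zero-padded to the next power of two, as the tree
// traversal below requires.
std::vector<int> blelloch_exclusive_scan(const std::vector<int>& in) {
    if (in.empty()) return {};

    std::size_t n = 1;
    while (n < in.size()) n <<= 1;

    std::vector<int> buf(n, 0);
    std::copy(in.begin(), in.end(), buf.begin());

    // Up-sweep (reduce): build partial sums up the implicit binary tree.
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            buf[i] += buf[i - stride];
        }
    }

    // Down-sweep: clear the root, then push prefixes back down the tree.
    buf[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride >>= 1) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = buf[i - stride];
            buf[i - stride] = buf[i];  // left child receives the parent's prefix
            buf[i] += left;            // right child receives prefix + left subtree sum
        }
    }

    buf.resize(in.size());  // drop the padding
    return buf;
}

// Hypothetical check in the spirit of the new test: both scans must agree.
void check_scans_agree(const std::vector<int>& in) {
    assert(blelloch_exclusive_scan(in) == reference_exclusive_scan(in));
}
```

Calling `check_scans_agree` on inputs whose sizes are not powers of two exercises the padding path, which is the case most likely to diverge between implementations.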
TODO: use the GPU prefix sum everywhere instead of the host_prefix_sum algorithm
Closes #500