Draft: Improvements to prefix sum and sorting algorithms

Arthur Marius Hennequin requested to merge ahennequ_scan into 2024-patches

Introduce a new test and a benchmark to compare different prefix sum implementations:

image

  • cpu1: default implementation of host_prefix_sum
  • cuda1: Blelloch's scan implementation using 1 element per thread
  • cuda2: Blelloch's scan implementation using 4 elements per thread
  • cuda3: Blelloch's scan implementation using a single kernel, sliding over the array
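
For readers unfamiliar with Blelloch's scan, here is a minimal host-side sketch of its two phases (up-sweep, then down-sweep), written sequentially in C++. It is not the code from this merge request: the function name blelloch_exclusive_scan is hypothetical, the input size is assumed to be a power of two, and the result is assumed to be an exclusive prefix sum. In a CUDA version, each inner loop would roughly correspond to one parallel step across threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of the merge request):
// sequential version of Blelloch's work-efficient exclusive scan.
// Assumption: data.size() is zero or a power of two.
std::vector<unsigned> blelloch_exclusive_scan(std::vector<unsigned> data)
{
  const std::size_t n = data.size();
  if (n == 0) return data;
  assert((n & (n - 1)) == 0 && "size must be a power of two");

  // Up-sweep (reduce): accumulate partial sums in place, doubling the stride.
  // After this phase, data[n - 1] holds the total sum.
  for (std::size_t stride = 1; stride < n; stride *= 2) {
    for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
      data[i] += data[i - stride];
    }
  }

  // Down-sweep: clear the root, then push prefixes back down the tree,
  // halving the stride at each level.
  data[n - 1] = 0;
  for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
    for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
      const unsigned left = data[i - stride];
      data[i - stride] = data[i]; // left child receives the parent's prefix
      data[i] += left;            // right child receives prefix + left subtree sum
    }
  }
  return data;
}
```

A sequential reference like this can serve as a correctness baseline when comparing GPU variants, since every implementation should produce the same output for the same input.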

TODO: use the GPU prefix sum everywhere instead of the host_prefix_sum algorithm

Closes #500

FYI @gligorov @raaij @dovombru @cagapopo

Edited by Arthur Marius Hennequin
