Perform parameter scan of __launch_bounds__ and number of threads for every algorithm
CUDA allows __launch_bounds__ to be defined individually for each algorithm. Because the bounds constrain how many registers the compiler may assign per thread, this may have an impact on the performance of the application.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds
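As a rough sketch of what a scannable launch bound could look like, the example below takes the bounds from preprocessor macros so that each point of the scan is a separate build. The kernel `saxpy`, the macros `SAXPY_MAX_THREADS` and `SAXPY_MIN_BLOCKS`, and the problem size are all hypothetical and only serve as an illustration; they are not taken from this code base.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical macros: override at build time to scan over launch bounds.
#ifndef SAXPY_MAX_THREADS
#define SAXPY_MAX_THREADS 256
#endif
#ifndef SAXPY_MIN_BLOCKS
#define SAXPY_MIN_BLOCKS 1
#endif

// Hypothetical kernel. The launch bounds tell the compiler the largest block
// size it must support and the desired minimum number of resident blocks per
// multiprocessor, which it uses to decide on register usage.
__global__ void __launch_bounds__(SAXPY_MAX_THREADS, SAXPY_MIN_BLOCKS)
saxpy(const float a, const float* x, float* y, const int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
  const int n = 1 << 20;
  float* x = nullptr;
  float* y = nullptr;
  if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess) return 1;
  if (cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) return 1;
  cudaMemset(x, 0, n * sizeof(float));
  cudaMemset(y, 0, n * sizeof(float));

  // The block size must not exceed the declared maximum,
  // otherwise the launch fails.
  const int threads = SAXPY_MAX_THREADS;
  const int blocks = (n + threads - 1) / threads;
  saxpy<<<blocks, threads>>>(2.0f, x, y, n);

  const cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    std::printf("launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  cudaDeviceSynchronize();

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

Assuming the file is saved as `saxpy.cu`, one point of the scan could be built with `nvcc -DSAXPY_MAX_THREADS=128 -DSAXPY_MIN_BLOCKS=4 -Xptxas -v saxpy.cu`, where `-Xptxas -v` prints the resulting register usage per kernel.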
Similarly, a proper parameter scan should be done over the number of threads per block of every kernel.
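A scan over the block size could be sketched along the following lines, timing a kernel with CUDA events for each candidate number of threads per block. The kernel `scale`, the problem size, the candidate block sizes, and the repetition count are assumptions chosen for illustration, not values from this code base.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one of the algorithms.
__global__ void scale(float* data, const float factor, const int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 24;
  const int repetitions = 10;

  float* data = nullptr;
  if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) return 1;
  cudaMemset(data, 0, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Scan over powers of two from one warp up to 1024 threads per block.
  for (int threads = 32; threads <= 1024; threads *= 2) {
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch, excluded from the measurement.
    scale<<<blocks, threads>>>(data, 1.0f, n);

    cudaEventRecord(start);
    for (int rep = 0; rep < repetitions; ++rep) {
      scale<<<blocks, threads>>>(data, 1.0f, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    if (cudaGetLastError() != cudaSuccess) {
      std::printf("threads per block = %4d: launch failed\n", threads);
      continue;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("threads per block = %4d: %.3f ms per launch\n",
                threads, ms / repetitions);
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(data);
  return 0;
}
```

For a kernel that also declares __launch_bounds__, the scanned block sizes would have to stay at or below the declared maximum, so the two scans are best done together.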
Edited by Daniel Hugo Campora Perez