Perform a parameter scan of __launch_bounds__ and number of threads for every algorithm

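CUDA allows __launch_bounds__ to be defined individually for each algorithm. The bounds tell the compiler the maximum number of threads per block a kernel will be launched with, which it uses to decide how many registers each thread may use. This may have an impact on the performance of the application.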
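For reference, a minimal sketch of how the qualifier is attached to a kernel. The kernel `scale`, the bound values 256 and 2, and the problem size are hypothetical and chosen only for illustration; they are not taken from the application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, not one of the application's algorithms.
// __launch_bounds__(256, 2) promises that the kernel is never launched with
// more than 256 threads per block and asks the compiler to aim for at least
// 2 resident blocks per multiprocessor when it allocates registers.
__global__ void __launch_bounds__(256, 2)
scale(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));
  cudaMemset(d, 0, n * sizeof(float));

  // The block size must not exceed the first launch bound, otherwise the
  // launch fails.
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  scale<<<blocks, threads>>>(d, 2.0f, n);

  // Check both the launch itself and the kernel execution.
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) err = cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(err));

  cudaFree(d);
  return err == cudaSuccess ? 0 : 1;
}
```

The second argument is optional; omitting it leaves only the threads-per-block limit in place.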

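Similarly, a proper parameter scan should be done over the number of threads per block of every kernel. The two scans are coupled, since a kernel cannot be launched with more threads per block than its __launch_bounds__ allow.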
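One possible shape for such a scan, as a rough sketch only: the launch bound is made a template parameter so that each scanned block size is compiled with a matching bound, and each configuration is timed with CUDA events. The kernel `saxpy`, the helper `time_config`, the scanned values and the repetition count are all assumptions made for illustration; a real scan would run over the application's own algorithms and workloads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel. The launch bound is a template parameter,
// so every scanned block size gets its own compiled variant.
template <int Threads>
__global__ void __launch_bounds__(Threads)
saxpy(const float* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

// Times one configuration with CUDA events and returns the average
// time per launch in milliseconds.
template <int Threads>
float time_config(const float* x, float* y, int n, int repetitions) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const int blocks = (n + Threads - 1) / Threads;
  saxpy<Threads><<<blocks, Threads>>>(x, y, 1.0f, n);  // warm-up launch

  cudaEventRecord(start);
  for (int r = 0; r < repetitions; ++r) {
    saxpy<Threads><<<blocks, Threads>>>(x, y, 1.0f, n);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / repetitions;
}

int main() {
  const int n = 1 << 22;
  const int repetitions = 100;
  float* x = nullptr;
  float* y = nullptr;
  cudaMalloc(&x, n * sizeof(float));
  cudaMalloc(&y, n * sizeof(float));
  cudaMemset(x, 0, n * sizeof(float));
  cudaMemset(y, 0, n * sizeof(float));

  // Scan a few block sizes; the list is arbitrary.
  printf("%4d threads: %.4f ms\n", 32, time_config<32>(x, y, n, repetitions));
  printf("%4d threads: %.4f ms\n", 64, time_config<64>(x, y, n, repetitions));
  printf("%4d threads: %.4f ms\n", 128, time_config<128>(x, y, n, repetitions));
  printf("%4d threads: %.4f ms\n", 256, time_config<256>(x, y, n, repetitions));
  printf("%4d threads: %.4f ms\n", 512, time_config<512>(x, y, n, repetitions));
  printf("%4d threads: %.4f ms\n", 1024, time_config<1024>(x, y, n, repetitions));

  cudaError_t err = cudaDeviceSynchronize();
  cudaFree(x);
  cudaFree(y);
  return err == cudaSuccess ? 0 : 1;
}
```

Tying the bound to the block size keeps the scan one-dimensional; scanning the two independently (with the bound at least as large as the block size) would cover more of the parameter space at the cost of more compiled variants.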