Delayed selections
This MR delays the execution of selection algorithms to improve their performance and scalability. The existing configurability is preserved.
- Selections are executed in the posterior Gather Selections algorithm, in a single kernel.
- Performance drastically improves. Up to 100 lines have been tested, with a performance impact of about 6% with respect to 1 line.
- All selection algorithm initializations have been moved to the kernel execution.
- All selection algorithm copies have been moved to GatherSelections.
- Separable compilation is now supported and enabled by default.
- An option to compile with / without separable compilation has been added. If separable compilation is disabled, a custom "unity" build is instantiated which joins all source files of the selections library.
- HIP does not support separable compilation at the moment, and hence must be compiled with separable compilation disabled (set at configuration time automatically).
- Code-generation of a new file `ExternLines.cuh` is also necessary (unfortunately) to allow invoking a function defined in a separate compilation unit (see https://forums.developer.nvidia.com/t/consistency-of-functions-pointer/29325/6).
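
The delayed-execution idea described above can be illustrated with a minimal host-side C++ sketch. It is not the code in this MR: `Event`, `Line`, `DelayedSelections` and `decision` are hypothetical names, and the real selections run on device buffers inside a single kernel rather than in a CPU loop. The sketch only shows the pattern of registering lines first and evaluating all of them later in one gather pass.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical event record (assumption); the real lines read device buffers.
struct Event {
  float pt;
  int n_tracks;
};

// A selection "line" is modelled here as a predicate over one event.
using Line = std::function<bool(const Event&)>;

// Collects lines without running them; execution is deferred to gather().
class DelayedSelections {
public:
  void add_line(Line line) { m_lines.push_back(std::move(line)); }

  std::size_t size() const { return m_lines.size(); }

  // Evaluates every registered line on every event in a single pass.
  // Decisions are stored flat, indexed as [line * n_events + event].
  std::vector<bool> gather(const std::vector<Event>& events) const {
    std::vector<bool> decisions(m_lines.size() * events.size(), false);
    for (std::size_t l = 0; l < m_lines.size(); ++l) {
      for (std::size_t e = 0; e < events.size(); ++e) {
        decisions[l * events.size() + e] = m_lines[l](events[e]);
      }
    }
    return decisions;
  }

private:
  std::vector<Line> m_lines;
};

// Reads one decision from the flat matrix returned by gather().
bool decision(const std::vector<bool>& decisions, std::size_t n_events,
              std::size_t line, std::size_t event) {
  assert(event < n_events);
  assert(line * n_events + event < decisions.size());
  return decisions[line * n_events + event];
}
```

In this sketch, adding a line only stores a callable, so the cost of one more line is one more iteration of the outer loop in `gather` rather than a separate launch.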
Done in collaboration with @ahennequ.
TODO:
- Optimize performance
- CPU compatibility
- Manage lifetime of objects used in selections
- Bring back monitoring functionality
- HIP build
- HIP runs
Performance of hlt1_pp_default:
Device-averaged speedup (% change): 1.0685950709991698 (6.859507099916984%)
NVIDIA RTX A5000 speedup (% change): 1.040242194593474 (4.024219459347389%)
NVIDIA RTX A6000 speedup (% change): 1.1261653368244064 (12.616533682440645%)
AMD EPYC 7502 32-Core speedup (% change): 0.9943645196516324 (-0.5635480348367583%)
NVIDIA GeForce RTX 2080 Ti speedup (% change): 1.0576478278517496 (5.76478278517496%)
NVIDIA GeForce RTX 3090 speedup (% change): 1.1245554760745868 (12.455547607458684%)
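
The figures above appear consistent with two simple formulas: the percent change is (speedup - 1) × 100, and the device-averaged speedup is the arithmetic mean of the five per-device speedups. The short C++ sketch below reproduces that arithmetic. The function names are hypothetical, and the averaging rule is inferred from the numbers rather than stated in the report.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Percent change corresponding to a speedup factor, e.g. 1.05 -> 5 (%).
double percent_change(double speedup) { return (speedup - 1.0) * 100.0; }

// Arithmetic mean of per-device speedups (assumed averaging rule).
double device_averaged_speedup(const std::vector<double>& speedups) {
  assert(!speedups.empty());
  const double sum = std::accumulate(speedups.begin(), speedups.end(), 0.0);
  return sum / static_cast<double>(speedups.size());
}

// Per-device speedups as listed in the report, in the same order.
std::vector<double> reported_speedups() {
  return {1.040242194593474, 1.1261653368244064, 0.9943645196516324,
          1.0576478278517496, 1.1245554760745868};
}
```

Calling `device_averaged_speedup(reported_speedups())` gives a value that agrees with the reported device-averaged speedup to within floating-point rounding.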
Edited by Daniel Hugo Campora Perez