Delayed selections (!846) · Merge requests · LHCb / Allen

This MR implements delayed execution of selection algorithms to improve their performance and scalability. Configurability is unchanged.

Selections are now executed later, in the Gather Selections algorithm, in a single kernel.
Performance improves drastically: configurations with up to 100 lines have been tested, with a performance impact of about 6% relative to a single line.
All selection algorithm initializations have been moved to the kernel execution.
All selection algorithm copies have been moved to GatherSelections.
Separable compilation is now supported and enabled by default.
An option to compile with or without separable compilation has been added. If separable compilation is disabled, a custom "unity" build is used instead, which joins all source files of the selections library.
HIP does not support separable compilation at the moment, so HIP builds must disable it (this is set automatically at configuration time).
Code generation of a new file, ExternLines.cuh, is also (unfortunately) necessary to allow invoking a function defined in a separate compilation unit (see https://forums.developer.nvidia.com/t/consistency-of-functions-pointer/29325/6).
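
To illustrate the idea of delaying selections, here is a minimal host-side C++ sketch. It is not Allen code: `Event`, `LineFn`, the two example lines and `gather_selections` are all hypothetical names, and a plain loop stands in for the single GPU kernel. Each line is only registered as a function pointer, and one gather step later evaluates every line on every event.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event record; not Allen's real event model.
struct Event {
  float pt;
  float ip_chi2;
};

// A selection line reduced to a function pointer, so that it can be
// stored now and invoked later.
using LineFn = bool (*)(const Event&);

// Two illustrative lines with made-up thresholds.
bool high_pt_line(const Event& e) { return e.pt > 2.0f; }
bool displaced_line(const Event& e) { return e.ip_chi2 > 9.0f; }

// Stand-in for the single gather step: evaluates all registered lines
// on all events in one pass. The decision for (line, event) is stored
// at index line * events.size() + event.
std::vector<bool> gather_selections(const std::vector<LineFn>& lines,
                                    const std::vector<Event>& events) {
  std::vector<bool> decisions(lines.size() * events.size());
  for (std::size_t l = 0; l < lines.size(); ++l) {
    assert(lines[l] != nullptr);
    for (std::size_t e = 0; e < events.size(); ++e) {
      decisions[l * events.size() + e] = lines[l](events[e]);
    }
  }
  return decisions;
}
```

In this sketch, adding a line only appends one more function pointer to `lines`; the number of gather passes stays at one regardless of how many lines are configured.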

Done in collaboration with @ahennequ.

TODO:

Performance of `hlt1_pp_default`:

Device-averaged speedup: 1.0685950709991698
               % change: 6.859507099916984
NVIDIA RTX A5000  speedup (% change): 1.040242194593474 (4.024219459347389%)
NVIDIA RTX A6000  speedup (% change): 1.1261653368244064 (12.616533682440645%)
AMD EPYC 7502 32-Core  speedup (% change): 0.9943645196516324 (-0.5635480348367583%)
NVIDIA GeForce RTX 2080 Ti  speedup (% change): 1.0576478278517496 (5.76478278517496%)
NVIDIA GeForce RTX 3090  speedup (% change): 1.1245554760745868 (12.455547607458684%)
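
The summary figures above are consistent with the device-averaged speedup being the arithmetic mean of the five per-device speedups, and with % change being (speedup - 1) × 100. A small C++ sketch of that arithmetic follows; the function names are illustrative and are not taken from Allen's benchmarking tooling.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of per-device speedups (new throughput / reference throughput).
double mean_speedup(const std::vector<double>& speedups) {
  assert(!speedups.empty());
  const double sum = std::accumulate(speedups.begin(), speedups.end(), 0.0);
  return sum / static_cast<double>(speedups.size());
}

// Converts a speedup factor to a percentage change, e.g. 1.04 -> about 4%.
double percent_change(double speedup) { return (speedup - 1.0) * 100.0; }
```

Feeding in the five per-device values listed above reproduces the reported average of about 1.0686 (6.86%).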

Edited Apr 27, 2022 by Daniel Hugo Campora Perez
