Draft: Use CUDA's dynamic parallelism for dispatching line functions (!1458) · Merge requests · LHCb / Allen

Arthur Marius Hennequin requested to merge ahennequ_lines2 into 2024-patches Feb 26, 2024

Closes #503

Throughput is unchanged compared to !1456 (merged), or possibly slightly higher:

NVIDIA GeForce RTX 3090    │█████████████████████████████████████      124.64 kHz
NVIDIA RTX A5000           │█████████████████████████████              99.16 kHz
NVIDIA GeForce RTX 2080 Ti │████████████████████████                   82.53 kHz
AMD EPYC 7502 32-Core      │███                                        10.67 kHz
                           ┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
                           0     20    40    60    80   100   120   140

TODO:

Try launching the line kernel with a block size tailored to the number of objects and the event list.
Try JIT compiling, loading, and running lines using NVRTC: https://docs.nvidia.com/cuda/archive/10.1/pdf/NVRTC_User_Guide.pdf
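
For the block-size item, a minimal sketch of how a launch configuration could be derived from the event list and the object count. Everything here is an assumption made for illustration, not code from this merge request: the names `LaunchConfig` and `choose_launch_config`, the mapping of one block per selected event and one thread per object, and the default cap of 256 threads. It is written as plain C++ so it can be checked on the host; for use inside a parent kernel it would need to be marked `__host__ __device__`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical launch configuration; not a type from Allen.
struct LaunchConfig {
  unsigned grid_size;  // number of blocks (assumed: one per selected event)
  unsigned block_size; // threads per block (assumed: one per object)
};

// Choose a block size that is a multiple of the warp size and just large
// enough to cover the largest per-event object count, clamped to
// [warp_size, max_block_size].
inline LaunchConfig choose_launch_config(
  unsigned n_selected_events,
  unsigned max_objects_per_event,
  unsigned max_block_size = 256)
{
  const unsigned warp_size = 32;
  // The cap itself must be a non-zero multiple of the warp size.
  assert(max_block_size >= warp_size && max_block_size % warp_size == 0);

  // Round the object count up to the next multiple of the warp size.
  unsigned block_size = ((max_objects_per_event + warp_size - 1) / warp_size) * warp_size;

  // Never launch fewer than one warp or more than the cap.
  block_size = std::min(std::max(block_size, warp_size), max_block_size);

  return LaunchConfig {n_selected_events, block_size};
}
```

For example, `choose_launch_config(500, 70)` gives a grid of 500 blocks with 96 threads each. Threads beyond the actual object count of an event would still need a bounds check or a strided loop in the kernel, and the caller should skip the launch entirely when the event list is empty, since a grid size of zero is an invalid launch configuration.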

Edited Feb 26, 2024 by Arthur Marius Hennequin
