Use CUDA's dynamic parallelism for dispatching line functions
In !1456 (merged) the run_lines kernel was optimized and its parallelization scheme was changed: it now launches one block per line and uses threads to parallelize over events and objects.
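For reference, a minimal sketch of that scheme (hypothetical signature and decision layout, not the actual run_lines code):

```cpp
// Illustrative only: one block per line, threads stride over events.
__global__ void run_lines(const float* input, bool* decisions, int n_events)
{
  const int line = blockIdx.x; // one block per line
  // Threads of the block stride over the events (and, within each event,
  // over its objects); the per-line code is selected by the generated
  // invoke_line_functions dispatch (elided here).
  for (int event = threadIdx.x; event < n_events; event += blockDim.x) {
    decisions[line * n_events + event] = false; // placeholder decision
  }
}
```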
This offers an opportunity to retire the big dispatch function invoke_line_functions generated in https://gitlab.cern.ch/lhcb/Allen/-/blob/master/configuration/parser/ParseAlgorithms.py#L711 and to go back to having one kernel per line.
The kernels would still have to run in parallel so as not to lose any throughput. One way to achieve this is CUDA's dynamic parallelism (i.e. launching kernels from within another kernel): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-setup-apis
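A minimal sketch of what such a dispatch could look like; line_a, line_b and dispatch_lines are invented names, not Allen code, and the whole thing assumes relocatable device code (nvcc -rdc=true ... -lcudadevrt) on compute capability >= 3.5:

```cpp
__global__ void line_a(const float* input, bool* decisions, int n_events)
{
  const int event = blockIdx.x * blockDim.x + threadIdx.x;
  if (event < n_events) decisions[event] = input[event] > 0.f; // placeholder selection
}

__global__ void line_b(const float* input, bool* decisions, int n_events)
{
  const int event = blockIdx.x * blockDim.x + threadIdx.x;
  if (event < n_events) decisions[event] = input[event] < 1.f; // placeholder selection
}

using line_kernel_t = void (*)(const float*, bool*, int);

// Parent kernel: one thread per line launches that line's kernel into its
// own non-blocking stream, so the children can run concurrently.
__global__ void dispatch_lines(const float* input, bool* decisions, int n_events)
{
  // Taking the address of a __global__ function is valid in device code
  // on compute capability >= 3.5, so the per-line kernels can live in a
  // plain table instead of a generated switch.
  const line_kernel_t lines[] = {line_a, line_b};
  constexpr int n_lines = sizeof(lines) / sizeof(lines[0]);

  const int line = threadIdx.x;
  if (line >= n_lines) return;

  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  const int block_dim = 256;
  const int grid_dim = (n_events + block_dim - 1) / block_dim;
  lines[line]<<<grid_dim, block_dim, 0, stream>>>(
    input, decisions + line * n_events, n_events);

  cudaStreamDestroy(stream);
  // A parent grid is not considered complete until all of its child grids
  // have completed, so no explicit synchronization is needed here.
}
```

The host would then launch dispatch_lines with a single block of at least n_lines threads, e.g. dispatch_lines<<<1, 32>>>(d_input, d_decisions, n_events).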
This would also enable dynamic loading of modules containing new lines, which would bring us one step closer to code generation for lines (@acasaisv @thboettc FYI).
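On the dynamic-loading point, a hedged host-side sketch with the CUDA driver API; new_line.cubin and new_line are invented names, and the kernel in the module would need extern "C" linkage (or one would look up the mangled name):

```cpp
#include <cuda.h>

// Load a compiled module at runtime and launch one of its kernels.
// Error handling omitted: every cu* call returns a CUresult that real
// code should check.
void run_dynamically_loaded_line(CUdeviceptr d_decisions, int n_events)
{
  CUmodule module;
  CUfunction kernel;
  cuModuleLoad(&module, "new_line.cubin");          // invented file name
  cuModuleGetFunction(&kernel, module, "new_line"); // invented kernel name

  void* args[] = {&d_decisions, &n_events};
  const unsigned block_dim = 256;
  const unsigned grid_dim = (n_events + block_dim - 1) / block_dim;
  cuLaunchKernel(kernel, grid_dim, 1, 1, block_dim, 1, 1, 0, nullptr, args, nullptr);

  cuCtxSynchronize();
  cuModuleUnload(module);
}
```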