Use CUDA's dynamic parallelism for dispatching line functions
In !1456 (merged) the run_lines kernel was optimized and its parallelization scheme was changed: it now launches one block per line and uses threads to parallelize over events and objects.
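For reference, a minimal sketch of that scheme (hypothetical signature and decision layout, not the actual run_lines code):

```cpp
// Illustrative only: one block per line, threads stride over events.
__global__ void run_lines(const float* input, bool* decisions, int n_events)
{
  const int line = blockIdx.x; // one block per line
  // Threads of the block stride over the events (and, within each event,
  // over its objects); the per-line code is selected by the generated
  // invoke_line_functions dispatch (elided here).
  for (int event = threadIdx.x; event < n_events; event += blockDim.x) {
    decisions[line * n_events + event] = false; // placeholder decision
  }
}
```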
This offers an opportunity to retire the big dispatch function invoke_line_functions generated in https://gitlab.cern.ch/lhcb/Allen/-/blob/master/configuration/parser/ParseAlgorithms.py#L711 and to go back to having one kernel per line.
The kernels would still have to run in parallel so as not to lose any throughput. One way to achieve this is CUDA's dynamic parallelism (i.e. launching kernels from within another kernel): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-setup-apis
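A minimal sketch of what such a dispatch could look like; line_a, line_b and dispatch_lines are invented names, not Allen code, and the whole thing assumes relocatable device code (nvcc -rdc=true ... -lcudadevrt) on compute capability >= 3.5:

```cpp
__global__ void line_a(const float* input, bool* decisions, int n_events)
{
  const int event = blockIdx.x * blockDim.x + threadIdx.x;
  if (event < n_events) decisions[event] = input[event] > 0.f; // placeholder selection
}

__global__ void line_b(const float* input, bool* decisions, int n_events)
{
  const int event = blockIdx.x * blockDim.x + threadIdx.x;
  if (event < n_events) decisions[event] = input[event] < 1.f; // placeholder selection
}

using line_kernel_t = void (*)(const float*, bool*, int);

// Parent kernel: one thread per line launches that line's kernel into its
// own non-blocking stream, so the children can run concurrently.
__global__ void dispatch_lines(const float* input, bool* decisions, int n_events)
{
  // Taking the address of a __global__ function is valid in device code
  // on compute capability >= 3.5, so the per-line kernels can live in a
  // plain table instead of a generated switch.
  const line_kernel_t lines[] = {line_a, line_b};
  constexpr int n_lines = sizeof(lines) / sizeof(lines[0]);

  const int line = threadIdx.x;
  if (line >= n_lines) return;

  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  const int block_dim = 256;
  const int grid_dim = (n_events + block_dim - 1) / block_dim;
  lines[line]<<<grid_dim, block_dim, 0, stream>>>(
    input, decisions + line * n_events, n_events);

  cudaStreamDestroy(stream);
  // A parent grid is not considered complete until all of its child grids
  // have completed, so no explicit synchronization is needed here.
}
```

The host would then launch dispatch_lines with a single block of at least n_lines threads, e.g. dispatch_lines<<<1, 32>>>(d_input, d_decisions, n_events).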
This would also enable dynamic loading of modules containing new lines, which would bring us one step closer to code generation for lines (@acasaisv @thboettc FYI).
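On the dynamic-loading point, a hedged host-side sketch with the CUDA driver API; new_line.cubin and new_line are invented names, and the kernel in the module would need extern "C" linkage (or one would look up the mangled name):

```cpp
#include <cuda.h>

// Load a compiled module at runtime and launch one of its kernels.
// Error handling omitted: every cu* call returns a CUresult that real
// code should check.
void run_dynamically_loaded_line(CUdeviceptr d_decisions, int n_events)
{
  CUmodule module;
  CUfunction kernel;
  cuModuleLoad(&module, "new_line.cubin");          // invented file name
  cuModuleGetFunction(&kernel, module, "new_line"); // invented kernel name

  void* args[] = {&d_decisions, &n_events};
  const unsigned block_dim = 256;
  const unsigned grid_dim = (n_events + block_dim - 1) / block_dim;
  cuLaunchKernel(kernel, grid_dim, 1, 1, block_dim, 1, 1, 0, nullptr, args, nullptr);

  cuCtxSynchronize();
  cuModuleUnload(module);
}
```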