Draft: Use CUDA's dynamic parallelism for dispatching line functions
Closes #503
Depends on !1456 (merged)
Throughput is unchanged compared to !1456 (merged), or possibly slightly higher:
NVIDIA GeForce RTX 3090 │█████████████████████████████████████ 124.64 kHz
NVIDIA RTX A5000 │█████████████████████████████ 99.16 kHz
NVIDIA GeForce RTX 2080 Ti │████████████████████████ 82.53 kHz
AMD EPYC 7502 32-Core │███ 10.67 kHz
┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
0 20 40 60 80 100 120 140
TODO:
- try launching the line kernel with a block size tailored to the number of objects and the event list
- try JIT compiling, loading, and running lines using NVRTC: https://docs.nvidia.com/cuda/archive/10.1/pdf/NVRTC_User_Guide.pdf
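
For the first TODO item, a minimal host-side sketch of how a tailored block size might be picked. Everything here is an assumption for illustration, not code from this MR: the helper name `block_size_for`, the choice to round up to a multiple of the warp size, and the constants (32 threads per warp, at most 1024 threads per block, which holds for current NVIDIA GPUs).

```cpp
#include <cstddef>

// Assumed hardware limits on current NVIDIA GPUs.
constexpr unsigned warp_size = 32;
constexpr unsigned max_block_size = 1024;

// Hypothetical helper: returns the smallest multiple of the warp size that
// covers n_objects, clamped to the range [warp_size, max_block_size].
inline unsigned block_size_for(std::size_t n_objects) {
  // Too many objects for one thread each: use the largest block and let
  // threads loop over the remainder.
  if (n_objects >= max_block_size) return max_block_size;
  // Never launch less than one full warp.
  if (n_objects <= warp_size) return warp_size;
  // Round up to the next multiple of the warp size.
  return static_cast<unsigned>((n_objects + warp_size - 1) / warp_size * warp_size);
}
```

The result would be passed as the block dimension of the line kernel launch. Objects beyond 1024 would still need a strided loop inside the kernel, and whether this beats a fixed block size is exactly what the TODO item is meant to find out.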
Edited by Arthur Marius Hennequin