Optimise run_lines
Throughput of branch ahennequ_lines (ecd8891f), sequence hlt1_pp_forward_then_matching over dataset upgrade_mc_minbias_scifi_v5_retinacluster_000_v1_newLHCbID_new_UT_geometry build options default:
NVIDIA GeForce RTX 3090 │█████████████████████████████████████ 123.86 kHz (1.09x)
NVIDIA RTX A5000 │█████████████████████████████ 97.77 kHz (1.08x)
NVIDIA GeForce RTX 2080 Ti │████████████████████████ 82.98 kHz (1.12x)
AMD EPYC 7502 32-Core │█ 5.03 kHz (1.00x)
┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
0 20 40 60 80 100 120 140
Done:
- Store selection output in a bitmask (1 bit per selection) instead of a bool array. This allows for some optimisations:
  - Quickly count the number of selected objects, 32 objects at a time, using __popc
  - Quickly find if the span is empty
  - Quickly fill/init the span with a value
  - Reduced memory footprint (factor 8)
- Grouped the outputs that every line needs into a single struct, so that we can modify them without refactoring all the lines
- Moved the prescaler and event list mask into another kernel
- In run_line: use 1 block per line (friendlier to the instruction cache)
- Generate an event list as output of the prescaler instead of a mask, and run on it instead of rechecking each event in run_lines
- Dynamic thread balancing between events and objects
- Adapted the selection documentation
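
For reviewers unfamiliar with the bitmask idea, here is a minimal host-side C++ sketch of a one-bit-per-object selection mask. It is an illustration only: `SelectionMask`, `popcount32` and the member names are hypothetical and are not the types used in this branch. On the device, the CUDA intrinsic `__popc` would take the place of `popcount32`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable stand-in for the CUDA intrinsic __popc: counts the set bits of a word.
inline unsigned popcount32(std::uint32_t w) {
  unsigned n = 0;
  for (; w != 0u; w &= w - 1u) ++n;  // each iteration clears the lowest set bit
  return n;
}

// Hypothetical selection mask: one bit per object, packed into 32-bit words.
struct SelectionMask {
  std::vector<std::uint32_t> words;
  std::size_t n_objects;

  explicit SelectionMask(std::size_t n) : words((n + 31) / 32, 0u), n_objects(n) {}

  void set(std::size_t i, bool selected) {
    const std::uint32_t bit = 1u << (i % 32);
    if (selected) words[i / 32] |= bit;
    else words[i / 32] &= ~bit;
  }

  bool test(std::size_t i) const { return ((words[i / 32] >> (i % 32)) & 1u) != 0u; }

  // Fill whole words at once; clear the padding bits of the last word
  // so that count() never sees bits beyond n_objects.
  void fill(bool value) {
    for (auto& w : words) w = value ? 0xFFFFFFFFu : 0u;
    const std::size_t tail = n_objects % 32;
    if (value && tail != 0) words.back() &= (1u << tail) - 1u;
  }

  // Count selected objects, 32 objects per popcount.
  std::size_t count() const {
    std::size_t n = 0;
    for (auto w : words) n += popcount32(w);
    return n;
  }

  // The span is empty if every word is zero.
  bool empty() const {
    for (auto w : words) if (w != 0u) return false;
    return true;
  }
};
```

A bool array spends one byte per selection, while the mask spends one bit, which is where the factor 8 in memory footprint comes from.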
Follow-up:
- Test using kernels instead of dispatched device functions, via dynamic parallelism (#503)
Motivation for dynamic thread balancing:
line 1, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 8, n_events_list=500 n_events_prescalled=500 n_average_objects=1 <= optimal block shape: (256 x 1)
line 30, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 25, n_events_list=500 n_events_prescalled=92 n_average_objects=18
line 21, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 38, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 35, n_events_list=500 n_events_prescalled=1 n_average_objects=331 <= optimal block shape: (1 x 256)
line 39, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 50, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 31, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 22, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 34, n_events_list=500 n_events_prescalled=500 n_average_objects=331
line 33, n_events_list=500 n_events_prescalled=500 n_average_objects=240
line 41, n_events_list=500 n_events_prescalled=500 n_average_objects=18
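
The numbers above suggest picking the block shape per line from the number of prescaled events and the average number of objects. The following C++ sketch shows one possible heuristic that reproduces the two annotated shapes. It is an assumption for illustration, not necessarily the balancing implemented in this branch; `BlockShape`, `pow2_at_least` and `choose_block_shape` are hypothetical names, and the block size of 256 threads is taken from the annotations.

```cpp
// Hypothetical block shape: threads along events times threads along objects.
struct BlockShape {
  unsigned events;
  unsigned objects;
};

// Smallest power of two >= n, clamped to [1, limit]. limit is assumed to be a power of two.
inline unsigned pow2_at_least(unsigned n, unsigned limit) {
  unsigned p = 1;
  while (p < n && p < limit) p *= 2;
  return p;
}

// Split block_size threads between events and objects.
// n_events is the number of events left after the prescaler.
inline BlockShape choose_block_shape(unsigned n_events, unsigned n_average_objects, unsigned block_size = 256) {
  // Give the object dimension enough threads to cover an average event.
  unsigned objects = pow2_at_least(n_average_objects, block_size);
  unsigned events = block_size / objects;

  // If there are fewer events than event threads, hand the spare threads to objects.
  const unsigned needed_events = pow2_at_least(n_events, block_size);
  if (events > needed_events) {
    events = needed_events;
    objects = block_size / events;
  }
  return {events, objects};
}
```

With these assumptions, line 8 (500 prescaled events, 1 object on average) gets a (256 x 1) block and line 35 (1 prescaled event, 331 objects on average) gets a (1 x 256) block, matching the annotations above.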
Edited by Arthur Marius Hennequin