Optimise run_lines (!1456) · Merge requests · LHCb / Allen

Arthur Marius Hennequin requested to merge ahennequ_lines into master Feb 22, 2024

Throughput of branch ahennequ_lines (ecd8891f), sequence hlt1_pp_forward_then_matching over dataset upgrade_mc_minbias_scifi_v5_retinacluster_000_v1_newLHCbID_new_UT_geometry build options default:
NVIDIA GeForce RTX 3090    │█████████████████████████████████████      123.86 kHz (1.09x)
NVIDIA RTX A5000           │█████████████████████████████              97.77 kHz (1.08x)
NVIDIA GeForce RTX 2080 Ti │████████████████████████                   82.98 kHz (1.12x)
AMD EPYC 7502 32-Core      │█                                          5.03 kHz (1.00x)
                           ┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
                           0     20    40    60    80   100   120   140

Done:

Store the selection output in a bitmask (1 bit per selection) instead of a bool array. This allows for some optimisations:
- Quickly count the number of selected objects, 32 at a time, using __popc
- Quickly check whether the span is empty
- Quickly fill/initialise the span with a value
- Reduced memory footprint (factor 8)
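To illustrate the bitmask idea, here is a minimal host-side sketch in C++. It is not the code from this MR: `SelectionMask` and its members are hypothetical names, and `std::bitset<32>::count()` stands in for the device intrinsic `__popc`. The factor 8 assumes a 1-byte `bool`.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the selection output: 1 bit per selection,
// packed into 32-bit words (a bool array would use 1 byte per selection).
struct SelectionMask {
  std::size_t n_bits;
  std::vector<std::uint32_t> words;

  explicit SelectionMask(std::size_t n) : n_bits(n), words((n + 31) / 32, 0u) {}

  void set(std::size_t i, bool value) {
    assert(i < n_bits);
    const std::uint32_t bit = 1u << (i % 32);
    if (value) words[i / 32] |= bit;
    else words[i / 32] &= ~bit;
  }

  bool test(std::size_t i) const {
    assert(i < n_bits);
    return ((words[i / 32] >> (i % 32)) & 1u) != 0u;
  }

  // Fill/init one whole word (32 selections) per store.
  // Padding bits in the last word are kept at zero so count() stays exact.
  void fill(bool value) {
    for (auto& w : words) w = value ? 0xFFFFFFFFu : 0u;
    const std::size_t tail = n_bits % 32;
    if (value && tail != 0) words.back() &= (1u << tail) - 1u;
  }

  // Count selected objects 32 at a time (on device this would be __popc).
  std::size_t count() const {
    std::size_t n = 0;
    for (auto w : words) n += std::bitset<32>(w).count();
    return n;
  }

  // Emptiness check touches one word per 32 selections.
  bool empty() const {
    for (auto w : words)
      if (w != 0u) return false;
    return true;
  }
};
```

For 70 selections this uses 3 words (12 bytes) instead of 70 bytes, and `count()`, `empty()` and `fill()` each loop over 3 words rather than 70 elements.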
Grouped the outputs that every line needs into a single struct, so that they can be modified without refactoring all the lines
Moved the prescaler and event list mask into another kernel
In run_line: use 1 block per line (friendlier to the instruction cache)
Generate an event list as output of the prescaler instead of a mask, and run on it instead of rechecking each event in run_lines
Dynamic thread balancing between events and objects
Adapted the selection documentation
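The "event list instead of a mask" item can be pictured with the following host-side C++ sketch. It is only an illustration under assumptions: `prescale_to_event_list` and the `accept` predicate are hypothetical names, and the real prescaler decision is not reproduced here.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical prescaler step: instead of writing one accept/reject flag per
// event, emit a compact list of the event numbers that survive.
std::vector<std::uint32_t> prescale_to_event_list(
  const std::vector<std::uint32_t>& event_list,
  const std::function<bool(std::uint32_t)>& accept)
{
  std::vector<std::uint32_t> prescaled;
  prescaled.reserve(event_list.size());
  for (const auto event_number : event_list) {
    if (accept(event_number)) prescaled.push_back(event_number);
  }
  return prescaled;
}
```

A line then loops over the returned list only, so a heavily prescaled line does no per-event work for rejected events and does not have to re-read a mask for each of them.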

Follow-up:

Test using kernels instead of dispatched device functions, using dynamic parallelism (#503)

Motivation for dynamic thread balancing:

line 1, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 8, n_events_list=500 n_events_prescalled=500 n_average_objects=1 <= optimal block shape: (256 x 1)
line 30, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 25, n_events_list=500 n_events_prescalled=92 n_average_objects=18
line 21, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 38, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 35, n_events_list=500 n_events_prescalled=1 n_average_objects=331 <= optimal block shape: (1 x 256)
line 39, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 50, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 31, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 22, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 34, n_events_list=500 n_events_prescalled=500 n_average_objects=331
line 33, n_events_list=500 n_events_prescalled=500 n_average_objects=240
line 41, n_events_list=500 n_events_prescalled=500 n_average_objects=18
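One possible way to turn these numbers into a block shape is sketched below in host-side C++. This is an assumed heuristic, not necessarily the one implemented in this MR: `balance_threads`, `BlockShape` and the power-of-two rounding are hypothetical, and the block size is assumed to be a power of two (256 here).

```cpp
#include <algorithm>

// Threads per block along the event and object dimensions.
struct BlockShape {
  unsigned events;
  unsigned objects;
};

// Smallest power of two >= x, for 1 <= x <= block size.
unsigned next_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x) p <<= 1;
  return p;
}

// Assumed heuristic: give the object dimension enough threads to cover the
// average number of objects, and the rest to events. If there are fewer events
// than event threads, hand the spare threads back to the object dimension.
BlockShape balance_threads(unsigned n_events_prescaled, unsigned n_average_objects, unsigned block_size = 256) {
  const unsigned objects_wanted = std::min(block_size, std::max(1u, n_average_objects));
  unsigned objects = next_pow2(objects_wanted);
  unsigned events = block_size / objects;

  const unsigned events_cap = next_pow2(std::min(block_size, std::max(1u, n_events_prescaled)));
  if (events > events_cap) {
    events = events_cap;
    objects = block_size / events;
  }
  return {events, objects};
}
```

With the values above, `balance_threads(500, 1)` gives (256 x 1) as for line 8 and `balance_threads(1, 331)` gives (1 x 256) as for line 35. Intermediate cases fall in between, for example (4 x 64) for 500 events with 46 objects on average.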

FYI @dovombru @cagapopo @raaij @gligorov

Edited Feb 24, 2024 by Arthur Marius Hennequin
