
Optimise run_lines

Arthur Marius Hennequin requested to merge ahennequ_lines into master
Throughput of branch ahennequ_lines (ecd8891f), sequence hlt1_pp_forward_then_matching over dataset upgrade_mc_minbias_scifi_v5_retinacluster_000_v1_newLHCbID_new_UT_geometry build options default:
NVIDIA GeForce RTX 3090    │█████████████████████████████████████      123.86 kHz (1.09x)
NVIDIA RTX A5000           │█████████████████████████████              97.77 kHz (1.08x)
NVIDIA GeForce RTX 2080 Ti │████████████████████████                   82.98 kHz (1.12x)
AMD EPYC 7502 32-Core      │█                                          5.03 kHz (1.00x)
                           ┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼──┴──┼
                           0     20    40    60    80   100   120   140 

Done:

  • Store selection output in a bitmask (1 bit per selection) instead of a bool array. This allows several optimisations:
    • Quickly count number of selected objects, 32 objects at a time, using __popc
    • Quickly find if the span is empty
    • Quickly fill/init the span with a value
    • Reduced memory footprint (factor 8)
  • Grouped the outputs that every line needs into a single struct, so that we can modify them without refactoring all the lines
  • Moved the prescaler and event list mask into another kernel
  • In run_line: use 1 block per line (friendlier to the instruction cache)
  • Generate an event list as output of the prescaler instead of a mask, and run over that list instead of rechecking each event in run_lines
  • Dynamic thread balancing between events and objects
  • Adapted the selection documentation
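
For readers unfamiliar with the bitmask layout, here is a minimal host-side C++ sketch of the idea, not the code from this branch. The `SelectionMask` type and its member names are hypothetical; on the device, the per-word count would use `__popc` where this sketch uses `std::bitset<32>::count()`.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the selection output: one bit per object,
// packed into 32-bit words (8x smaller than one byte-sized bool per object).
struct SelectionMask {
  std::vector<std::uint32_t> words;
  std::size_t n_objects;

  explicit SelectionMask(std::size_t n) : words((n + 31) / 32, 0u), n_objects(n) {}

  void set(std::size_t i, bool selected)
  {
    const std::uint32_t bit = 1u << (i % 32);
    if (selected) words[i / 32] |= bit;
    else words[i / 32] &= ~bit;
  }

  bool test(std::size_t i) const { return (words[i / 32] >> (i % 32)) & 1u; }

  // Fill/init: one store per 32 objects. Unused tail bits are cleared
  // so that count() stays exact.
  void fill(bool value)
  {
    for (auto& w : words) w = value ? 0xFFFFFFFFu : 0u;
    const std::size_t tail = n_objects % 32;
    if (value && tail != 0) words.back() &= (1u << tail) - 1u;
  }

  // Count selected objects 32 at a time (device equivalent: __popc(word)).
  std::size_t count() const
  {
    std::size_t n = 0;
    for (auto w : words) n += std::bitset<32>(w).count();
    return n;
  }

  // Emptiness check: one comparison per 32 objects.
  bool empty() const
  {
    for (auto w : words)
      if (w != 0u) return false;
    return true;
  }
};
```

With 70 objects the mask occupies three 32-bit words, and counting, emptiness checks and fills each touch three words rather than 70 bools.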

Follow-up:

  • Test using kernels (launched via dynamic parallelism) instead of dispatched device functions #503

Motivation for dynamic thread balancing:

line 1, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 8, n_events_list=500 n_events_prescalled=500 n_average_objects=1 <= optimal block shape: (256 x 1)
line 30, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 25, n_events_list=500 n_events_prescalled=92 n_average_objects=18
line 21, n_events_list=500 n_events_prescalled=500 n_average_objects=46
line 38, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 35, n_events_list=500 n_events_prescalled=1 n_average_objects=331 <= optimal block shape: (1 x 256)
line 39, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 50, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 31, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 22, n_events_list=500 n_events_prescalled=500 n_average_objects=18
line 34, n_events_list=500 n_events_prescalled=500 n_average_objects=331
line 33, n_events_list=500 n_events_prescalled=500 n_average_objects=240
line 41, n_events_list=500 n_events_prescalled=500 n_average_objects=18
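
The numbers above suggest picking the block shape per line from the number of prescaled events and the average number of objects. A possible heuristic is sketched below in host-side C++; it is an illustration under stated assumptions, not the implementation in this branch. `BlockShape`, `choose_block_shape` and the power-of-two rounding are hypothetical, and a block size of 256 threads is assumed from the shapes quoted above.

```cpp
#include <algorithm>

// Hypothetical block shape: threads along events x threads along objects.
struct BlockShape {
  unsigned events;
  unsigned objects;
};

// Smallest power of two >= v, for v >= 1.
inline unsigned next_pow2(unsigned v)
{
  unsigned p = 1;
  while (p < v) p <<= 1;
  return p;
}

// Assumes block_size is a power of two (256 in the examples above).
inline BlockShape choose_block_shape(unsigned n_events, unsigned n_average_objects, unsigned block_size = 256)
{
  // Enough threads along the object dimension to cover an average event,
  // clamped to [1, block_size] before rounding up to a power of two.
  const unsigned clamped = std::min(std::max(n_average_objects, 1u), block_size);
  unsigned objects = next_pow2(clamped);
  unsigned events = block_size / objects;

  // If there are fewer events than event threads, hand the idle
  // threads over to the object dimension.
  while (events > 1 && events / 2 >= std::max(n_events, 1u)) {
    events /= 2;
    objects *= 2;
  }
  return {events, objects};
}
```

Under these assumptions the sketch reproduces the two annotated cases: 500 events with 1 object on average gives (256 x 1), and 1 prescaled event with 331 objects gives (1 x 256). Intermediate cases such as 500 events with 46 objects give (4 x 64).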

FYI @dovombru @cagapopo @raaij @gligorov

Edited by Arthur Marius Hennequin
