Improve throughput of forward tracking's triplet search (!1735) · Merge requests · LHCb / Allen

Arthur Marius Hennequin requested to merge ahennequ_forward into 2024-patches Aug 01, 2024

Use a more direct approach to find triplet candidates. Instead of the two-stage algorithm that finds doublets first, stores them in shared memory, and then looks for the third hit, this version finds triplets on the fly without intermediate storage. This removes the hard limit on the number of doublets. The benefits are:

No technical (non-physical) cut on the number of doublets
Lower shared memory usage, which allows more blocks to be launched in parallel
Less computation per thread needed to check candidate validity
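
To illustrate the first point, here is a minimal host-side C++ sketch, not the Allen code. It compares a two-stage search with a fixed-capacity doublet buffer against an on-the-fly search. Everything in it is an assumption made for illustration: the `compatible` predicate, the `max_doublets` capacity, and the one-dimensional hits.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real hit compatibility selection.
inline bool compatible(float a, float b, float tol) { return std::fabs(a - b) <= tol; }

// Assumed capacity of the intermediate doublet buffer (shared memory on a GPU).
constexpr std::size_t max_doublets = 4;

// Two-stage search: store doublets in a fixed-size buffer, then find third hits.
// Doublets that do not fit in the buffer are silently dropped (the technical cut).
inline std::size_t count_triplets_two_stage(
  const std::vector<float>& l0, const std::vector<float>& l1,
  const std::vector<float>& l2, float tol)
{
  struct Doublet { float x0, x1; };
  std::array<Doublet, max_doublets> buffer{};
  std::size_t n_doublets = 0;
  for (float x0 : l0)
    for (float x1 : l1)
      if (compatible(x0, x1, tol) && n_doublets < max_doublets) buffer[n_doublets++] = {x0, x1};

  std::size_t n_triplets = 0;
  for (std::size_t i = 0; i < n_doublets; ++i)
    for (float x2 : l2)
      if (compatible(buffer[i].x1, x2, tol)) ++n_triplets;
  return n_triplets;
}

// On-the-fly search: look for the third hit as soon as a doublet is found,
// so no intermediate buffer exists and nothing is dropped.
inline std::size_t count_triplets_on_the_fly(
  const std::vector<float>& l0, const std::vector<float>& l1,
  const std::vector<float>& l2, float tol)
{
  std::size_t n_triplets = 0;
  for (float x0 : l0)
    for (float x1 : l1)
      if (compatible(x0, x1, tol))
        for (float x2 : l2)
          if (compatible(x1, x2, tol)) ++n_triplets;
  return n_triplets;
}
```

In this toy model the two functions agree as long as the number of doublets stays below `max_doublets`. Once it is exceeded, the two-stage version loses candidates for a reason unrelated to the physics selection.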

This version works by reversing the check. Instead of extrapolating each doublet to the magnet and checking whether it falls within the interval, the algorithm computes bounds for each first hit. By construction, every second hit within those bounds is valid, so no further check is needed. For with_ut, the bounds computation is more complex, as it has to take into account the sagitta correction and the tx sign.
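
The reversed check can be sketched in a simplified form as follows. This assumes a straight-line extrapolation to a magnet plane at `z_mag` and an acceptance interval `[lo, hi]` there. It ignores the sagitta correction and the tx sign handling needed for with_ut. All names (`Window`, `second_hit_bounds`, `hits_in_bounds`) are hypothetical and do not come from the Allen code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical closed interval in x.
struct Window { float lo; float hi; };

// Forward check: extrapolate the doublet (x0, z0)-(x1, z1) linearly to z_mag
// and test whether it lands inside the window.
inline bool doublet_in_window(float x0, float z0, float x1, float z1, float z_mag, Window w)
{
  const float x_mag = x0 + (x1 - x0) * (z_mag - z0) / (z1 - z0);
  return x_mag >= w.lo && x_mag <= w.hi;
}

// Reversed check: solve the same linear relation for x1, once per first hit.
// The scale factor may be negative (magnet upstream of both layers), which
// swaps the two bounds, hence the min/max.
inline Window second_hit_bounds(float x0, float z0, float z1, float z_mag, Window w)
{
  assert(z_mag != z0 && z1 != z0);
  const float scale = (z1 - z0) / (z_mag - z0);
  const float a = x0 + (w.lo - x0) * scale;
  const float b = x0 + (w.hi - x0) * scale;
  return {std::min(a, b), std::max(a, b)};
}

// Index range [first, last) of second-layer hits inside the bounds.
// Assumes the hit positions are sorted in ascending x.
inline std::pair<std::size_t, std::size_t> hits_in_bounds(const std::vector<float>& sorted_x, Window b)
{
  const auto first = std::lower_bound(sorted_x.begin(), sorted_x.end(), b.lo);
  const auto last = std::upper_bound(sorted_x.begin(), sorted_x.end(), b.hi);
  return {static_cast<std::size_t>(first - sorted_x.begin()),
          static_cast<std::size_t>(last - sorted_x.begin())};
}
```

In this toy model, every hit in the range returned by `hits_in_bounds` passes `doublet_in_window` and every hit outside it fails, so the per-doublet extrapolation is replaced by two bound computations per first hit.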

Throughput of branch ahennequ_forward (3a1e8042), sequence hlt1_pp_forward_then_matching_no_ut over dataset Beam6800GeV-expected-2024-MagDown-nu7.6_MinBiasMD build options default:

NVIDIA GeForce RTX 3090    │█████████████████████████████████████████        102.93 kHz (1.01x)
NVIDIA RTX A5000           │██████████████████████████████████               85.20 kHz (1.03x)
NVIDIA GeForce RTX 2080 Ti │█████████████████████████                        62.67 kHz (1.08x)
AMD EPYC 7502 32-Core      │███                                              8.35 kHz (1.44x)
                           ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
                           0       20      40      60      80     100     120

https://mattermost.web.cern.ch/lhcb/pl/rxwso5c8bbnbdx4x1o7baked7o

Throughput of branch ahennequ_forward (3a1e8042), sequence hlt1_pp_forward_then_matching over dataset Beam6800GeV-expected-2024-MagDown-nu7.6_MinBiasMD build options default:

NVIDIA GeForce RTX 3090    │██████████████████████████████████████████       107.48 kHz (1.02x)
NVIDIA RTX A5000           │███████████████████████████████████              89.80 kHz (1.04x)
NVIDIA GeForce RTX 2080 Ti │█████████████████████████                        64.46 kHz (1.03x)
AMD EPYC 7502 32-Core      │███                                              9.79 kHz (1.08x)
                           ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
                           0       20      40      60      80     100     120

https://mattermost.web.cern.ch/lhcb/pl/fjr4kybz5pyk3yohi94ccbsqic

Small changes in physics efficiencies are expected, since the technical cut imposed by the shared memory size limit is removed.

FYI @ascarabo @gligorov @sstahl

Edited Aug 21, 2024 by Arthur Marius Hennequin
