Improve throughput of forward tracking's triplet search
Use a more direct approach to find triplet candidates. Instead of the two-stage algorithm that first finds doublets, stores them in shared memory, and then searches for the third hit, this version finds triplets on the fly without intermediate storage. This removes the hard limit on the number of doublets. The benefits are:
- No technical (non-physical) cut on the number of doublets
- Lower shared memory usage, which allows more blocks to be launched in parallel
- Less computation per thread to check candidate validity
This version works by reversing the check. Instead of extrapolating each doublet to the magnet and checking whether it falls within the interval, bounds are computed once for each first hit. By construction, every second hit within those bounds is valid, so no further check is needed. For with_ut, the bounds computation is somewhat more complex, as it has to take into account the sagitta correction and the sign of tx.
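The reversed check can be illustrated with a minimal host-side C++ sketch. It is not the code in this branch: it assumes a straight-line extrapolation to a single magnet plane, a symmetric window of half-width `tol` around an expected position, and second-layer hits sorted by x. The names `Geometry`, `doublet_in_window`, `second_hit_bounds` and `for_each_second_hit_in_bounds` are hypothetical, and the with_ut sagitta correction and tx sign handling are not reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical geometry, for illustration only (z positions in mm).
struct Geometry {
  double z_first;   // layer holding the first hit
  double z_second;  // layer holding the second hit
  double z_magnet;  // plane to which doublets are extrapolated
};

// Two-stage style check: extrapolate one doublet to the magnet plane and
// test whether it lands inside the window [x_expected - tol, x_expected + tol].
inline bool doublet_in_window(const Geometry& g, double x1, double x2,
                              double x_expected, double tol) {
  const double slope = (x2 - x1) / (g.z_second - g.z_first);
  const double x_at_magnet = x1 + slope * (g.z_magnet - g.z_first);
  return std::fabs(x_at_magnet - x_expected) <= tol;
}

// Reversed check: invert the extrapolation once per first hit to obtain the
// range of second-hit x positions that would land inside the window.
inline std::pair<double, double> second_hit_bounds(const Geometry& g, double x1,
                                                   double x_expected, double tol) {
  assert(g.z_magnet != g.z_first);  // otherwise the inversion is undefined
  const double ratio = (g.z_second - g.z_first) / (g.z_magnet - g.z_first);
  const double a = x1 + (x_expected - tol - x1) * ratio;
  const double b = x1 + (x_expected + tol - x1) * ratio;
  // ratio may be negative (magnet upstream of the layers), so order the bounds.
  return {std::min(a, b), std::max(a, b)};
}

// Visit every second hit inside the bounds. The hits are assumed to be sorted
// by x, so the range is located by binary search and no per-doublet
// extrapolation or intermediate doublet storage is needed.
template <typename Visitor>
void for_each_second_hit_in_bounds(const std::vector<double>& sorted_x2,
                                   const std::pair<double, double>& bounds,
                                   Visitor&& visit) {
  const auto first = std::lower_bound(sorted_x2.begin(), sorted_x2.end(), bounds.first);
  const auto last = std::upper_bound(first, sorted_x2.end(), bounds.second);
  for (auto it = first; it != last; ++it) visit(*it);
}
```

In this sketch the visitor is where the third-hit search would take place, so each thread can go straight from a first hit to triplet candidates. Away from floating-point edge cases at the window boundary, both formulations select the same second hits.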
Throughput of branch ahennequ_forward (3a1e8042), sequence hlt1_pp_forward_then_matching_no_ut over dataset Beam6800GeV-expected-2024-MagDown-nu7.6_MinBiasMD build options default:
NVIDIA GeForce RTX 3090 │█████████████████████████████████████████ 102.93 kHz (1.01x)
NVIDIA RTX A5000 │██████████████████████████████████ 85.20 kHz (1.03x)
NVIDIA GeForce RTX 2080 Ti │█████████████████████████ 62.67 kHz (1.08x)
AMD EPYC 7502 32-Core │███ 8.35 kHz (1.44x)
┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
0 20 40 60 80 100 120
https://mattermost.web.cern.ch/lhcb/pl/rxwso5c8bbnbdx4x1o7baked7o
Throughput of branch ahennequ_forward (3a1e8042), sequence hlt1_pp_forward_then_matching over dataset Beam6800GeV-expected-2024-MagDown-nu7.6_MinBiasMD build options default:
NVIDIA GeForce RTX 3090 │██████████████████████████████████████████ 107.48 kHz (1.02x)
NVIDIA RTX A5000 │███████████████████████████████████ 89.80 kHz (1.04x)
NVIDIA GeForce RTX 2080 Ti │█████████████████████████ 64.46 kHz (1.03x)
AMD EPYC 7502 32-Core │███ 9.79 kHz (1.08x)
┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
0 20 40 60 80 100 120
https://mattermost.web.cern.ch/lhcb/pl/fjr4kybz5pyk3yohi94ccbsqic
Small changes in physics efficiencies are expected, because the technical cut imposed by the shared memory size limitation has been removed.