This must be tested with Moore!2337 (merged)
This is the first full implementation of downstream / long-lived particle track reconstruction at the HLT1 level. Details were presented at the first RTA WP2 meeting of 2023, then at the 107th LHCb week and at the RTA WP2 meeting of 30 May 2023.
This MR consists of:
- Downstream track reconstruction in Allen:
  - Two algorithms: DownstreamFindHits, DownstreamCreateTracks
  - Three kernel functions: downstream_create_output_table, downstream_fill_output_table, downstream_create_tracks
  - New reconstruction sequence:
- Downstream track consolidation: extending the UT track struct
  - Two algorithms: DownstreamConsolidate, DownstreamCopyHitNumber
  - Two kernel functions: downstream_consolidate, downstream_copy_hit_number
- Downstream track validator:
  - New validation sequence:
  - New validation sequence:
- Downstream track dumper: before reconstruction and after reconstruction.
  - Two algorithms: HostDownstreamDump, HostPreDownstreamDump
  - New dumper sequences:
- A bash script for power consumption measurement:
This downstream track reconstruction algorithm is part of the HybridSeeding project and also of the long-lived particle tracking project. It runs after the Velo-SciFi matching algorithm and on top of the SciFi seeding !742 (merged): it takes the output of the SciFi seeding, filters out the seeds already used by the long track reconstruction, and extrapolates the remaining seeds to the UT stations to create downstream tracks. The extrapolation idea is inspired by the HLT2 downstream reconstruction; the algorithm is redesigned for the GPU parallel architecture, and all parametrizations are optimized using
To achieve the target throughput, the following CUDA-level optimizations have been considered:
- Cache UT hits in GPU shared memory (up to 1024 hits per layer are allowed). This idea is inspired by Arthur Hennequin's talk on seeding optimization.
- Split the parallelization levels: the 3 kernel functions correspond to 3 different parallelization levels.
- Fast clone killing using shared memory.
- Struct-of-arrays (SoA) internal structs: these allow coalesced CUDA memory access for the data exchanged between the different kernel functions.
- Compact memory structs: half_t, ushort, uint8_t wherever possible.
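The last two points can be illustrated with a minimal host-side C++ sketch. It is not the layout used in this MR: the names `CandidateAoS`, `CandidateTableSoA` and `max_candidates` are hypothetical, and the score is kept as a `float` because standard C++ has no portable half-precision type (the MR uses `half_t` for this kind of field).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical capacity, chosen to match the "up to 256 candidates per event" quoted in this MR.
constexpr std::size_t max_candidates = 256;

// Array-of-structs layout: thread i reads one whole record, so neighbouring
// threads touch addresses sizeof(CandidateAoS) bytes apart (typically padded to 8 bytes).
struct CandidateAoS {
  float score;
  std::uint16_t seed_index;
  std::uint8_t n_hits;
};

// Struct-of-arrays layout: thread i reads element i of each array, so neighbouring
// threads touch neighbouring addresses, which is what allows coalesced access on a GPU.
struct CandidateTableSoA {
  std::array<float, max_candidates> score {};
  std::array<std::uint16_t, max_candidates> seed_index {};
  std::array<std::uint8_t, max_candidates> n_hits {};
  std::uint16_t size = 0;

  // Appends one candidate and returns its slot. The caller must respect the capacity.
  std::uint16_t push_back(float s, std::uint16_t seed, std::uint8_t hits)
  {
    assert(size < max_candidates);
    score[size] = s;
    seed_index[size] = seed;
    n_hits[size] = hits;
    return size++;
  }
};
```

In the SoA table each field costs exactly its own width per candidate (4 + 2 + 1 bytes here), with no per-record padding.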
The algorithm is implemented using 3 main kernel functions:
The first kernel, `downstream_create_output_table`, runs in parallel over 128 SciFi seeds per event and performs the following steps:
- Pre-filters the unmatched seeds, removing low-momentum candidates with p < 1400 MeV and pt < 400 MeV.
- Extrapolates each SciFi seed candidate to the last UT X-layer.
- Updates the extrapolation slope with each newly added hit.
- Removes tracks originating from the beam pipe and tracks outside the UT acceptance.
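The pre-filter and extrapolation steps of the first kernel can be sketched on the host as follows. This is only an illustration under assumptions: `SeedState`, `passes_prefilter` and `extrapolate_x` are hypothetical names, a seed is assumed to be kept only if it passes both thresholds, and a straight line stands in for the parametrized extrapolation through the magnetic field, which this sketch does not attempt to reproduce.

```cpp
#include <cmath>

// Hypothetical seed state: position in mm, slopes tx = dx/dz and ty = dy/dz, q/p in 1/MeV.
struct SeedState {
  float x, y, z, tx, ty, qop;
};

// Thresholds quoted in this MR description.
constexpr float min_p = 1400.f;  // MeV
constexpr float min_pt = 400.f;  // MeV

float momentum(const SeedState& s) { return 1.f / std::fabs(s.qop); }

// pt = p * sqrt(tx^2 + ty^2) / sqrt(1 + tx^2 + ty^2)
float transverse_momentum(const SeedState& s)
{
  const float slope2 = s.tx * s.tx + s.ty * s.ty;
  return momentum(s) * std::sqrt(slope2 / (1.f + slope2));
}

// Assumption: a seed survives only if it passes both the p and the pt threshold.
bool passes_prefilter(const SeedState& s)
{
  return momentum(s) >= min_p && transverse_momentum(s) >= min_pt;
}

// Straight-line stand-in for the extrapolation to a target z (for example a UT layer).
float extrapolate_x(const SeedState& s, float z_target)
{
  return s.x + s.tx * (z_target - s.z);
}
```

In a kernel, each thread would apply these checks to one seed and write the surviving candidates to the output table.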
The second kernel, `downstream_fill_output_table`, runs in parallel over the candidates remaining after matching hits in the last UT layer (up to 256 candidates per event) and performs the following steps:
- Adds hits from the remaining layers, first from the remaining X layer and then from the remaining UV layers.
- For every hit added, computes and updates the score based on the distance between the extrapolated and the measured UT hit positions.
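The hit-adding and scoring loop can be sketched as below. The exact score definition is not given in this description, so the sketch assumes the score is the accumulated absolute residual (lower is better) and that the closest hit within a tolerance window is taken. `Candidate`, `closest_hit` and `add_layer` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical candidate record: accumulated score and number of UT hits attached so far.
struct Candidate {
  float score = 0.f;
  std::uint8_t n_hits = 0;
};

// Returns the index of the hit closest to x_pred within +-tol, or -1 if there is none.
int closest_hit(const std::vector<float>& hit_x, float x_pred, float tol)
{
  int best = -1;
  float best_dist = tol;
  for (std::size_t i = 0; i < hit_x.size(); ++i) {
    const float dist = std::fabs(hit_x[i] - x_pred);
    if (dist <= best_dist) {
      best_dist = dist;
      best = static_cast<int>(i);
    }
  }
  return best;
}

// Tries to attach one hit from a layer and updates the score with its residual.
// Returns false, leaving the candidate untouched, when no hit is inside the window.
bool add_layer(Candidate& c, const std::vector<float>& hit_x, float x_pred, float tol)
{
  const int i = closest_hit(hit_x, x_pred, tol);
  if (i < 0) return false;
  c.score += std::fabs(hit_x[static_cast<std::size_t>(i)] - x_pred);
  ++c.n_hits;
  return true;
}
```

Calling `add_layer` once per remaining layer, X layer first and then the UV layers, mirrors the order described above.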
The third kernel, `downstream_create_tracks`, runs in parallel over 128 track candidates per event and performs the following steps:
- Selects the best-scored track candidate for each SciFi seed.
- Performs the fast clone killing implemented in shared memory.
- Computes qop and chi2 and builds the tracks.
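The selection and clone-killing logic can be sketched sequentially on the host. This is not the shared-memory CUDA implementation from this MR. The sketch assumes that a lower score is better and that two tracks are clones when they share more than a configurable number of UT hits. `TrackCandidate`, `best_per_seed`, `kill_clones` and `no_hit` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t no_hit = 0xFFFF;  // marks a UT layer without an attached hit

struct TrackCandidate {
  std::uint16_t seed_index;
  float score;                        // assumption: lower is better
  std::array<std::uint16_t, 4> hits;  // one UT hit index per layer
};

// Counts the layers in which both candidates use the same hit.
int shared_hits(const TrackCandidate& a, const TrackCandidate& b)
{
  int n = 0;
  for (std::size_t l = 0; l < a.hits.size(); ++l)
    if (a.hits[l] != no_hit && a.hits[l] == b.hits[l]) ++n;
  return n;
}

// Keeps only the best-scored candidate for each SciFi seed.
std::vector<TrackCandidate> best_per_seed(const std::vector<TrackCandidate>& cands)
{
  std::vector<TrackCandidate> best;
  for (const auto& c : cands) {
    auto it = std::find_if(best.begin(), best.end(),
                           [&](const TrackCandidate& b) { return b.seed_index == c.seed_index; });
    if (it == best.end()) best.push_back(c);
    else if (c.score < it->score) *it = c;
  }
  return best;
}

// Visits tracks from best to worst score and drops any track that shares
// more than max_shared hits with a track that has already been kept.
std::vector<TrackCandidate> kill_clones(std::vector<TrackCandidate> tracks, int max_shared)
{
  std::sort(tracks.begin(), tracks.end(),
            [](const TrackCandidate& a, const TrackCandidate& b) { return a.score < b.score; });
  std::vector<TrackCandidate> kept;
  for (const auto& t : tracks) {
    const bool clone = std::any_of(kept.begin(), kept.end(),
                                   [&](const TrackCandidate& k) { return shared_hits(t, k) > max_shared; });
    if (!clone) kept.push_back(t);
  }
  return kept;
}
```

On the GPU the pairwise comparison would be spread across threads rather than run as a sorted sequential pass.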
After this, a consolidation step performs the multi-event consolidation of the downstream tracks, which are currently being used for the development of selection lines. This is done by extending the existing `Allen::Views::UT::Consolidated::Track` struct with extra SciFi information and parameters, so that the change is fully backward compatible with the existing `CompassUT` algorithm and can be easily integrated into `BasicParticle` in Allen.
In addition, this MR provides a script to measure the power consumption of an Allen run per GPU, which can then be translated into an algorithm-level power consumption. We want to include this figure of merit when we optimize and present our HLT1 software performance. This algorithm adds about 8 watts of consumption compared to the base sequence. This information can be used to find the right tolerance levels based on physics performance, throughput and power utilization.
The throughput of the algorithm has been measured using `MinBias` samples, and the physics performance has been measured using