Optimized UT decoding, better coverage of UT hits test and fix to UT unique x sector
Changes:
- Optimized UT decoding, which is ported from !1621 (closed) to be independent of !1509 (closed) (optimization table copied below). Also, this MR will make merge conflicts in !1509 (closed) easier to resolve.
- UT hits are now sorted with a unique key and deterministic tiebreak.
- Improved UT hit test coverage. The test comparing HLT1 and HLT2 UT hits is now more stringent: it requires 100% overlap in LHCbID and no duplicated UT hit LHCbIDs.
- Increased the x tolerance used when creating the unique UT sector group x values. This recovers our efficiency of finding UT hits in matching (see below). Also, the unique sector group x is now the smallest x value of the sector group, which should help the binary search.
- Other small changes include code cleanup and refactoring.
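As an illustration of the sorting and test changes listed above, here is a minimal sketch, not the code from this MR. The `Hit` struct, its field names and the choice of LHCbID as the tiebreak are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical hit record; the field names are illustrative, not Allen's.
struct Hit {
  uint32_t sort_key; // assumed primary sort key
  uint32_t lhcb_id;  // assumed to identify a UT channel uniquely
};

// Order by the primary key and break ties on LHCbID, so the result does not
// depend on the order in which hits were produced.
inline void sort_hits(std::vector<Hit>& hits) {
  std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
    return std::tie(a.sort_key, a.lhcb_id) < std::tie(b.sort_key, b.lhcb_id);
  });
}

// Collect the LHCbIDs of a hit container in ascending order.
inline std::vector<uint32_t> sorted_ids(const std::vector<Hit>& hits) {
  std::vector<uint32_t> ids;
  ids.reserve(hits.size());
  for (const auto& h : hits) ids.push_back(h.lhcb_id);
  std::sort(ids.begin(), ids.end());
  return ids;
}

// True if no LHCbID appears more than once.
inline bool has_no_duplicates(const std::vector<Hit>& hits) {
  const auto ids = sorted_ids(hits);
  return std::adjacent_find(ids.begin(), ids.end()) == ids.end();
}

// True if both containers hold exactly the same LHCbIDs, each exactly once.
inline bool full_overlap(const std::vector<Hit>& a, const std::vector<Hit>& b) {
  return has_no_duplicates(a) && has_no_duplicates(b) && sorted_ids(a) == sorted_ids(b);
}
```

With a total order like this, two runs that produce the same set of hits in a different order end up with identical sorted containers, which is what makes an element-by-element comparison of HLT1 and HLT2 hits meaningful.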
More elaborate details below.
Works with Moore!3547 (merged).
FYI: @dovombru @mveghel @jzhuo @hawu @ldufour @cagapopo
Closes: !1621 (closed)
--- HLT1 Efficiencies With/Without UT ---
The following plots are made using the `matching_with_ut_clustering_opt_and_downstream` branch with the change in unique sector group x from this MR. They should be made again once matching with UT is merged to `2024-patches`. The only difference between the orange and blue histograms is whether HLT1 uses UT hits. The input MEP file and HLT2 sequence are exactly the same.
Before, initially shown during LHCb Week.
After tuning the x tolerances of the unique sector group, we recover a lot of charm mesons.
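To make the role of the x tolerance concrete, the following is a rough sketch of grouping sector x positions and looking a position up by binary search. The grouping rule (a sector joins the current group if it lies within `tolerance` of the group's smallest x) and both function names are assumptions for illustration; the source does not spell out the actual criterion.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Group sector x positions. A sector joins the current group if it lies within
// `tolerance` of that group's smallest x; otherwise it starts a new group.
// Returns the smallest x of each group, in ascending order.
inline std::vector<float> unique_group_x(std::vector<float> sector_x, float tolerance) {
  std::sort(sector_x.begin(), sector_x.end());
  std::vector<float> group_min_x;
  for (float x : sector_x) {
    if (group_min_x.empty() || x - group_min_x.back() > tolerance) group_min_x.push_back(x);
  }
  return group_min_x;
}

// Index of the last group whose smallest x is <= x (0 if x lies below all groups).
// Precondition: group_min_x is non-empty and sorted ascending.
inline std::size_t find_group(const std::vector<float>& group_min_x, float x) {
  const auto it = std::upper_bound(group_min_x.begin(), group_min_x.end(), x);
  if (it == group_min_x.begin()) return 0;
  return static_cast<std::size_t>(it - group_min_x.begin()) - 1;
}
```

In this sketch a larger tolerance merges nearby sectors into fewer groups, and storing each group's smallest x means the lookup reduces to a single `std::upper_bound` followed by a step back.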
TrackMVA trigger rate on Run 295336 (mu=3).
Line | Without UT | With UT |
---|---|---|
Hlt1TrackMVA | 198.35 +/- 1.50 | 121.40 +/- 1.17 |
Hlt1TwoTrackMVA | 287.06 +/- 1.80 | 184.26 +/- 1.44 |
--- Throughput With/Without MR ---
Note: The throughputs below, with and without the optimizations in this MR, were measured on the `matching_with_ut_clustering_opt_and_downstream` branch (!1663 (merged) + !1509 (closed) + !1198 (merged)):
- Matching no UT: 108.7 kHz
- Matching with UT: 94.6 kHz
- Matching with UT + optimized UT decoding: 98.8 kHz
Unfortunately, we do not see such a large gain in the ci-test of this MR so far; it is usually in the 1-3% ballpark (albeit with `hlt1_pp_default`, which is a different sequence). This MR reduces register pressure in the UT decoding CUDA kernels, which increases occupancy, so I suspect the optimizations in !1509 (closed) and this MR give a larger speedup when put together than when measured individually.
--- Copied from !1621 (closed) ---
UT decoding optimization:
- Removed UTBoards since we only need a sourceID -> geometry table map.
- Zero suppressed empty UT raw banks (no longer waste idle threads on them).
- Added warp atomics for UT decoding. Reusable by other parts of Allen.
- Binned UT raw banks so we decode large ones first, smaller ones later. This would equalize UT decoding workload while clustering.
- Reduced register usage (55 -> 53) in `UTClusterAndPreDecode` by using 16-bit integers instead of 32-bit where possible.
- Very tiny (0.1 kHz) improvement from removing a loop over UT layers in `UTDecodeInOrder`, increasing occupancy of that kernel by 30%.
- Opportunistic looping over UT raw bank lanes. The binning of UT lanes by size means the first warp will have more work to do. Rather than looping over UT lanes in a deterministic order, opportunistic looping means that once all the threads in a warp have finished decoding 32 UT lanes (raw banks), the warp directly does an atomicAdd to get the next 32 lanes to decode. This should equalize the workload across the warps within a thread block.
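The binning and opportunistic looping described above can be sketched on the host side as follows. This is a CPU analogue under stated assumptions, not the CUDA kernel from this MR: `std::atomic::fetch_add` stands in for `atomicAdd`, one call to `decode_opportunistic` stands in for one warp, and all names are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr unsigned kChunk = 32; // lanes claimed per grab, mirroring a warp's width

// Return lane indices ordered so that the largest raw banks come first.
inline std::vector<unsigned> order_by_size_desc(const std::vector<uint32_t>& lane_sizes) {
  std::vector<unsigned> order(lane_sizes.size());
  std::iota(order.begin(), order.end(), 0u);
  std::stable_sort(order.begin(), order.end(),
                   [&](unsigned a, unsigned b) { return lane_sizes[a] > lane_sizes[b]; });
  return order;
}

// One worker (standing in for a warp): keep claiming the next chunk of lanes
// from a shared counter until none remain. Returns the amount of work done.
inline uint64_t decode_opportunistic(std::atomic<unsigned>& next_lane,
                                     const std::vector<unsigned>& order,
                                     const std::vector<uint32_t>& lane_sizes,
                                     std::vector<uint8_t>& decoded) {
  const unsigned n = static_cast<unsigned>(order.size());
  uint64_t work = 0;
  for (;;) {
    // Claim the next chunk; no two workers can receive the same range.
    const unsigned begin = next_lane.fetch_add(kChunk, std::memory_order_relaxed);
    if (begin >= n) break;
    const unsigned end = std::min(begin + kChunk, n);
    for (unsigned i = begin; i < end; ++i) {
      work += lane_sizes[order[i]]; // placeholder for the real decoding
      ++decoded[order[i]];          // count how often each lane was decoded
    }
  }
  return work;
}
```

Each worker calls `decode_opportunistic` with the same `next_lane` counter. A worker that draws small banks finishes its chunk early and immediately claims another, instead of idling while a worker with large banks is still busy.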
Throughput running the `trackmatching_veloscifi_and_utdecoding` sequence on an `A5000`; each row includes the changes/commits of the rows preceding it:
Change | Throughput (events/s) |
---|---|
2024-patches | 172169 |
2024-patches + !1509 (closed) | 188003 |
!1509 (closed) + !1444 (merged) | 184450 |
Removed UTBoards | 187189 |
Zero suppression | 188366 |
Warp atomics | 188801 |
Binned by size | 190724 |
Reduced register usage in clustering + removed UTDecodeInOrder layer loop | 191284 |
Opportunistic looping in UT clustering | 195834 |