Optimized UT decoding, better coverage of UT hits test and fix to UT unique x sector (!1663) · Merge requests · LHCb / Allen

Da Yu Tou requested to merge dtou_optimise_ut_decoding into 2024-patches Jun 06, 2024

Changes:

Optimized UT decoding, ported from !1621 (closed) so that it is independent of !1509 (closed) (optimization table copied below). This MR should also make the merge conflicts in !1509 (closed) easier to resolve.
UT hits are now sorted with a unique key and deterministic tiebreak.
Improved UT hit test coverage. The test comparing HLT1 and HLT2 UT hits is now more stringent: it requires 100% overlap in LHCbID and no duplicated UT hit LHCbIDs.
Increased the x tolerance used when creating the unique UT sector group x. This recovers the efficiency of finding UT hits in the matching (see below). Also, the unique sector group x is now the smallest x value of the sector group, which should help the binary search.
Other small changes include code cleanup and refactoring.
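
A deterministic hit ordering of this kind can be sketched in plain C++ as follows. This is only an illustration, not the Allen implementation: the `Hit` struct, its fields `sort_key` and `lhcb_id`, and the choice of the LHCbID as the tiebreak are all assumptions made for the example. The helper `has_duplicate_ids` mirrors the "no duplicated LHCbID" requirement of the test in the same hedged way.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical hit record; field names are illustrative, not Allen's.
struct Hit {
  std::uint32_t sort_key; // primary ordering key (assumed)
  std::uint32_t lhcb_id;  // unique per hit, used here as the tiebreak
};

// Compare by primary key first, then by LHCbID. Hits with equal primary
// keys therefore end up in the same order whatever the input order was.
inline bool hit_less(const Hit& a, const Hit& b) {
  return std::tie(a.sort_key, a.lhcb_id) < std::tie(b.sort_key, b.lhcb_id);
}

inline void sort_hits(std::vector<Hit>& hits) {
  std::sort(hits.begin(), hits.end(), hit_less);
}

// True if at least two hits share the same LHCbID.
inline bool has_duplicate_ids(const std::vector<Hit>& hits) {
  std::vector<std::uint32_t> ids;
  ids.reserve(hits.size());
  for (const Hit& h : hits) ids.push_back(h.lhcb_id);
  std::sort(ids.begin(), ids.end());
  return std::adjacent_find(ids.begin(), ids.end()) != ids.end();
}
```

Because the tiebreak is unique per hit, two runs that produce the same set of hits in a different order give an identical sorted array, which is what makes an element-by-element comparison meaningful.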

More elaborate details below.

FYI: @dovombru @mveghel @jzhuo @hawu @ldufour @cagapopo

--- HLT1 Efficiencies With/Without UT ---

The following plots were made using the matching_with_ut_clustering_opt_and_downstream branch with the change in unique sector group x from this MR. This should be tested again once matching with UT is merged to 2024-patches. The only difference between the orange and blue histograms is whether HLT1 uses UT hits. The input MEP file and HLT2 sequence are exactly the same.

Before, initially shown during LHCb Week.

After tuning x tolerances of unique sector group, we recover a lot of charm mesons.
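
The role of the x tolerance can be illustrated with a small C++ sketch. Nearly everything in it is an assumption made for illustration: the function names, the use of `float` values, and the grouping rule (a sector joins the current group if its x lies within `tolerance` of the group's smallest x). Only two points are taken from the description in this MR: each group is represented by its smallest x, and the resulting sorted array is meant to be binary-searched.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Collapse sector x positions into unique group values.
// Assumed rule: a sector belongs to the current group when its x is within
// `tolerance` of the group's smallest x; each group keeps that smallest x.
std::vector<float> unique_sector_group_x(std::vector<float> sector_x, float tolerance) {
  std::sort(sector_x.begin(), sector_x.end());
  std::vector<float> groups;
  for (float x : sector_x) {
    if (groups.empty() || x - groups.back() > tolerance) {
      groups.push_back(x); // x starts a new group and is its smallest value
    }
  }
  return groups;
}

// Binary search for the last group whose smallest x does not exceed x.
// Returns groups.size() if x lies below every group.
std::size_t find_group(const std::vector<float>& groups, float x) {
  auto it = std::upper_bound(groups.begin(), groups.end(), x);
  if (it == groups.begin()) return groups.size();
  return static_cast<std::size_t>(it - groups.begin()) - 1;
}
```

In this toy model a smaller tolerance yields more, narrower groups, and storing the smallest x lets a single `std::upper_bound` call locate the group without inspecting neighbours on both sides.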

TrackMVA trigger rate on Run 295336 (mu=3).

Line	Without UT	With UT
Hlt1TrackMVA	198.35 +/- 1.50	121.40 +/- 1.17
Hlt1TwoTrackMVA	287.06 +/- 1.80	184.26 +/- 1.44

--- Throughput With/Without MR ---

Note: The throughputs reported below, with and without the optimizations in this MR, were measured on the matching_with_ut_clustering_opt_and_downstream branch (!1663 (merged) + !1509 (closed) + !1198 (merged)):

Matching no UT: 108.7 kHz
Matching with UT: 94.6 kHz
Matching with UT + optimized UT decoding: 98.8 kHz

Unfortunately, we have not seen such a large gain in the ci-test of this MR so far; it is usually in the 1-3% ballpark (albeit with hlt1_pp_default, which is a different sequence). This MR reduces register pressure in the UT decoding CUDA kernels, which increases occupancy, so I suspect the optimizations in !1509 (closed) and this MR give a larger speedup when combined than when measured individually.

--- Copied from !1621 (closed) ---

UT decoding optimization:

Removed UTBoards since we only need a sourceID -> geometry table map.
Zero-suppressed empty UT raw banks (threads no longer sit idle on them).
Added warp atomics for UT decoding. Reusable by other parts of Allen.
Binned UT raw banks so we decode large ones first, smaller ones later. This equalizes the UT decoding workload during clustering.
Reduced register usage (55 -> 53) in UTClusterAndPreDecode by using 16-bit integers instead of 32-bit where possible.
Very tiny (0.1 kHz) improvement from removing a loop over UT layers in UTDecodeInOrder, increasing occupancy of that kernel by 30%.
Opportunistic looping over UT raw bank lanes. Binning the UT lanes by size means the first warp has more work to do. Instead of looping over UT lanes in a fixed, deterministic order, opportunistic looping means that once all the threads in a warp have finished decoding their 32 UT lanes (raw banks), the warp directly does an atomicAdd to get the next 32 lanes to decode. This should equalize the workload across the warps within a thread block.
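
The combination of size ordering and opportunistic looping can be sketched on the host in standard C++. This is not the CUDA kernel from the MR: `std::atomic::fetch_add` stands in for `atomicAdd`, a "worker" stands in for a warp, a full sort stands in for the binning, and all function names (`order_by_size_desc`, `claim_chunk`, `decode_until_empty`) are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t chunk_size = 32; // one lane per thread of a 32-wide warp

// Stand-in for binning: order lane indices so the largest lanes come first.
std::vector<std::size_t> order_by_size_desc(const std::vector<std::size_t>& lane_sizes) {
  std::vector<std::size_t> order(lane_sizes.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return lane_sizes[a] > lane_sizes[b]; });
  return order;
}

// Opportunistic claim: atomically reserve the next chunk of lanes.
// Returns the first index of the chunk, or n_lanes when nothing is left.
std::size_t claim_chunk(std::atomic<std::size_t>& next_lane, std::size_t n_lanes) {
  const std::size_t begin = next_lane.fetch_add(chunk_size);
  return std::min(begin, n_lanes);
}

// Work loop for one worker (standing in for one warp). It may be run from
// several threads at once, since each chunk is handed out exactly once.
// Returns the total lane size this worker "decoded".
std::size_t decode_until_empty(std::atomic<std::size_t>& next_lane,
                               const std::vector<std::size_t>& order,
                               const std::vector<std::size_t>& lane_sizes) {
  std::size_t work_done = 0;
  for (std::size_t begin = claim_chunk(next_lane, order.size()); begin < order.size();
       begin = claim_chunk(next_lane, order.size())) {
    const std::size_t end = std::min(begin + chunk_size, order.size());
    for (std::size_t i = begin; i < end; ++i) {
      work_done += lane_sizes[order[i]]; // placeholder for decoding one lane
    }
  }
  return work_done;
}
```

With a fixed assignment, the worker that receives the first (largest) chunk would finish last. Here a worker that finishes a cheap chunk early simply claims the next one from the shared counter, so the load evens out without any static schedule.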

Throughput running the trackmatching_veloscifi_and_utdecoding sequence on an A5000; each row includes the changes/commits of the rows preceding it:

Change	Throughput (events/s)
2024-patches	172169
2024-patches + !1509 (closed)	188003
!1509 (closed) + !1444 (merged)	184450
Removed UTBoards	187189
Zero suppression	188366
Warp atomics	188801
Binned by size	190724
Reduced register usage in clustering + removed UTDecodeInOrder layer loop	191284
Opportunistic looping in UT clustering	195834

Edited Jun 09, 2024 by Da Yu Tou
