Optimised UT Decoding on top of Arthur's General Optimization (!1621) · Merge requests · LHCb / Allen

Da Yu Tou requested to merge dtou_optimise_ut_tracking into 2024-patches May 07, 2024

This MR is closed by !1663 (merged)

The optimizations in this MR have been ported to !1663 (merged) so that they are independent of !1509 (closed).

Made on top of !1509 (closed) and !1444 (merged), with added improvements to UT decoding. Some parts of the code are architecture-dependent (e.g. the 8-element prefix sum, warp atomics and opportunistic looping), but these amount to only about 10-20 lines of code.

Changes:

Resolved conflicting changes in UT decoding by !1509 (closed) and !1444 (merged).
Removed UTBoards since we only need a sourceID -> geometry table map.
Zero-suppressed empty UT raw banks, so threads no longer sit idle on them.
Added warp atomics for UT decoding. These are reusable by other parts of Allen.
Binned UT raw banks by size so that large ones are decoded first and smaller ones later. This should equalize the UT decoding workload during clustering.
Reduced register usage (55 -> 53) in UTClusterAndPreDecode by using 16-bit integers instead of 32-bit where possible.
Very small (0.1 kHz) improvement from removing a loop over UT layers in UTDecodeInOrder, which increases the occupancy of that kernel by 30%.
Opportunistic looping over UT raw bank lanes. Binning the UT lanes by size means the first warp has more work to do. Rather than looping deterministically over UT lanes, opportunistic looping means that once all the threads in a warp have finished decoding their 32 UT lanes (raw banks), the warp does an atomicAdd to claim the next 32 lanes to decode. This should equalize the workload across the warps within a thread block.
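
The difference between deterministic and opportunistic looping can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the kernel code from this MR: all function names are hypothetical, a full sort stands in for the binning step, a chunk is assumed to cost as much as its largest bank (lanes of a warp run in lockstep), and a `std::atomic` counter plays the role of the `atomicAdd` on the GPU.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Stand-in for the binning step: order bank sizes largest first.
// (A full sort is an assumption; the MR only says banks are binned by size.)
std::vector<unsigned> largest_first(std::vector<unsigned> sizes) {
  std::sort(sizes.begin(), sizes.end(), std::greater<unsigned>());
  return sizes;
}

// Assumed cost model: a chunk of `lanes` banks takes as long as its largest bank.
unsigned chunk_cost(const std::vector<unsigned>& sizes, std::size_t chunk, std::size_t lanes) {
  const std::size_t begin = chunk * lanes;
  const std::size_t end = std::min(begin + lanes, sizes.size());
  return *std::max_element(sizes.begin() + begin, sizes.begin() + end);
}

// Deterministic looping: warp w handles chunks w, w + n_warps, w + 2 * n_warps, ...
// Returns the time at which the slowest warp finishes.
unsigned deterministic_makespan(const std::vector<unsigned>& sizes, std::size_t n_warps, std::size_t lanes) {
  assert(n_warps > 0 && lanes > 0);
  const std::size_t chunks = (sizes.size() + lanes - 1) / lanes;
  unsigned makespan = 0;
  for (std::size_t w = 0; w < n_warps; ++w) {
    unsigned busy = 0;
    for (std::size_t c = w; c < chunks; c += n_warps) busy += chunk_cost(sizes, c, lanes);
    makespan = std::max(makespan, busy);
  }
  return makespan;
}

// Opportunistic looping: whichever warp becomes free first claims the next chunk
// by incrementing a shared counter. The simulation is single-threaded, so the
// atomic only illustrates where the GPU kernel would use atomicAdd.
unsigned opportunistic_makespan(const std::vector<unsigned>& sizes, std::size_t n_warps, std::size_t lanes) {
  assert(n_warps > 0 && lanes > 0);
  const std::size_t chunks = (sizes.size() + lanes - 1) / lanes;
  std::atomic<std::size_t> next_chunk{0};
  using Warp = std::pair<unsigned, std::size_t>;  // (time at which the warp is free, warp id)
  std::priority_queue<Warp, std::vector<Warp>, std::greater<Warp>> free_at;
  for (std::size_t w = 0; w < n_warps; ++w) free_at.push({0u, w});
  unsigned makespan = 0;
  while (!free_at.empty()) {
    const Warp warp = free_at.top();
    free_at.pop();
    const std::size_t c = next_chunk.fetch_add(1);
    if (c >= chunks) {  // no work left: this warp retires
      makespan = std::max(makespan, warp.first);
      continue;
    }
    free_at.push({warp.first + chunk_cost(sizes, c, lanes), warp.second});
  }
  return makespan;
}
```

For example, with bank sizes {8, 2, 2, 2, 2, 2}, two warps and one lane per chunk, the deterministic schedule finishes after 12 units (the first warp gets 8 + 2 + 2), while the opportunistic schedule finishes after 10 units, because the second warp picks up most of the small chunks while the first is still busy with the large one.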

Throughput running the trackmatching_veloscifi_and_utdecoding sequence on an A5000. The rows are cumulative: each row includes the changes of the rows above it:

Change	Throughput (events/s)
2024-patches	172169
2024-patches + !1509 (closed)	188003
!1509 (closed) + !1444 (merged)	184450
Removed UTBoards	187189
Zero suppression	188366
Warp atomics	188801
Binned by size	190724
Reduced register usage in clustering + removed `UTDecodeInOrder` layer loop	191284
Opportunistic looping in UT clustering	195834
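
To put the numbers above in relative terms, the gain between any two rows can be computed as below. This is a small helper sketch with hypothetical names (`Row`, `rows`, `relative_gain_percent`); the throughput values are copied from the table.

```cpp
#include <cassert>
#include <cstddef>

// One row of the throughput table (events/s).
struct Row {
  const char* change;
  double throughput;
};

// Values copied from the table above, in the same order.
const Row rows[] = {
  {"2024-patches", 172169},
  {"2024-patches + !1509", 188003},
  {"!1509 + !1444", 184450},
  {"Removed UTBoards", 187189},
  {"Zero suppression", 188366},
  {"Warp atomics", 188801},
  {"Binned by size", 190724},
  {"Reduced register usage + removed UTDecodeInOrder layer loop", 191284},
  {"Opportunistic looping in UT clustering", 195834},
};
const std::size_t n_rows = sizeof(rows) / sizeof(rows[0]);

// Relative gain in percent of row `to` over row `from`.
double relative_gain_percent(std::size_t from, std::size_t to) {
  assert(from < n_rows && to < n_rows);
  return 100.0 * (rows[to].throughput - rows[from].throughput) / rows[from].throughput;
}
```

Calling `relative_gain_percent(0, n_rows - 1)` gives the overall gain of the last row over plain 2024-patches, roughly 13.7%, and `relative_gain_percent(n_rows - 2, n_rows - 1)` isolates the opportunistic looping step, roughly 2.4%.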

Edited Jun 11, 2024 by Da Yu Tou
