Optimised UT Decoding on top of Arthur's General Optimization
This MR is closed by !1663 (merged)
The optimizations in this MR have been ported to !1663 (merged) so that they are independent of !1509 (closed).
Built on top of !1509 (closed) and !1444 (merged), with additional improvements to UT decoding. Some parts of the code are architecture-dependent (e.g. the 8-element prefix sum, warp atomics and opportunistic looping), but these amount to only about 10-20 lines of code.
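For readers unfamiliar with the term, the following is a minimal host-side C++ sketch of what a warp atomic could look like, assuming "warp atomics" here means warp-aggregated atomics: one lane adds the number of active lanes to a shared counter once, and every active lane derives its own output slot from its rank in the active mask. The names `WarpSlots`, `popcount32` and `warp_aggregated_add` are hypothetical and are not taken from Allen. On a GPU the same idea is typically built from intrinsics such as `__ballot_sync`, `__popc` and `__shfl_sync` around a single `atomicAdd`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a warp-aggregated atomic add (not Allen code).
struct WarpSlots {
  uint32_t base;               // counter value before the add
  std::vector<uint32_t> slot;  // slot[lane] is meaningful only for active lanes
};

// Count set bits by clearing the lowest set bit each iteration.
inline uint32_t popcount32(uint32_t x) {
  uint32_t n = 0;
  for (; x != 0; x &= x - 1) ++n;
  return n;
}

// active_mask has bit i set if lane i wants exactly one output slot.
WarpSlots warp_aggregated_add(uint32_t& counter, uint32_t active_mask) {
  WarpSlots out{counter, std::vector<uint32_t>(32)};
  // One add for the whole warp (the job of the leader lane on a GPU).
  counter += popcount32(active_mask);
  for (uint32_t lane = 0; lane < 32; ++lane) {
    // Rank of this lane = number of active lanes with a lower index.
    const uint32_t lower = active_mask & ((1u << lane) - 1u);
    out.slot[lane] = out.base + popcount32(lower);
  }
  return out;
}
```

With a counter at 10 and lanes 0, 1 and 3 active, the counter advances to 13 with a single add and the three lanes receive slots 10, 11 and 12.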
Changes:
- Resolved conflicting changes in UT decoding by !1509 (closed) and !1444 (merged).
- Removed UTBoards since we only need a sourceID -> geometry table map.
- Zero-suppressed empty UT raw banks (idle threads are no longer wasted on them).
- Added warp atomics for UT decoding. Reusable by other parts of Allen.
- Binned UT raw banks by size so that large ones are decoded first and smaller ones later. This should equalize the UT decoding workload during clustering.
- Reduced register usage (55 -> 53) in UTClusterAndPreDecode by using 16-bit integers instead of 32-bit where possible.
- Very small (0.1 kHz) throughput improvement from removing a loop over UT layers in UTDecodeInOrder, increasing occupancy of that kernel by 30%.
- Opportunistic looping over UT raw bank lanes. The binning of UT lanes by size means the first warp will have more work to do. Rather than looping over UT lanes deterministically, opportunistic looping means that once all the threads in a warp have finished decoding their 32 UT lanes (raw banks), the warp directly does an atomicAdd to get the next 32 lanes to decode. This should equalize the workload across the warps within a thread block.
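The effect of opportunistic looping on size-binned lanes can be illustrated with a small host-side C++ model. This is a sketch under stated assumptions, not the kernel code: the function names are hypothetical, the binning is approximated by a full descending sort, and a chunk of 32 lanes is assumed to cost as much as its largest bank because a warp runs in lockstep. The variable `next_chunk` stands in for the shared counter that the warps would advance with atomicAdd.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // lanes (raw banks) claimed per step

// Assumed cost model: one cost per chunk of 32 lanes, equal to the largest
// bank in the chunk. Sorting descending stands in for the size binning.
std::vector<uint32_t> chunk_costs(std::vector<uint32_t> bank_sizes) {
  std::sort(bank_sizes.begin(), bank_sizes.end(), std::greater<uint32_t>());
  std::vector<uint32_t> costs;
  for (std::size_t first = 0; first < bank_sizes.size(); first += kWarpSize) {
    costs.push_back(bank_sizes[first]);  // first element is the chunk maximum
  }
  return costs;
}

// Deterministic looping: warp w handles chunks w, w + n_warps, w + 2 * n_warps, ...
// Requires n_warps >= 1.
uint32_t finish_time_static(const std::vector<uint32_t>& costs, std::size_t n_warps) {
  std::vector<uint32_t> busy_until(n_warps, 0);
  for (std::size_t c = 0; c < costs.size(); ++c) busy_until[c % n_warps] += costs[c];
  return *std::max_element(busy_until.begin(), busy_until.end());
}

// Opportunistic looping: whichever warp becomes free first claims the next chunk.
// Requires n_warps >= 1.
uint32_t finish_time_opportunistic(const std::vector<uint32_t>& costs, std::size_t n_warps) {
  std::vector<uint32_t> busy_until(n_warps, 0);
  std::size_t next_chunk = 0;  // models the counter advanced by atomicAdd
  while (next_chunk < costs.size()) {
    auto free_warp = std::min_element(busy_until.begin(), busy_until.end());
    *free_warp += costs[next_chunk++];
  }
  return *std::max_element(busy_until.begin(), busy_until.end());
}
```

For 128 banks where the 32 largest have size 8 and the rest size 1, with two warps, this model gives a finish time of 9 for the deterministic schedule and 8 for the opportunistic one: the warp that finishes its small chunk early picks up the remaining chunks instead of leaving them to the warp that is still busy with the large one.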
Throughput when running the trackmatching_veloscifi_and_utdecoding sequence on an A5000. Each row also includes the changes/commits of the rows above it:
Change | Throughput (events/s) |
---|---|
2024-patches | 172169 |
2024-patches + !1509 (closed) | 188003 |
!1509 (closed) + !1444 (merged) | 184450 |
Removed UTBoards | 187189 |
Zero suppression | 188366 |
Warp atomics | 188801 |
Binned by size | 190724 |
Reduced register usage in clustering + removed UTDecodeInOrder layer loop | 191284 |
Opportunistic looping in UT clustering | 195834 |
Edited by Da Yu Tou