
Draft: Optimised UT Decoding on top of Arthur's General Optimization

Da Yu Tou requested to merge dtou_optimise_ut_tracking into 2024-patches

Built on top of !1509 and !1444, with further improvements to UT decoding. Some parts of the code are architecture-dependent (e.g. the 8-element prefix sum, warp atomics and opportunistic looping), but these amount to only about 10-20 lines of code.

Changes:

  1. Resolved conflicting changes in UT decoding by !1509 and !1444.
  2. Removed UTBoards since we only need a sourceID -> geometry table map.
  3. Zero-suppressed empty UT raw banks, so threads are no longer wasted idling on them.
  4. Added warp atomics for UT decoding. These are reusable by other parts of Allen.
  5. Binned UT raw banks by size so that large ones are decoded first and smaller ones later. This should equalize the UT decoding workload during clustering.
  6. Reduced register usage (55 -> 53) in UTClusterAndPreDecode by using 16-bit integers instead of 32-bit ones where possible.
  7. Very small (0.1 kHz) improvement from removing a loop over UT layers in UTDecodeInOrder, which increases the occupancy of that kernel by 30%.
  8. Opportunistic looping over UT raw bank lanes. Because UT lanes are binned by size, the first warp has more work to do than the others. Instead of looping over UT lanes in a deterministic order, each warp does an atomicAdd to claim the next 32 lanes as soon as all of its threads have finished decoding their current 32 UT lanes (raw banks). This should equalize the workload across the warps within a thread block.
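
Items 3, 5 and 8 can be illustrated with a minimal host-side C++ sketch. It is not the code in this merge request: the names `order_banks_by_size`, `claim_chunk`, `decode_worker` and `kChunk` are hypothetical, a `std::atomic` counter stands in for the device-side atomicAdd, and a chunk of 32 entries stands in for the 32 lanes handled by one warp.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

// Assumption: one chunk corresponds to the 32 raw banks a warp decodes at once.
constexpr unsigned kChunk = 32;

// Hypothetical helper: returns the indices of non-empty banks, largest first.
// Dropping zero-sized banks models the zero suppression, the sort models the binning.
std::vector<unsigned> order_banks_by_size(const std::vector<std::uint16_t>& bank_sizes)
{
  std::vector<unsigned> order;
  for (unsigned i = 0; i < bank_sizes.size(); ++i) {
    if (bank_sizes[i] != 0) order.push_back(i);
  }
  std::stable_sort(order.begin(), order.end(), [&](unsigned a, unsigned b) {
    return bank_sizes[a] > bank_sizes[b];
  });
  return order;
}

// Half-open range [begin, end) of positions in the ordered bank list.
struct Chunk {
  unsigned begin;
  unsigned end;
};

// Claims the next chunk from a shared counter. The returned range is empty
// (begin == end) once all `total` banks have been handed out.
Chunk claim_chunk(std::atomic<unsigned>& next, unsigned total)
{
  const unsigned begin = std::min(next.fetch_add(kChunk), total);
  const unsigned end = std::min(begin + kChunk, total);
  return {begin, end};
}

// Worker loop standing in for one warp: it keeps claiming chunks until none are
// left, so a worker that finishes early simply picks up more work.
// Incrementing `visits` is a placeholder for the actual bank decoding.
unsigned decode_worker(const std::vector<unsigned>& order,
                       std::atomic<unsigned>& next,
                       std::vector<unsigned>& visits)
{
  const unsigned total = static_cast<unsigned>(order.size());
  unsigned n_decoded = 0;
  for (Chunk c = claim_chunk(next, total); c.begin != c.end; c = claim_chunk(next, total)) {
    for (unsigned i = c.begin; i < c.end; ++i) {
      ++visits[order[i]];
      ++n_decoded;
    }
  }
  return n_decoded;
}
```

With bank sizes `{4, 0, 9, 1, 0, 7}`, the helper yields the order `{2, 5, 0, 3}`: the two empty banks are never visited, and the largest bank is decoded first. In a CUDA kernel the equivalent of `claim_chunk` would plausibly be a single lane calling atomicAdd on a shared counter and broadcasting the result to the rest of the warp, but that part is an assumption about the implementation rather than something shown here.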

Throughput of the trackmatching_veloscifi_and_utdecoding sequence on an A5000. Each row includes the changes of all the rows above it:

| Change | Throughput (events/s) |
|---|---|
| 2024-patches | 172169 |
| 2024-patches + !1509 | 188003 |
| !1509 + !1444 | 184450 |
| Removed UTBoards | 187189 |
| Zero suppression | 188366 |
| Warp atomics | 188801 |
| Binned by size | 190724 |
| Reduced register usage in clustering + removed UTDecodeInOrder layer loop | 191284 |
| Opportunistic looping in UT clustering | 195834 |
Edited by Da Yu Tou
