One idea for a faster UT decoder is as follows. Each step is a separate CUDA kernel:
Calculate number of hits in each sector group (as is done now).
Prefix sum over these (as is done now).
Iterate over the raw banks and their hits, storing for each hit only its Y coordinate and a uint32_t that encodes the raw bank number and the hit id inside that raw bank. Let's refer to this array as raw_bank_hits.
Calculate the permutation that sorts the hits by Y within each sector group.
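
The steps so far can be sketched as a sequential CPU reference in C++. This is not the CUDA implementation, only a way to pin down what each kernel would need to produce. All names here (RawHit, RawBankHits, pack_hit_ref, build_raw_bank_hits) are hypothetical, and the 16/16 bit split of the uint32_t is an assumption; the real split depends on the maximum number of raw banks and hits per bank.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for what the decoder sees per hit while walking a raw bank.
struct RawHit {
  uint32_t sector_group;
  float y;
};

// Assumed packing: upper 16 bits = raw bank number, lower 16 bits = hit id in the bank.
inline uint32_t pack_hit_ref(uint32_t raw_bank, uint32_t hit_in_bank) {
  assert(raw_bank < (1u << 16) && hit_in_bank < (1u << 16));
  return (raw_bank << 16) | hit_in_bank;
}
inline uint32_t raw_bank_of(uint32_t packed) { return packed >> 16; }
inline uint32_t hit_in_bank_of(uint32_t packed) { return packed & 0xFFFFu; }

// Structure-of-arrays view of raw_bank_hits plus the sort permutation.
struct RawBankHits {
  std::vector<uint32_t> offsets;      // size n_groups + 1, exclusive prefix sum of counts
  std::vector<float> y;               // Y coordinate per hit, grouped by sector group
  std::vector<uint32_t> ref;          // packed (raw bank, hit id) per hit
  std::vector<uint32_t> permutation;  // permutation[i] = index of the hit at sorted slot i
};

RawBankHits build_raw_bank_hits(const std::vector<std::vector<RawHit>>& banks,
                                uint32_t n_groups) {
  RawBankHits out;

  // Step 1: count the hits in each sector group.
  std::vector<uint32_t> counts(n_groups, 0);
  for (const auto& bank : banks) {
    for (const auto& hit : bank) {
      assert(hit.sector_group < n_groups);
      ++counts[hit.sector_group];
    }
  }

  // Step 2: exclusive prefix sum gives the start offset of each sector group.
  out.offsets.assign(n_groups + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), out.offsets.begin() + 1);

  // Step 3: store only Y and the packed reference, grouped by sector group.
  const uint32_t total = out.offsets.back();
  out.y.resize(total);
  out.ref.resize(total);
  std::vector<uint32_t> cursor(out.offsets.begin(), out.offsets.end() - 1);
  for (uint32_t b = 0; b < banks.size(); ++b) {
    for (uint32_t h = 0; h < banks[b].size(); ++h) {
      const uint32_t slot = cursor[banks[b][h].sector_group]++;
      out.y[slot] = banks[b][h].y;
      out.ref[slot] = pack_hit_ref(b, h);
    }
  }

  // Step 4: per sector group, compute the permutation that sorts by Y.
  out.permutation.resize(total);
  std::iota(out.permutation.begin(), out.permutation.end(), 0u);
  for (uint32_t g = 0; g < n_groups; ++g) {
    std::stable_sort(out.permutation.begin() + out.offsets[g],
                     out.permutation.begin() + out.offsets[g + 1],
                     [&out](uint32_t lhs, uint32_t rhs) { return out.y[lhs] < out.y[rhs]; });
  }
  return out;
}
```

In this sketch the permutation holds global indices into raw_bank_hits, so a later step can look up ref[permutation[i]] and decode the full hit straight from the raw bank into its sorted position. On the GPU, step 3 would need an atomic increment (or a deterministic per-bank offset) in place of the sequential cursor, and step 4 would be a per-group parallel sort.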