Implement a faster UT decoder

One possible design for a faster UT decoder is as follows. Each item is a CUDA kernel:

Calculate number of hits in each sector group (as is done now).
Prefix sum over these (as is done now).
Iterate over raw banks / hits and store only the Y coordinate and a uint32_t encoding the raw bank number and the hit ID inside the raw bank. Let's refer to the array of these uint32_t values as raw_bank_hits.
Calculate the permutation that sorts the hits by Y within each sector group.
Apply the permutation to Y and to raw_bank_hits.
Iterate over raw_bank_hits and decode hits in the order they will be stored. Instead of writing them out directly, stage them in shared memory, synchronize, and then write them out in a fashion that is efficient for SOA (see e.g. https://gitlab.cern.ch/lhcb-parallelization/cuda_hlt/blob/master/cuda/velo/consolidate_tracks/src/ConsolidateTracks.cu#L159).
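
To make the data flow of the first five steps concrete, here is a minimal CPU reference sketch in C++. It is not the CUDA implementation: each function only mirrors what the corresponding kernel would compute. The bit layout of the packed uint32_t (12 bits for the raw bank number, 20 bits for the hit ID) and all function names (pack_hit, group_offsets, sort_permutation_by_y, apply_permutation) are assumptions made for illustration, not taken from the codebase.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical bit layout (an assumption, not from the issue): the upper
// 12 bits hold the raw bank number, the lower 20 bits the hit ID in the bank.
constexpr std::uint32_t hit_bits = 20;
constexpr std::uint32_t hit_mask = (1u << hit_bits) - 1;

inline std::uint32_t pack_hit(std::uint32_t raw_bank, std::uint32_t hit_in_bank) {
  return (raw_bank << hit_bits) | (hit_in_bank & hit_mask);
}
inline std::uint32_t unpack_raw_bank(std::uint32_t packed) { return packed >> hit_bits; }
inline std::uint32_t unpack_hit(std::uint32_t packed) { return packed & hit_mask; }

// Exclusive prefix sum over the per-sector-group hit counts.
// offsets[g] is the first slot of group g; offsets.back() is the total.
std::vector<std::uint32_t> group_offsets(const std::vector<std::uint32_t>& counts) {
  std::vector<std::uint32_t> offsets(counts.size() + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
  return offsets;
}

// Permutation that sorts hits by Y inside each group [offsets[g], offsets[g+1]).
// perm[i] is the index of the unsorted hit that ends up in slot i.
std::vector<std::uint32_t> sort_permutation_by_y(
    const std::vector<float>& y, const std::vector<std::uint32_t>& offsets) {
  std::vector<std::uint32_t> perm(y.size());
  std::iota(perm.begin(), perm.end(), 0u);
  for (std::size_t g = 0; g + 1 < offsets.size(); ++g) {
    std::stable_sort(perm.begin() + offsets[g], perm.begin() + offsets[g + 1],
                     [&y](std::uint32_t a, std::uint32_t b) { return y[a] < y[b]; });
  }
  return perm;
}

// Gather: out[i] = in[perm[i]]. Used for both Y and raw_bank_hits.
template <typename T>
std::vector<T> apply_permutation(const std::vector<T>& in,
                                 const std::vector<std::uint32_t>& perm) {
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < perm.size(); ++i) out[i] = in[perm[i]];
  return out;
}
```

After apply_permutation, walking raw_bank_hits front to back and calling unpack_raw_bank / unpack_hit on each entry yields the hits in their final storage order, which is what the last kernel relies on. A stable sort is used here only so that hits with equal Y keep a deterministic order; whether the GPU version needs that property is an open design choice.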

After all this, all UT hits will be decoded and in the desired order.

mentioned in merge request !34 (merged)

added Doing label

assigned to @dcampora

closed via merge request !34 (merged)
