Decode Retina into a sorted container
This MR decodes the Retina clusters into a sorted container right away, reducing the memory pressure of the Velo chain of algorithms.
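As a rough illustration of the idea (not the code in this MR), the sketch below decodes raw words straight into their sorted slots, so no intermediate unsorted cluster container is written and later re-read. The `Cluster` layout, the bit packing in `decode_word`, and the choice of `x` as the sort key are all assumptions made for the example; the actual implementation runs on the GPU and may differ in every one of these respects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical decoded cluster; the real Velo cluster layout is not described here.
struct Cluster {
  float x;
  float y;
  std::uint32_t id;
};

// Hypothetical bit packing: low 16 bits hold x, high 16 bits hold y.
inline Cluster decode_word(std::uint32_t word) {
  return Cluster{static_cast<float>(word & 0xFFFFu),
                 static_cast<float>(word >> 16),
                 word};
}

// Decode raw words directly into a container sorted by x (assumed sort key).
std::vector<Cluster> decode_sorted(const std::vector<std::uint32_t>& raw) {
  // Step 1: compute the permutation that orders the raw words by their key.
  // Only indices are moved here, not decoded clusters.
  std::vector<std::size_t> order(raw.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return (raw[a] & 0xFFFFu) < (raw[b] & 0xFFFFu);
  });

  // Step 2: decode each word once, writing it into its final sorted slot.
  std::vector<Cluster> sorted(raw.size());
  for (std::size_t slot = 0; slot < order.size(); ++slot) {
    sorted[slot] = decode_word(raw[order[slot]]);
  }

  // Debug-only postcondition: the output is ordered by the sort key.
  assert(std::is_sorted(sorted.begin(), sorted.end(),
                        [](const Cluster& a, const Cluster& b) { return a.x < b.x; }));
  return sorted;
}
```

In this sketch each cluster is written exactly once, which is the kind of saving in memory traffic that the description attributes to decoding into a sorted container.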
A performance increase of 32% is observed in the Velo subsequence. On Ampere, the sequence as a whole gets about 4-5% faster:
NVIDIA GeForce RTX 3090    │████████████████████████████████████████ 203.49 kHz (1.03x)
NVIDIA RTX A6000           │███████████████████████████████████████ 196.44 kHz (1.02x)
NVIDIA RTX A5000           │████████████████████████████████████ 180.01 kHz (1.02x)
NVIDIA GeForce RTX 2080 Ti │███████████████████████████ 139.05 kHz (1.04x)
                           │█████████████████ 85.36 kHz (1.05x)
AMD EPYC 7502 32-Core      │████ 20.45 kHz (1.07x)
                           ┼────┴────┼────┴────┼────┴────┼────┴────┼────┴────┼ (1.05x)
                           0         50        100       150       200       250 (1.05x)
Requires !748 (merged)
Edited by Daniel Hugo Campora Perez