CaloRecGPU: Double Gaussian, Moments Finally Understood, New Changes, Kernel Size Optimization Service and ART Tests (!64838) · Merge requests · atlas / athena

Nuno Dos Santos Fernandes requested to merge dossantn/athena:moments-service-and-ART into main Aug 06, 2023

Major additions:

GPUKernelSizeOptimizerSvc: optimizes block sizes, possibly reading them from JSON files, defaulting to the CUDA occupancy estimators. Comes with the associated interfaces, IGPUKernelSizeOptimizerSvc (within Athena) and IGPUKernelSizeOptimizer (without any Athena includes, for direct use in CUDA files).
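The lookup order described above (tuned sizes first, an estimator as the default) can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `LaunchConfig`, `KernelSizeLookup`, `register_size` and `get` are hypothetical names, not the interfaces added by this merge request. In real code the fallback would presumably wrap a CUDA occupancy call such as `cudaOccupancyMaxPotentialBlockSize`, and the tuned entries would be parsed from the JSON files.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical launch configuration (illustrative names only).
struct LaunchConfig {
  int grid_size;
  int block_size;
};

// Hypothetical lookup: prefers explicitly registered (tuned) sizes and
// falls back to an estimator when a kernel has no tuned entry.
class KernelSizeLookup {
public:
  using Estimator = std::function<LaunchConfig(const std::string&)>;

  explicit KernelSizeLookup(Estimator fallback) : m_fallback(std::move(fallback)) {}

  // Assumption: in a real service these entries would come from a JSON file.
  void register_size(const std::string& kernel, LaunchConfig cfg) {
    m_tuned[kernel] = cfg;
  }

  // Returns the tuned size if present, otherwise the estimator's answer.
  LaunchConfig get(const std::string& kernel) const {
    const auto it = m_tuned.find(kernel);
    if (it != m_tuned.end()) {
      return it->second;
    }
    return m_fallback(kernel);
  }

private:
  std::map<std::string, LaunchConfig> m_tuned;
  Estimator m_fallback;
};
```

Keeping the estimator behind a `std::function` means the same lookup can be exercised on a machine without a GPU, which is one possible reason to separate an Athena-free interface from the service itself.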

Major changes:

Implemented double Gaussian noise. When it is used, clusters no longer match perfectly due to floating point accuracy, and there is no real way to fix this.
Rewrote kernels before moments calculation to use the memory allocated for the moments object as a temporary buffer
- Got rid of PairsArr in the API, it's now an implementation detail of cluster growing and splitting
- Other changes/improvements in the kernel code, especially in terms of clarity...
Fixed the moments calculation as much as possible, differences remain due to floating point accuracy and explicit cut-offs.
- Fixed inconsistent accuracy in CPU weighting of shared cells.
- Rewrote calculations to use some CUDA intrinsics for better efficiency and accuracy
- Started using floats for the matrix eigenvector/eigenvalue computation, as the extra accuracy of double was not enough to fix the differences we'd have anyway, so it does not make sense to hurt performance for (almost) no gain
Implemented the EM cross talk cell time cut.
Got rid of most instances of dynamic parallelism that were used to launch exactly the required number of threads, after testing showed that simply launching more threads was more performant.
Rewrote all the kernels to allow for arbitrary block and grid sizes, allowing potentially better resource utilization.
Added support for cooperative kernel launches to make iterative kernels more performant, falling back to the previous solution with dynamic parallelism (with tail launches if available) when cooperative launches are unavailable.
Kernel block and grid sizes are no longer static, but set according to the GPUKernelSizeOptimizerSvc.
Added a check to ART tests to ensure the average number of unmatched clusters and the average number of different cells are less than 0.1 (so the tests now actually do something)
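A kernel that accepts arbitrary block and grid sizes has to cover every element no matter how many threads are launched. One common way to achieve this is a grid-stride loop. Whether the kernels in this merge request use exactly this pattern is an assumption, and the sketch below is a host-side C++ emulation of the indexing rather than CUDA code. The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Emulates the work done by one "thread" in a grid-stride loop.
// In a CUDA kernel, thread_index would be blockIdx.x * blockDim.x + threadIdx.x
// and total_threads would be gridDim.x * blockDim.x.
inline void grid_stride_increment(std::vector<int>& data,
                                  std::size_t thread_index,
                                  std::size_t total_threads) {
  for (std::size_t i = thread_index; i < data.size(); i += total_threads) {
    data[i] += 1;
  }
}

// Emulates a launch by running every thread of every block sequentially.
inline void emulate_launch(std::vector<int>& data,
                           std::size_t grid_size,
                           std::size_t block_size) {
  const std::size_t total_threads = grid_size * block_size;
  for (std::size_t block = 0; block < grid_size; ++block) {
    for (std::size_t thread = 0; thread < block_size; ++thread) {
      grid_stride_increment(data, block * block_size + thread, total_threads);
    }
  }
}
```

With this indexing, each element is visited exactly once whether there are fewer threads than elements (each thread loops several times) or more threads than elements (the surplus threads do nothing), so the launch size can be tuned freely without changing the result.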

Minor changes:

Got rid of unnecessary legacy support in StandaloneDataIO.h (we never had results from previous versions we needed to compare to...).
Added some new matching options (namely the option to only match clusters that have exactly the same cells, useful for moments comparisons...)
Improved moments output histogram ranges for visibility
Encapsulated duplicate code in import/export tools via a lambda
Separated some of the CUDA-friendly structures/classes into new headers for better organization.
Reworked neighbour handling to store the total number of neighbours in the last bits of the 64-bit offset storage, making it a full prefix sum...
- Added more information (subcalo, sampling and region) to the cell information, as well as is_PS and is_HECIW_or_FCAL for limited neighbour handling
Added some CUDA error checking where missing
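The neighbour offset storage mentioned above can be pictured as an inclusive prefix sum packed into a single 64-bit word, so that the last field doubles as the total neighbour count. The layout below (8 categories of 8 bits each) and all names are assumptions made for illustration. The actual number of neighbour categories and bits per field in the merge request may differ.

```cpp
#include <array>
#include <cstdint>

// Assumed layout (illustrative only): 8 neighbour categories, 8 bits each.
// Field k holds the inclusive prefix sum of the counts of categories 0..k,
// so the total number of neighbours must stay below 256 in this sketch.
constexpr int kNumCategories = 8;
constexpr int kBitsPerField = 8;
constexpr std::uint64_t kFieldMask = (std::uint64_t{1} << kBitsPerField) - 1;

// Packs per-category neighbour counts into a 64-bit inclusive prefix sum.
inline std::uint64_t pack_offsets(const std::array<unsigned, kNumCategories>& counts) {
  std::uint64_t packed = 0;
  std::uint64_t running = 0;
  for (int k = 0; k < kNumCategories; ++k) {
    running += counts[k];
    packed |= (running & kFieldMask) << (k * kBitsPerField);
  }
  return packed;
}

// One past the last neighbour index of category k.
inline unsigned end_offset(std::uint64_t packed, int k) {
  return static_cast<unsigned>((packed >> (k * kBitsPerField)) & kFieldMask);
}

// First neighbour index of category k (the end of the previous category).
inline unsigned start_offset(std::uint64_t packed, int k) {
  return k == 0 ? 0u : end_offset(packed, k - 1);
}

// The last field of the prefix sum is the total number of neighbours.
inline unsigned total_neighbours(std::uint64_t packed) {
  return end_offset(packed, kNumCategories - 1);
}
```

For example, counts of `{2, 0, 3, 1, 0, 0, 4, 1}` give the prefix sum `{2, 2, 5, 6, 6, 6, 10, 11}`: category 2 spans neighbour indices 2 to 4, and the final field yields the total of 11 without a separate counter.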
