CaloRecGPU: Double Gaussian, Moments Finally Understood, New Changes, Kernel Size Optimization Service and ART Tests
Major additions:
- `GPUKernelSizeOptimizerSvc`: optimizes block sizes, possibly reading them from JSON files, defaulting to the CUDA occupancy estimators. Comes with the associated interfaces, `IGPUKernelSizeOptimizerSvc` (within Athena) and `IGPUKernelSizeOptimizer` (without any Athena includes, for direct use in CUDA files).
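As a rough illustration of the "read from file, otherwise fall back to an estimator" behaviour described above, here is a minimal host-side C++ sketch. It is not the interface of `GPUKernelSizeOptimizerSvc`: the names `KernelSize` and `KernelSizeLookup`, and the callback standing in for the CUDA occupancy estimators, are all hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical launch configuration; field names are illustrative only.
struct KernelSize {
  int grid_size;
  int block_size;
};

// Hypothetical stand-in for the optimizer: a tabulated size (as if read
// from a JSON file) takes precedence, otherwise an estimator is consulted.
class KernelSizeLookup {
 public:
  using Estimator = std::function<KernelSize(const std::string&)>;

  explicit KernelSizeLookup(Estimator fallback)
      : m_fallback(std::move(fallback)) {}

  // Register a size for a kernel, as if it had been loaded from a file.
  void set(const std::string& kernel, KernelSize size) {
    m_table[kernel] = size;
  }

  // Return the tabulated size if present, else ask the fallback estimator.
  KernelSize get(const std::string& kernel) const {
    const auto it = m_table.find(kernel);
    return it != m_table.end() ? it->second : m_fallback(kernel);
  }

 private:
  std::map<std::string, KernelSize> m_table;
  Estimator m_fallback;
};

// Usage sketch: one kernel has a tabulated size, the other does not.
inline void example_kernel_size_lookup() {
  KernelSizeLookup lookup(
      [](const std::string&) { return KernelSize{4, 128}; });
  lookup.set("tabulated_kernel", KernelSize{16, 256});
  assert(lookup.get("tabulated_kernel").block_size == 256);
  assert(lookup.get("untabulated_kernel").block_size == 128);
}
```

In real CUDA code the fallback would presumably wrap something like `cudaOccupancyMaxPotentialBlockSize`; the callback here only marks where such a call would go.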
Major changes:
- Implemented double Gaussian noise. Clusters no longer match perfectly when using it, due to floating point accuracy; there is no real way to fix this.
- Rewrote kernels before moments calculation to use the memory allocated for the moments object as a temporary buffer
- Got rid of `PairsArr` in the API; it's now an implementation detail of cluster growing and splitting.
- Other changes/improvements in the kernel code, especially in terms of clarity...
- Fixed the moments calculation as much as possible; differences remain due to floating point accuracy and explicit cut-offs.
- Fixed inconsistent accuracy in CPU weighting of shared cells.
- Rewrote calculations to use some CUDA intrinsics for better efficiency and accuracy
- Started using `floats` for the matrix eigenvector/eigenvalue computation, as the extra accuracy of `double` was not enough to fix the differences we'd have anyway, so it does not make sense to hurt performance for (almost) no gain
- Implemented the EM cross talk cell time cut.
- Got rid of most instances of dynamic parallelism that were used to launch exactly the required number of threads, after testing showed it was more performant to launch more threads.
- Rewrote all the kernels to allow for arbitrary block and grid sizes, allowing potentially better resource utilization.
- Added support for cooperative kernel launches for more performant iterative kernels, falling back to the previous solution with dynamic parallelism (with tail launches if available) when cooperative launches are unavailable.
- Kernel block and grid sizes are no longer static, but set according to the `GPUKernelSizeOptimizerSvc`.
- Added a check to ART tests to ensure the average number of unmatched clusters and the average number of different cells are less than 0.1 (so the tests now actually do something)
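For reviewers less familiar with kernels that accept arbitrary block and grid sizes, the usual way to achieve this is a grid-stride loop. The sketch below is a plain C++ emulation on the CPU, not code from this merge request: `visit_counts` and `covers_exactly_once` are hypothetical helpers that only check that every element is handled exactly once, whatever launch configuration is chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of a grid-stride loop: each (block, thread) pair starts at
// its global index and advances by the total number of launched threads.
// Returns how many times each element was visited.
inline std::vector<int> visit_counts(std::size_t n_elements,
                                     std::size_t grid_size,
                                     std::size_t block_size) {
  std::vector<int> counts(n_elements, 0);
  const std::size_t stride = grid_size * block_size;
  for (std::size_t block = 0; block < grid_size; ++block) {
    for (std::size_t thread = 0; thread < block_size; ++thread) {
      // Equivalent of: i = blockIdx.x * blockDim.x + threadIdx.x
      for (std::size_t i = block * block_size + thread; i < n_elements;
           i += stride) {
        ++counts[i];
      }
    }
  }
  return counts;
}

// True if every element is visited exactly once for this configuration.
inline bool covers_exactly_once(std::size_t n_elements, std::size_t grid_size,
                                std::size_t block_size) {
  for (const int c : visit_counts(n_elements, grid_size, block_size)) {
    if (c != 1) return false;
  }
  return true;
}

// Usage sketch: fewer threads than elements, and more threads than elements.
inline void example_grid_stride() {
  assert(covers_exactly_once(1000, 3, 64));
  assert(covers_exactly_once(10, 8, 256));
}
```

Because correctness no longer depends on the launch configuration, the sizes can be tuned freely for resource utilization.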
Minor changes:
- Got rid of unnecessary legacy support in `StandaloneDataIO.h` (we never had results from previous versions we needed to compare to...).
- Added some new matching options (namely the option to only match clusters that have exactly the same cells, useful for moments comparisons...)
- Improved moments output histogram ranges for visibility
- Encapsulated duplicate code in import/export tools via a lambda
- Separated some of the CUDA-friendly structures/classes into new headers for better organization.
- Reworked neighbour handling to store the total number of neighbours in the last bits of the 64-bit offset storage, making it a full prefix sum...
- Added more information (subcalo, sampling and region) to the cell information, as well as `is_PS` and `is_HECIW_or_FCAL` for limited neighbour handling
- Added some CUDA error checking where missing
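To make the neighbour storage change above more concrete, here is a small C++ sketch of packing per-category neighbour counts into one 64-bit word as an inclusive prefix sum, so that the last field holds the total. The layout is an assumption made for illustration (10 categories of 6 bits each) and all names are hypothetical; the actual bit widths and category count in the code may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed layout (illustrative only): 10 fields of 6 bits each in one
// 64-bit word. Field k holds the inclusive prefix sum of the counts of
// categories 0..k, so the last field is the total number of neighbours.
constexpr int kNumCategories = 10;
constexpr int kBitsPerField = 6;
constexpr std::uint64_t kFieldMask = (std::uint64_t{1} << kBitsPerField) - 1;

inline std::uint64_t pack_counts(const std::array<int, kNumCategories>& counts) {
  std::uint64_t packed = 0;
  std::uint64_t running = 0;
  for (int k = 0; k < kNumCategories; ++k) {
    running += static_cast<std::uint64_t>(counts[k]);
    assert(running <= kFieldMask);  // every prefix sum must fit in one field
    packed |= running << (k * kBitsPerField);
  }
  return packed;
}

// One-past-the-end index of category k in the flat neighbour list.
inline int end_offset(std::uint64_t packed, int k) {
  return static_cast<int>((packed >> (k * kBitsPerField)) & kFieldMask);
}

// First index of category k: the end of the previous category, or 0.
inline int start_offset(std::uint64_t packed, int k) {
  return k == 0 ? 0 : end_offset(packed, k - 1);
}

// Total number of neighbours, read directly from the last field.
inline int total_neighbours(std::uint64_t packed) {
  return end_offset(packed, kNumCategories - 1);
}

// Usage sketch: category 2 occupies indices [2, 5) of the neighbour list.
inline void example_packed_offsets() {
  const std::uint64_t packed = pack_counts({2, 0, 3, 1, 0, 0, 4, 0, 0, 5});
  assert(start_offset(packed, 2) == 2);
  assert(end_offset(packed, 2) == 5);
  assert(total_neighbours(packed) == 15);
}
```

With the total stored alongside the offsets, both the range of any category and the overall neighbour count come from a single 64-bit load.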