Add support for architecture-specific optimizations
This MR introduces better support for architecture-specific optimizations. It also heavily refactors the backend code.
- `CudaCommon.h` et al. have been moved into their own folder, `backend`. It is structured in the following way:
  - `BackendCommon.h` is the universal entry point that should be used by any algorithm that requires cross-architecture functionality (e.g. support for CUDA keywords).
  - A backend exists for each target (e.g. `CPUBackend`, `CUDABackend`), each with its own implementation of the CUDA keywords and general functionality.
- Allen now supports backend-specific implementations. A dispatcher can be written with a set of overrides for architectures.

  ```c++
  using namespace Allen::device;
  dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(arguments...);
  dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(arguments...);
  ```

  For instance, in the first call `fn_cpu` would only be run when running on the device `TARGET_DEVICE_CPU`. In the second call, one of the three functions would be called, depending on the target.
- Allen now supports manual vectorization with the `UME::SIMD` library. `UME::SIMD` provides low-level access while remaining cross-architecture compatible (ARM, Power). It also provides a `scalar` backend, ensuring compatibility with other architectures. It is included as a git submodule.
  - `Vector.h` defines `Vector<T>` for the highest supported vector width on the current architecture, or scalar for compatibility.
  - It also defines `Vector128<T>`, `Vector256<T>` and `Vector512<T>`, which are either vectors of that bit width or scalar for compatibility.
  - AVX512 is not supported on gcc-8 onwards.
- The backend and vector behaviour can be changed with the following CMake options:
  - `CPU_STATIC_VECTOR_WIDTH`: Changes what `Vector<T>` is. Can be `OFF`, `scalar`, `128bits`, `256bits`, `512bits`.
  - `ALWAYS_DISPATCH_TO_DEFAULT`: Forces the dispatcher to always dispatch to the `target::Default` target.
- Two functions have been vectorized: `calculate_phi` and Search by triplet seeding.
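The dispatcher described above could resolve its overrides at compile time, roughly as follows. This is a minimal C++17 sketch and not Allen's actual implementation: the `sketch` namespace, `current_target`, `always_dispatch_to_default`, `index_of` and the toy functions are hypothetical, and the target is hard-coded to the CPU instead of being derived from the build configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

namespace sketch {
  namespace target {
    struct Default {};
    struct CPU {};
    struct CUDA {};
    struct HIP {};
  } // namespace target

  // Assumption: a real build would derive these two settings from its
  // configuration (e.g. TARGET_DEVICE_CPU and ALWAYS_DISPATCH_TO_DEFAULT).
  using current_target = target::CPU;
  constexpr bool always_dispatch_to_default = false;

  // Position of Wanted within Targets..., or sizeof...(Targets) if absent.
  template<typename Wanted, typename... Targets>
  constexpr std::size_t index_of()
  {
    constexpr bool matches[] = {std::is_same<Wanted, Targets>::value...};
    for (std::size_t i = 0; i < sizeof...(Targets); ++i) {
      if (matches[i]) return i;
    }
    return sizeof...(Targets);
  }

  // Picks the function registered for the current target, falling back to
  // the one registered for target::Default when there is no override.
  template<typename... Targets, typename... Fns>
  constexpr auto dispatch(Fns... fns)
  {
    static_assert(sizeof...(Targets) == sizeof...(Fns), "one function per target");
    constexpr std::size_t n = sizeof...(Targets);
    constexpr std::size_t specific = index_of<current_target, Targets...>();
    constexpr std::size_t fallback = index_of<target::Default, Targets...>();
    constexpr bool use_specific = specific < n && !always_dispatch_to_default;
    static_assert(use_specific || fallback < n, "no matching override and no default");
    constexpr std::size_t chosen = use_specific ? specific : fallback;
    return std::get<chosen>(std::make_tuple(fns...));
  }

  // Toy functions standing in for fn_default, fn_cpu, fn_cuda and fn_hip.
  inline int fn_default(int x) { return x + 1; }
  inline int fn_cpu(int x) { return x + 10; }
  inline int fn_cuda(int x) { return x + 100; }
  inline int fn_hip(int x) { return x + 1000; }
} // namespace sketch
```

In this sketch, with `current_target` set to `target::CPU`, `dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(1)` calls `fn_cpu`, while `dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(1)` has no CPU override and falls back to `fn_default`. Because the choice is made at compile time, the call itself carries no runtime branch.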
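The "vector of that bit width or scalar" behaviour of `Vector128<T>`, `Vector256<T>` and `Vector512<T>` can be illustrated without the library. The sketch below is an assumption about how such a selection might look, not the contents of `Vector.h`: `Lanes`, `native_bits` and `lanes` are hypothetical names, `Lanes` only models the lane count instead of wrapping a `UME::SIMD` type, and the width is inferred from compiler-defined macros rather than from `CPU_STATIC_VECTOR_WIDTH`.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {
  // Hypothetical stand-in for a SIMD type holding N lanes of T.
  template<typename T, std::size_t N>
  struct Lanes {
    T data[N];
    static constexpr std::size_t size = N;
  };

  // Widest vector width, in bits, enabled by the compiler flags.
  // 0 means that only the scalar fallback is available.
#if defined(__AVX512F__)
  constexpr std::size_t native_bits = 512;
#elif defined(__AVX2__)
  constexpr std::size_t native_bits = 256;
#elif defined(__SSE2__) || defined(__ARM_NEON)
  constexpr std::size_t native_bits = 128;
#else
  constexpr std::size_t native_bits = 0;
#endif

  // Lane count for a requested width: the full width if the target
  // supports it, otherwise a single lane (scalar).
  template<typename T>
  constexpr std::size_t lanes(std::size_t bits)
  {
    return (bits != 0 && bits <= native_bits) ? bits / (8 * sizeof(T)) : 1;
  }

  template<typename T> using Vector128 = Lanes<T, lanes<T>(128)>;
  template<typename T> using Vector256 = Lanes<T, lanes<T>(256)>;
  template<typename T> using Vector512 = Lanes<T, lanes<T>(512)>;

  // Highest supported width on the current architecture, or scalar.
  template<typename T> using Vector = Lanes<T, lanes<T>(native_bits)>;

  // Each fixed-width alias is either full width or scalar, never in between.
  static_assert(Vector128<float>::size == 4 || Vector128<float>::size == 1, "");
  static_assert(Vector256<float>::size == 8 || Vector256<float>::size == 1, "");
  static_assert(Vector512<float>::size == 16 || Vector512<float>::size == 1, "");
} // namespace sketch
```

Code written against `Vector<T>::size` then compiles unchanged on a target without vector support, where every alias collapses to one lane.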
- VELO specific optimizations
  - The number of hit candidates to consider from the previous module used to be a single constant. After a study, it was changed from a constant 5 to a constant 4, with 8 in the last module (the first one explored).
  - Note that further tuning can be done. For instance, one can set the number of candidates to 2 for most modules, which makes the VELO reconstruction up to 5% faster with very little loss in efficiency. Since it is quite fast at the moment anyway, it was left tuned for best efficiency.
- The CPU uses a `float` backend for `half_t` with no transformation, which introduces a slight divergence with respect to the GPU results.
Speedup (GPU) is about 10%.
-
Speedup (CPU) is about 2x.
-
Physics efficiency diff: https://gitlab.cern.ch/lhcb/Allen/-/jobs/9079802
-
Current throughput:
Quadro RTX 6000 │███████████████████████████████████████████████ 592.25 kHz GeForce RTX 2080 Ti │████████████████████████████████████████████ 560.07 kHz Tesla V100-PCIE-32GB │███████████████████████████████████████████ 548.67 kHz Tesla T4 │███████████████████ 246.88 kHz AMD EPYC 7502 32-Core Processor │████ 55.82 kHz Intel Xeon E5-2630 v4 │█ 17.63 kHz ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼ 0 100 200 300 400 500 600
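The per-module candidate count from the VELO notes above could be expressed as a small compile-time helper. This is only a sketch: `candidates_to_consider`, `default_candidates` and `last_module_candidates` are hypothetical names and the module indices are arbitrary. Only the values 4 and 8 come from this MR.

```cpp
#include <cassert>

namespace sketch {
  // Values from the description: 4 candidates for most modules and
  // 8 in the last module, which is the first one explored.
  constexpr unsigned default_candidates = 4;
  constexpr unsigned last_module_candidates = 8;

  // Hypothetical helper: number of previous-module hit candidates to
  // consider while processing `module`.
  constexpr unsigned candidates_to_consider(unsigned module, unsigned last_module)
  {
    return module == last_module ? last_module_candidates : default_candidates;
  }

  // Arbitrary module indices, used only to exercise both branches.
  static_assert(candidates_to_consider(7, 7) == 8, "last module searches wider");
  static_assert(candidates_to_consider(3, 7) == 4, "other modules use the default");
} // namespace sketch
```

Trying the faster setting mentioned above would then amount to changing `default_candidates` to 2 in this sketch.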
TODO
- A test (the MEP test) still doesn't compile and was commented out.
- Inspecting the callgrind output for the CPU, it was found that every function call was done through a dynamic call handler. This was turned off for this branch to test speed, but it should be possible to generate a `.so` without all of these calls.
- Test on AVX512 architecture.