Add support for architecture-specific optimizations
This MR introduces improved support for architecture-specific optimizations. It also heavily refactors the backend code.
- `CudaCommon.h` et al. have moved into their own folder, `backend`. It is structured in the following way:
  - `BackendCommon.h` is the universal entrypoint that should be used by any algorithm that requires cross-architecture functionality (e.g. support for CUDA keywords).
  - Backends exist for each target (e.g. `CPUBackend`, `CUDABackend`), each with its own implementation of the CUDA keywords and general functionality.
- Allen now supports backend-specific implementations. A dispatcher can be written with a set of overrides for architectures:

  ```cpp
  using namespace Allen::device;
  dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(arguments...);
  dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(arguments...);
  ```

  For instance, `fn_cpu` would only be run when running on the `TARGET_DEVICE_CPU` device. In the second call, which function is invoked depends on the target.
- Allen now supports manual vectorization with the `UME::SIMD` library, which is included as a git submodule. `UME::SIMD` provides low-level vector access while remaining cross-architecture compatible (ARM, Power). It also provides a `scalar` backend, ensuring compatibility with other architectures.
  - `Vector.h` defines `Vector<T>` for the highest supported vector width on the current architecture, or scalar for compatibility.
  - It also defines `Vector128<T>`, `Vector256<T>` and `Vector512<T>`, which are either vectors of that bit width or scalar for compatibility.
  - AVX512 is not supported on gcc-8 onwards.
- The backend and vector behaviour can be changed with the following CMake options:
  - `CPU_STATIC_VECTOR_WIDTH`: changes what `Vector<T>` is. Can be `OFF`, `scalar`, `128bits`, `256bits`, `512bits`.
  - `ALWAYS_DISPATCH_TO_DEFAULT`: forces the dispatcher to always dispatch to the `target::Default` target.
- Two functions have been vectorized: `calculate_phi` and Search by triplet seeding.
VELO specific optimizations

- The number of previous-module hit candidates to consider used to be a constant. After a study, it was changed from a constant 5 to a constant 4, with 8 in the last module (the first one explored).
- Further tuning is possible. For instance, setting the number of candidates to 2 for most modules makes the VELO reconstruction up to 5% faster with very little loss in efficiency. Since it is already quite fast, it was left tuned for best efficiency.
- The CPU uses a `float` backend for `half_t` with no transformation, which introduces a slight divergence wrt GPU results.
- Speedup (GPU): about 10%.
- Speedup (CPU): about 2x.
- Physics efficiency diff: https://gitlab.cern.ch/lhcb/Allen/-/jobs/9079802
Current throughput:

```
Quadro RTX 6000                 │███████████████████████████████████████████████ 592.25 kHz
GeForce RTX 2080 Ti             │████████████████████████████████████████████ 560.07 kHz
Tesla V100-PCIE-32GB            │███████████████████████████████████████████ 548.67 kHz
Tesla T4                        │███████████████████ 246.88 kHz
AMD EPYC 7502 32-Core Processor │████ 55.82 kHz
Intel Xeon E5-2630 v4           │█ 17.63 kHz
                                ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
                                0       100     200     300     400     500     600
```
TODO

- A test (the MEP test) still doesn't compile and has been commented out.
- Inspecting a callgrind profile of the CPU build, it was found that every function call went through a dynamic call handler. This was turned off on this branch to test speed, but it should be possible to generate a `.so` without all of these calls.
- Test on an AVX512 architecture.