This MR introduces a better way to support architecture-specific optimizations, and heavily refactors the backend code.
`CudaCommon.h` et al. have become their own `backend` folder. It is structured in the following way:

* `BackendCommon.h` is the universal entrypoint that should be used by any algorithm that requires cross-architecture functionality (eg. support for CUDA keywords). A usage sketch follows this list.
* Backends (`CPUBackend`, `CUDABackend`), each with its own implementation of the CUDA keywords and general functionality.
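For instance (a hypothetical kernel, purely illustrative), an algorithm only needs to include `BackendCommon.h` to use the CUDA keywords regardless of the target it is compiled for:

```cpp
#include "BackendCommon.h"

// Hypothetical kernel: __global__, threadIdx and blockDim are provided by the
// selected backend, so the same source builds for the CPU and CUDA targets.
__global__ void scale(float* data, float factor, unsigned n)
{
  for (unsigned i = threadIdx.x; i < n; i += blockDim.x) {
    data[i] *= factor;
  }
}
```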
Allen now supports backend-specific implementations. A dispatcher can be written with a set of overrides for architectures:
```cpp
using namespace Allen::device;
dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(arguments...);
dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(arguments...);
```
For instance, `fn_cpu` would only be run when running on the `TARGET_DEVICE_CPU` device. In the second function call, one of the three functions would be called depending on the target.
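As an illustration of how such an override pair might look (the function names below are hypothetical, not taken from this MR), the generic implementation and the architecture-specific one simply share a signature and are handed to the dispatcher:

```cpp
#include "BackendCommon.h"

// Hypothetical overrides sharing one signature. The dispatcher picks the one
// matching the target this translation unit is compiled for.
__device__ void init_default(float* out, unsigned n)
{
  // Generic version: strided loop over the threads of the block.
  for (unsigned i = threadIdx.x; i < n; i += blockDim.x) {
    out[i] = 0.f;
  }
}

__device__ void init_cpu(float* out, unsigned n)
{
  // CPU override: a plain loop that the compiler can auto-vectorize.
  for (unsigned i = 0; i < n; ++i) {
    out[i] = 0.f;
  }
}

__global__ void init_kernel(float* out, unsigned n)
{
  using namespace Allen::device;
  // Runs init_cpu on TARGET_DEVICE_CPU and init_default on any other target.
  dispatch<target::Default, target::CPU>(init_default, init_cpu)(out, n);
}
```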
Allen now supports manual vectorization with the UME::SIMD library. UME::SIMD provides low-level access to SIMD instructions while remaining cross-architecture compatible (ARM, Power). It also provides a scalar backend, ensuring compatibility with other architectures. It is included as a git submodule.
`Vector.h` defines:

* `Vector<T>`, which uses the highest supported vector width on the current architecture, or scalar for compatibility (a rough usage sketch follows this list).
* `Vector128<T>`, `Vector256<T>` and `Vector512<T>`, which are either vectors of that bit width or scalar for compatibility.
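The sketch below is purely illustrative: the member functions used (`load`, `store`, `size`) are assumptions made for the example, so consult `Vector.h` for the actual interface. It shows a SAXPY loop whose vector width adapts to the architecture:

```cpp
#include "Vector.h"

// Hypothetical SAXPY written with Vector<T>. The vector width follows the
// architecture (or CPU_STATIC_VECTOR_WIDTH); a scalar tail handles the rest.
void saxpy(float a, const float* x, float* y, unsigned n)
{
  constexpr unsigned width = Vector<float>::size(); // assumed width accessor
  unsigned i = 0;
  for (; i + width <= n; i += width) {
    Vector<float> vx, vy;
    vx.load(x + i);                 // assumed load/store helpers
    vy.load(y + i);
    vy = vx * a + vy;               // element-wise arithmetic
    vy.store(y + i);
  }
  for (; i < n; ++i) {
    y[i] = a * x[i] + y[i];         // scalar tail
  }
}
```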
The backend and vector behaviour can be changed with the following CMake options:

* `CPU_STATIC_VECTOR_WIDTH`: changes what `Vector<T>` is. Can be `OFF`, `scalar`, `128bits`, `256bits`, `512bits`.
* `ALWAYS_DISPATCH_TO_DEFAULT`: forces the dispatcher to always dispatch to the `target::Default` target.
The number of previous module hit candidates to consider used to be a constant. After a study, it was changed from a constant 5 to a constant 4, with 8 in the last module (the first one to be explored).
Note that further tuning is possible. For instance, setting the number of candidates to 2 for most modules makes the VELO reconstruction up to 5% faster with very little loss in efficiency. Since the reconstruction is already quite fast, it was left tuned for best efficiency.
On CPU, `half_t` is backed by `float` with no transformation, which introduces a slight divergence with respect to GPU results.
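Schematically (this is not the actual Allen definition, just an illustration of the behaviour described above), the CPU target keeps half-precision values at full `float` precision, whereas the CUDA target rounds them to 16 bits:

```cpp
// Schematic only, to illustrate the source of the small numerical differences.
#ifdef TARGET_DEVICE_CPU
using half_t = float;   // full float precision, no rounding applied
#else
#include <cuda_fp16.h>
using half_t = __half;  // 16-bit storage: values are rounded on conversion
#endif
```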
Speedup (GPU) is about 10%.
Speedup (CPU) is about 2x.
Physics efficiency diff: https://gitlab.cern.ch/lhcb/Allen/-/jobs/9079802
Current throughput:

```
Quadro RTX 6000                  │███████████████████████████████████████████████ 592.25 kHz
GeForce RTX 2080 Ti              │████████████████████████████████████████████ 560.07 kHz
Tesla V100-PCIE-32GB             │███████████████████████████████████████████ 548.67 kHz
Tesla T4                         │███████████████████ 246.88 kHz
AMD EPYC 7502 32-Core Processor  │████ 55.82 kHz
Intel Xeon E5-2630 v4            │█ 17.63 kHz
                                 ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
                                 0     100     200     300     400     500     600
```