Add support for architecture-specific optimizations
This MR improves support for architecture-specific optimizations. It also heavily refactors the backend code.

- `CudaCommon.h` et al. have moved into their own folder, `backend`. It is structured in the following way:
  - `BackendCommon.h` is the universal entry point that should be used by any algorithm that requires cross-architecture functionality (e.g. support for CUDA keywords).
  - A backend exists for each target (e.g. `CPUBackend`, `CUDABackend`), each with its own implementation of the CUDA keywords and general functionality.
- Allen now supports backend-specific implementations. A dispatcher can be written with a set of overrides for specific architectures:

      using namespace Allen::device;
      dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(arguments...);
      dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(arguments...);

  For instance, `fn_cpu` is only run when the target device is `TARGET_DEVICE_CPU`. In the second call, one of the three functions is called, depending on the target.
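
To make the dispatch mechanism above more concrete, here is a minimal C++17 sketch of compile-time dispatch on target tags. It is not Allen's implementation: the `sketch` namespace, the hard-coded `current_target`, the `index_of` helper and the `fn_*` bodies are assumptions made for illustration. Only the `dispatch<...>(...)(...)` call shape and the `target::` tag names come from this description.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

namespace sketch {
  // Tag types standing in for the targets named in the description
  namespace target {
    struct Default {};
    struct CPU {};
    struct CUDA {};
    struct HIP {};
  } // namespace target

  // Assumption: the build fixes one target at compile time; CPU is hard-coded here
  using current_target = target::CPU;

  // Position of Wanted in Tags..., or sizeof...(Tags) if it is not listed
  template<typename Wanted, typename... Tags>
  constexpr std::size_t index_of()
  {
    const bool matches[] = {std::is_same_v<Wanted, Tags>..., false};
    for (std::size_t i = 0; i < sizeof...(Tags); ++i) {
      if (matches[i]) return i;
    }
    return sizeof...(Tags);
  }

  // Returns the function registered for current_target, else the one for Default
  template<typename... Tags, typename... Fns>
  constexpr auto dispatch(Fns... fns)
  {
    static_assert(sizeof...(Tags) == sizeof...(Fns), "one function per target tag");
    constexpr std::size_t specific = index_of<current_target, Tags...>();
    constexpr std::size_t fallback = index_of<target::Default, Tags...>();
    constexpr std::size_t chosen = specific < sizeof...(Tags) ? specific : fallback;
    static_assert(chosen < sizeof...(Tags), "no override for this target and no Default");
    return std::get<chosen>(std::make_tuple(fns...));
  }
} // namespace sketch

// Hypothetical implementations, used only to show which one is selected
inline int fn_default(int x) { return x + 1; }
inline int fn_cpu(int x) { return x + 2; }
inline int fn_cuda(int x) { return x + 3; }
inline int fn_hip(int x) { return x + 4; }

inline void check_examples()
{
  using namespace sketch;
  // A CPU override is listed and the target is CPU, so fn_cpu runs
  assert((dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(10) == 12));
  // No CPU override is listed, so the call falls back to fn_default
  assert((dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(10) == 11));
}
```

In this sketch the choice is made at compile time, so no run-time branch on the target is needed. Whether Allen's dispatcher resolves overrides in exactly this way is not stated in this description.
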
- Allen now supports manual vectorization with the `UME::SIMD` library. `UME::SIMD` provides low-level access while remaining cross-architecture compatible (ARM, Power). It also provides a `scalar` backend, ensuring compatibility with other architectures. It is included as a git submodule.
  - `Vector.h` defines `Vector<T>` as the highest supported vector width on the current architecture, or scalar for compatibility.
  - It also defines `Vector128<T>`, `Vector256<T>` and `Vector512<T>`, which are either vectors of that bit width or scalar for compatibility.
  - AVX512 is not supported on gcc-8 onwards.
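
The width selection described above can be illustrated with a small, self-contained C++17 sketch. It does not use `UME::SIMD` and is not the content of `Vector.h`: `Pack`, `native_bits` and `lanes` are hypothetical names, and the detection relies on GCC/Clang-style feature macros as an assumption. It only shows the idea of "a vector of that bit width, or scalar if the architecture lacks it".

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {
  // Widest vector width in bits enabled by the compiler flags; 0 means scalar only.
  // Assumption: GCC/Clang-style feature macros are used for detection.
#if defined(__AVX512F__)
  constexpr std::size_t native_bits = 512;
#elif defined(__AVX2__)
  constexpr std::size_t native_bits = 256;
#elif defined(__SSE2__) || defined(__ARM_NEON)
  constexpr std::size_t native_bits = 128;
#else
  constexpr std::size_t native_bits = 0;
#endif

  // Number of lanes of T in a Bits-wide vector, or 1 (scalar) when that
  // width is not available on this architecture
  template<typename T, std::size_t Bits>
  constexpr std::size_t lanes()
  {
    return (Bits != 0 && Bits <= native_bits) ? Bits / (8 * sizeof(T)) : 1;
  }

  // Hypothetical stand-in for a SIMD register: a fixed-size pack of lanes
  template<typename T, std::size_t N>
  struct Pack {
    T data[N];
    static constexpr std::size_t size() { return N; }
  };

  // Lane-wise addition, written as a plain loop over the lanes
  template<typename T, std::size_t N>
  Pack<T, N> operator+(const Pack<T, N>& a, const Pack<T, N>& b)
  {
    Pack<T, N> result {};
    for (std::size_t i = 0; i < N; ++i) result.data[i] = a.data[i] + b.data[i];
    return result;
  }

  template<typename T> using Vector128 = Pack<T, lanes<T, 128>()>;
  template<typename T> using Vector256 = Pack<T, lanes<T, 256>()>;
  template<typename T> using Vector512 = Pack<T, lanes<T, 512>()>;
  // Widest available width, or a single lane when no SIMD width is enabled
  template<typename T> using Vector = Pack<T, lanes<T, native_bits>()>;

  // Adds two full-width vectors and checks every lane of the result
  inline void check_vectors()
  {
    Vector<float> a {}, b {};
    for (std::size_t i = 0; i < Vector<float>::size(); ++i) {
      a.data[i] = 1.f;
      b.data[i] = 2.f;
    }
    const auto c = a + b;
    for (std::size_t i = 0; i < c.size(); ++i) assert(c.data[i] == 3.f);
  }
} // namespace sketch
```

Code written against `Vector<T>::size()` rather than a fixed lane count compiles unchanged for every width, including the scalar fallback, which is the property the aliases above are meant to provide.
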
- The backend and vector behaviour can be changed with the following CMake options:
  - `CPU_STATIC_VECTOR_WIDTH`: Changes what `Vector<T>` is. Can be `OFF`, `scalar`, `128bits`, `256bits` or `512bits`.
  - `ALWAYS_DISPATCH_TO_DEFAULT`: Forces the dispatcher to always dispatch to the `target::Default` target.
- Two functions have been vectorized: `calculate_phi` and Search by triplet seeding.
- VELO-specific optimizations:
  - The number of previous-module hit candidates to consider used to be a single constant. After a study, it was changed from a constant 5 to 4 for most modules and 8 for the last module (the first one to be explored).
  - Note that further tuning is possible. For instance, setting the number of candidates to 2 for most modules makes the VELO reconstruction up to 5% faster with very little loss in efficiency. Since it is already quite fast, it was left tuned for best efficiency.
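
As a rough illustration of the per-module candidate count described above, the rule can be written as a small `constexpr` helper. This is a sketch only: the function name, the constants' names and the module indexing (0 to `number_of_modules - 1`, with the search starting at the last module) are assumptions, and the module count used in the checks is an illustrative value.

```cpp
#include <cassert>

namespace sketch {
  // Values quoted in the description
  constexpr unsigned candidates_most_modules = 4;
  constexpr unsigned candidates_last_module = 8;

  // Hypothetical helper. Assumption: modules are indexed 0 .. number_of_modules - 1
  // and the search starts from the last one.
  constexpr unsigned number_of_candidates(unsigned module_index, unsigned number_of_modules)
  {
    return module_index + 1 == number_of_modules ? candidates_last_module : candidates_most_modules;
  }

  // Checks the rule for an illustrative detector with 52 modules
  inline void check_candidates()
  {
    assert(number_of_candidates(51, 52) == 8); // last module, explored first
    assert(number_of_candidates(20, 52) == 4); // any other module
  }
} // namespace sketch
```

Trying the faster setting mentioned above would amount to changing `candidates_most_modules` to 2 in this sketch.
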
- The CPU uses a `float` backend for `half_t` with no transformation, which introduces a slight divergence with respect to GPU results.
- Speedup (GPU) is about 10%.
- Speedup (CPU) is about 2x.
- Physics efficiency diff: https://gitlab.cern.ch/lhcb/Allen/-/jobs/9079802
- Current throughput:

      Quadro RTX 6000                 │███████████████████████████████████████████████ 592.25 kHz
      GeForce RTX 2080 Ti             │████████████████████████████████████████████ 560.07 kHz
      Tesla V100-PCIE-32GB            │███████████████████████████████████████████ 548.67 kHz
      Tesla T4                        │███████████████████ 246.88 kHz
      AMD EPYC 7502 32-Core Processor │████ 55.82 kHz
      Intel Xeon E5-2630 v4           │█ 17.63 kHz
                                      ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
                                      0      100     200     300     400     500     600

TODO
- A test (the MEP test) still doesn't compile and was commented out.
- Inspecting the callgrind output of the CPU build showed that every function call went through a dynamic call handler. This was turned off for this branch to test speed, but it should be possible to generate a `.so` without all of these calls.
- Test on an AVX512 architecture.