Add support for architecture-specific optimizations (!414)

This MR adds a cleaner mechanism for architecture-specific optimizations. It also heavily refactors the backend code.

CudaCommon.h and its related headers have moved into their own folder, backend. It is structured in the following way:
- BackendCommon.h is the universal entrypoint that should be used by any algorithm that requires cross-architecture functionality (e.g. support for CUDA keywords).
- A backend exists for each target (e.g. CPUBackend, CUDABackend), each with its own implementation of the CUDA keywords and general functionality.
Allen now supports backend-specific implementations. A dispatcher can be written with a set of overrides for architectures.
```
using namespace Allen::device;
dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(arguments...);
dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(arguments...);
```
For instance, in the first call fn_cpu runs only when the target device is the CPU (TARGET_DEVICE_CPU); otherwise fn_default runs. In the second call, fn_cuda, fn_hip or fn_default is called depending on the target.
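To illustrate the idea, here is a minimal, self-contained sketch of how such a compile-time dispatcher could work. It is not the Allen implementation: the namespace sketch, the alias current_target (hard-coded to CPU here instead of being set by the build) and the helper index_of are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

namespace sketch {
  namespace target {
    struct Default {};
    struct CPU {};
    struct CUDA {};
    struct HIP {};
  } // namespace target

  // Assumption: the build fixes the current target at compile time.
  // It is hard-coded to CPU here so that the example runs anywhere.
  using current_target = target::CPU;

  // Position of Wanted within Targets..., or sizeof...(Targets) if absent.
  template<typename Wanted, typename... Targets>
  constexpr std::size_t index_of()
  {
    // Trailing false keeps the array non-empty for an empty pack.
    constexpr bool matches[] = {std::is_same<Wanted, Targets>::value..., false};
    for (std::size_t i = 0; i < sizeof...(Targets); ++i) {
      if (matches[i]) return i;
    }
    return sizeof...(Targets);
  }

  // Returns the function registered for the current target,
  // or the one registered for target::Default if there is no override.
  template<typename... Targets, typename... Fns>
  constexpr auto dispatch(Fns... fns)
  {
    static_assert(sizeof...(Targets) == sizeof...(Fns), "one function per listed target");
    constexpr std::size_t n = sizeof...(Targets);
    constexpr std::size_t exact = index_of<current_target, Targets...>();
    constexpr std::size_t fallback = index_of<target::Default, Targets...>();
    static_assert(exact < n || fallback < n, "no override for this target and no Default");
    return std::get<(exact < n ? exact : fallback)>(std::make_tuple(fns...));
  }

  // Toy functions that only report which implementation was selected.
  inline int fn_default(int) { return 0; }
  inline int fn_cpu(int) { return 1; }
  inline int fn_cuda(int) { return 2; }
  inline int fn_hip(int) { return 3; }

  inline int run_example()
  {
    // A CPU override is listed, so fn_cpu is selected.
    const int a = dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(7);
    // No CPU override is listed, so the call falls back to fn_default.
    const int b = dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(7);
    assert(a == 1);
    assert(b == 0);
    return a * 10 + b;
  }
} // namespace sketch
```

Because the selection happens entirely at compile time, only the chosen function is called at runtime and there is no branching on the target in the generated code.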
Allen now supports manual vectorization with the UME::SIMD library. UME::SIMD provides low-level access while remaining compatible across architectures (ARM, Power). It also provides a scalar backend, which ensures compatibility with other architectures. It is included as a git submodule.
- Vector.h defines Vector<T>, which has the highest vector width supported on the current architecture, or is scalar for compatibility.
- It also defines Vector128<T>, Vector256<T> and Vector512<T>, which are either vectors of that bit width or scalar for compatibility.
- AVX512 is not supported on gcc-8 onwards.
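The "vector of that width, or scalar for compatibility" rule can be sketched in plain C++ as below. This deliberately does not use the UME::SIMD API: FixedVector, lane_count, max_bits and VectorOrScalar are hypothetical stand-ins, and the width detection relies only on the compiler's predefined macros (__AVX512F__, __AVX2__, __SSE2__, __ARM_NEON, __ALTIVEC__), which may differ from how Vector.h decides.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

namespace sketch {
  // Number of lanes of type T in a register of `bits` bits.
  // bits == 0 denotes the scalar fallback, which has a single lane.
  template<typename T>
  constexpr std::size_t lane_count(std::size_t bits)
  {
    return bits == 0 ? 1 : bits / (8 * sizeof(T));
  }

  // Stand-in for a SIMD register type (hypothetical, not a UME::SIMD type).
  template<typename T, std::size_t Bits>
  struct FixedVector {
    std::array<T, lane_count<T>(Bits)> lanes;
    static constexpr std::size_t size() { return lane_count<T>(Bits); }
  };

  // Widest register assumed to be available, from predefined compiler macros.
#if defined(__AVX512F__)
  constexpr std::size_t max_bits = 512;
#elif defined(__AVX2__)
  constexpr std::size_t max_bits = 256;
#elif defined(__SSE2__) || defined(__ARM_NEON) || defined(__ALTIVEC__)
  constexpr std::size_t max_bits = 128;
#else
  constexpr std::size_t max_bits = 0; // scalar only
#endif

  // A vector of the requested width if supported, otherwise scalar.
  template<typename T, std::size_t Bits>
  using VectorOrScalar = FixedVector<T, (Bits <= max_bits ? Bits : 0)>;

  template<typename T> using Vector128 = VectorOrScalar<T, 128>;
  template<typename T> using Vector256 = VectorOrScalar<T, 256>;
  template<typename T> using Vector512 = VectorOrScalar<T, 512>;

  // The widest vector supported on this architecture, or scalar.
  template<typename T> using Vector = FixedVector<T, max_bits>;

  inline bool widths_are_consistent()
  {
    // Each fixed-width alias is either its nominal width or a single lane.
    assert(Vector128<float>::size() == 4 || Vector128<float>::size() == 1);
    assert(Vector256<float>::size() == 8 || Vector256<float>::size() == 1);
    assert(Vector512<float>::size() == 16 || Vector512<float>::size() == 1);
    // No fixed-width alias is wider than the default Vector<T>.
    return Vector128<float>::size() <= Vector<float>::size() &&
           Vector256<float>::size() <= Vector<float>::size() &&
           Vector512<float>::size() <= Vector<float>::size();
  }
} // namespace sketch
```

Code written against Vector256<T> in this scheme still compiles on a machine without 256-bit registers; it simply processes one element at a time.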
The backend and vector behaviour can be changed with the following CMake options:
- CPU_STATIC_VECTOR_WIDTH: Sets the width of Vector<T>. Accepted values are OFF, scalar, 128bits, 256bits and 512bits.
- ALWAYS_DISPATCH_TO_DEFAULT: Forces the dispatcher to always dispatch to the target::Default target.
Two functions have been vectorized: calculate_phi and Search by triplet seeding.
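As a rough illustration of what manual vectorization of a function like calculate_phi involves, the sketch below shows the usual loop structure: full batches of lanes elements, followed by a scalar tail. It is a plain C++ stand-in, not the code in this MR. It assumes phi is the azimuthal angle atan2(y, x) of each hit, and the names lanes and phi_of are hypothetical. With a SIMD library, each inner per-lane loop would become a single vector load, operation or store.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {
  // Hypothetical batch size: 8 floats, as in a 256-bit register.
  constexpr std::size_t lanes = 8;

  // Assumption: phi is the azimuthal angle atan2(y, x) of each hit.
  inline void calculate_phi(const float* x, const float* y, float* phi, std::size_t n)
  {
    std::size_t i = 0;

    // Main loop: one full batch of `lanes` hits per iteration.
    for (; i + lanes <= n; i += lanes) {
      float bx[lanes], by[lanes], bphi[lanes];
      for (std::size_t l = 0; l < lanes; ++l) { // stands in for a vector load
        bx[l] = x[i + l];
        by[l] = y[i + l];
      }
      for (std::size_t l = 0; l < lanes; ++l) { // stands in for a vector atan2
        bphi[l] = std::atan2(by[l], bx[l]);
      }
      for (std::size_t l = 0; l < lanes; ++l) { // stands in for a vector store
        phi[i + l] = bphi[l];
      }
    }

    // Scalar tail: the remaining n % lanes hits.
    for (; i < n; ++i) {
      phi[i] = std::atan2(y[i], x[i]);
    }
  }

  // Convenience wrapper for testing the function on small inputs.
  inline std::vector<float> phi_of(const std::vector<float>& x, const std::vector<float>& y)
  {
    assert(x.size() == y.size());
    std::vector<float> phi(x.size());
    calculate_phi(x.data(), y.data(), phi.data(), x.size());
    return phi;
  }
} // namespace sketch
```

The scalar tail matters for correctness: without it, inputs whose size is not a multiple of the batch size would leave the last few hits unprocessed.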

VELO specific optimizations

The number of hit candidates to consider from the previous module used to be a constant 5. After a study, it was changed to 4 for all modules except the last module (the first one explored), where it is 8.
Note that further tuning can be done. For instance, one can set the number of candidates to 2 for most modules, which makes the VELO reconstruction up to 5% faster with very little loss in efficiency. Since the reconstruction is already quite fast, it was left tuned for best efficiency.
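The per-module setting described above could be expressed as a small compile-time function. This is only a sketch of the rule as stated here; the names number_of_candidates, candidates_default and candidates_first_explored are hypothetical and are not taken from the Allen code.

```cpp
#include <cassert>

namespace sketch {
  // Values from the description above: 4 in general, 8 in the last module.
  constexpr unsigned candidates_default = 4;
  constexpr unsigned candidates_first_explored = 8;

  // module:      index of the module being processed
  // last_module: index of the last module, which is the first one explored
  constexpr unsigned number_of_candidates(unsigned module, unsigned last_module)
  {
    return module == last_module ? candidates_first_explored : candidates_default;
  }

  // The rule can be checked at compile time for an example layout of 10 modules.
  static_assert(number_of_candidates(9, 9) == 8, "last module uses 8 candidates");
  static_assert(number_of_candidates(0, 9) == 4, "other modules use 4 candidates");
} // namespace sketch
```

Trying the faster setting mentioned above would amount to changing candidates_default to 2 in this sketch.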
The CPU backend uses float for half_t with no transformation, which introduces a slight divergence with respect to GPU results.
Speedup (GPU) is about 10%.
Speedup (CPU) is about 2x.
Physics efficiency diff: https://gitlab.cern.ch/lhcb/Allen/-/jobs/9079802

Current throughput:

Quadro RTX 6000                 │███████████████████████████████████████████████  592.25 kHz
GeForce RTX 2080 Ti             │████████████████████████████████████████████     560.07 kHz
Tesla V100-PCIE-32GB            │███████████████████████████████████████████      548.67 kHz
Tesla T4                        │███████████████████                              246.88 kHz
AMD EPYC 7502 32-Core Processor │████                                             55.82 kHz
Intel Xeon E5-2630 v4           │█                                                17.63 kHz
                                ┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼───┴───┼
                                0      100     200     300     400     500     600

TODO

The MEP test still doesn't compile and was commented out.
Inspecting the callgrind output for the CPU build showed that every function call goes through a dynamic call handler. This was turned off in this branch to test speed, but it should be possible to generate a .so without all of these calls.
Test on an AVX512 architecture.

Edited Oct 01, 2020 by Daniel Hugo Campora Perez
