Add support for architecture-specific optimizations

Merged Daniel Hugo Campora Perez requested to merge dcampora_test_velo_performance into master

This MR introduces a cleaner way to support architecture-specific optimizations. It also heavily refactors the backend code.

  • CudaCommon.h et al. have been moved into their own folder, backend. It is structured in the following way:

    • BackendCommon.h is the universal entry point that should be used by any algorithm that requires cross-architecture functionality (e.g. support for CUDA keywords).
    • A backend exists for each target (e.g. CPUBackend, CUDABackend), each with its own implementation of the CUDA keywords and general functionality.
  • Allen now supports backend-specific implementations. A dispatcher can be written with a set of overrides for architectures.

    using namespace Allen::device;
    dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(arguments...);
    dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(arguments...);

    For instance, fn_cpu is only run when the target device is TARGET_DEVICE_CPU; on any other target, fn_default is run. In the second call, fn_cuda or fn_hip is called on the corresponding target, and fn_default otherwise.

  • Allen now supports manual vectorization with the UME::SIMD library. UME::SIMD provides low-level access to vector instructions while remaining cross-architecture compatible (ARM, Power). It also provides a scalar backend, ensuring compatibility with other architectures. It is included as a git submodule.

    • Vector.h defines Vector<T>, which uses the highest vector width supported on the current architecture, or falls back to scalar for compatibility.
    • It also defines Vector128<T>, Vector256<T> and Vector512<T>, which are either vectors of that bit width or scalar for compatibility.
    • AVX512 is not supported on gcc-8 onwards.
  • The backend and vector behaviour can be changed with the following CMake options:

    • CPU_STATIC_VECTOR_WIDTH: Changes what Vector<T> is. Can be OFF, scalar, 128bits, 256bits, 512bits.
    • ALWAYS_DISPATCH_TO_DEFAULT: Forces the dispatcher to always dispatch to the target::Default target.
  • Two functions have been vectorized: calculate_phi and the Search by triplet seeding.
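
To illustrate the dispatcher described above, here is a minimal, self-contained C++17 sketch of how a compile-time dispatch over target tags could work. It is not the Allen implementation: the target namespace, active_target, index_of and the fn_* functions are hypothetical stand-ins, and it assumes that the build defines at most one TARGET_DEVICE_* macro and that the ALWAYS_DISPATCH_TO_DEFAULT CMake option reaches the code as a preprocessor definition of the same name.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-ins for the tags in Allen::device::target.
namespace target {
  struct Default {};
  struct CPU {};
  struct CUDA {};
  struct HIP {};
} // namespace target

// Assumption: the build defines at most one TARGET_DEVICE_* macro; with none
// defined, this sketch treats the CPU as the active target.
#if defined(TARGET_DEVICE_CUDA)
using active_target = target::CUDA;
#elif defined(TARGET_DEVICE_HIP)
using active_target = target::HIP;
#else
using active_target = target::CPU;
#endif

// Position of Wanted within Tags..., or sizeof...(Tags) when it is absent.
template<typename Wanted, typename... Tags>
constexpr std::size_t index_of()
{
  // The trailing false keeps the array non-empty for an empty Tags pack.
  constexpr bool matches[] = {std::is_same_v<Wanted, Tags>..., false};
  for (std::size_t i = 0; i < sizeof...(Tags); ++i) {
    if (matches[i]) return i;
  }
  return sizeof...(Tags);
}

// Returns the function registered for the active target, or the one
// registered for target::Default when no override matches.
template<typename... Tags, typename... Fns>
constexpr auto dispatch(Fns... fns)
{
  static_assert(sizeof...(Tags) == sizeof...(Fns), "one function per target tag");
#if defined(ALWAYS_DISPATCH_TO_DEFAULT)
  constexpr std::size_t active = sizeof...(Tags); // behave as if no override matched
#else
  constexpr std::size_t active = index_of<active_target, Tags...>();
#endif
  constexpr std::size_t fallback = index_of<target::Default, Tags...>();
  constexpr std::size_t chosen = active < sizeof...(Tags) ? active : fallback;
  static_assert(chosen < sizeof...(Tags), "no override matches and no Default given");
  return std::get<chosen>(std::make_tuple(fns...));
}

// Hypothetical implementations, named after the ones in the description.
constexpr int fn_default(int x) { return x; }
constexpr int fn_cpu(int x) { return x + 100; }
constexpr int fn_cuda(int x) { return x + 200; }
constexpr int fn_hip(int x) { return x + 300; }
```

In a build with no TARGET_DEVICE_* macro defined, dispatch<target::Default, target::CPU>(fn_default, fn_cpu)(1) would select fn_cpu, while dispatch<target::Default, target::CUDA, target::HIP>(fn_default, fn_cuda, fn_hip)(1) would fall back to fn_default. Because the choice depends only on types, it is resolved at compile time and adds no runtime branch.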

VELO specific optimizations

  • The number of previous module hit candidates to consider used to be a constant 5. After a study, it was changed to 4 for all modules except the last one (the first to be explored), where it is 8.

  • Note that further tuning can be done. For instance, one can set the number of candidates to 2 for most modules, which makes the VELO reconstruction up to 5% faster with very little loss in efficiency. Since the reconstruction is already quite fast, it was left tuned for best efficiency.

  • On CPU, half_t is backed by float with no conversion, which introduces a slight divergence with respect to the GPU results.

  • Speedup (GPU) is about 10%.

  • Speedup (CPU) is about 2x.

  • Physics efficiency diff:

  • Current throughput:

    Quadro RTX 6000                 │███████████████████████████████████████████████  592.25 kHz
    GeForce RTX 2080 Ti             │████████████████████████████████████████████     560.07 kHz
    Tesla V100-PCIE-32GB            │███████████████████████████████████████████      548.67 kHz
    Tesla T4                        │███████████████████                              246.88 kHz
    AMD EPYC 7502 32-Core Processor │████                                             55.82 kHz
    Intel Xeon E5-2630 v4           │█                                                17.63 kHz
                                    0      100     200     300     400     500     600   


  • The MEP test still doesn't compile and was commented out.
  • Inspecting the callgrind output of the CPU build, it was found that every function call went through a dynamic call handler. This was turned off for this branch to test speed, but it should be possible to generate a .so without all of these calls.
  • Test on AVX512 architecture.
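
For reference, the per-module candidate count described under "VELO specific optimizations" could be expressed as a small compile-time lookup. This is only a sketch: the names candidates_default, candidates_first_explored and number_of_candidates are hypothetical, and only the values 4 and 8 come from the description above.

```cpp
// Hypothetical names; the values are the ones quoted in the description.
constexpr unsigned candidates_default = 4;        // all modules except the first one explored
constexpr unsigned candidates_first_explored = 8; // last VELO module, which is explored first

// Number of previous-module hit candidates to consider for a given module.
// Assumption: the search starts at last_module_index.
constexpr unsigned number_of_candidates(unsigned module_index, unsigned last_module_index)
{
  return module_index == last_module_index ? candidates_first_explored : candidates_default;
}
```

Trading efficiency for speed, as mentioned above, would then amount to lowering candidates_default (for instance to 2) while keeping the value for the first explored module.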
Edited by Daniel Hugo Campora Perez
