Dropped usage of VectorClass (!2634) · Merge requests · LHCb / LHCb

Sebastien Ponce requested to merge sponce_VCLdrop into master Jun 25, 2020

This goes together with Rec!2128 (merged).

Besides cleanup of commented-out code, the major part is a replacement of VCL by Vc in the Similarity.cpp and MagneticFieldGrid.cpp files. This is not necessarily ideal, as more generic code would have been better, but the investment would have been large, so I stopped there for the moment. The aim was to avoid having to port the code to VCL2.0.

Concerning speed, all the new code has been benchmarked using Google Benchmark and compared to the previous version for the SSE, AVX2 and AVX512 builds. Note that there is no change for the scalar code and that there is no specific AVX512 implementation: the AVX2 code is used, but on a different processor, leading to different results. Moreover, for the MagneticFieldGrid case, there is a single piece of code (using SSE) for all platforms.

Here are the results for the different methods in Similarity.cpp.

SSE4.2 (nehalem)
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
similarity_51_original<double>       4.52 ns         4.51 ns    154175915
similarity_51_VC<double>             5.86 ns         5.84 ns    120056749
similarity_55_original<double>       37.8 ns         37.7 ns     18550952
similarity_55_VC<double>             38.4 ns         38.3 ns     18344052
similarity_57_original<double>       69.3 ns         69.1 ns     10761154
similarity_57_VC<double>             69.2 ns         69.0 ns      9990288
average_original<double>              110 ns          110 ns      8600122
average_VC<double>                   83.1 ns         82.9 ns      7329628
filter_original<double>              26.3 ns         26.3 ns     26581108
filter_VC<double>                    26.1 ns         26.0 ns     26588054

AVX2 (haswell)
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
similarity_51_original<double>       5.03 ns         5.01 ns    133549995
similarity_51_VC<double>             5.09 ns         5.08 ns    138580997
similarity_55_original<double>       34.2 ns         34.1 ns     20485924
similarity_55_VC<double>             29.3 ns         29.2 ns     23953218
similarity_57_original<double>       53.2 ns         53.1 ns     13259869
similarity_57_VC<double>             53.1 ns         53.0 ns     13188235
average_original<double>             76.7 ns         76.5 ns     13418446
average_VC<double>                   52.2 ns         52.1 ns     10991279
filter_original<double>              29.7 ns         29.7 ns     23724897
filter_VC<double>                    30.8 ns         30.7 ns     22851429

AVX512 (cascadelake)
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
similarity_51_original<double>       4.88 ns         4.87 ns    127101152
similarity_51_VC<double>             5.11 ns         5.10 ns    134839807
similarity_55_original<double>       34.5 ns         34.4 ns     21153696
similarity_55_VC<double>             38.1 ns         38.0 ns     13325311
similarity_57_original<double>       49.9 ns         49.8 ns     13746851
similarity_57_VC<double>             50.1 ns         50.0 ns     14028532
average_original<double>             69.2 ns         69.1 ns     10000000
average_VC<double>                   49.0 ns         48.9 ns     14350849
filter_original<double>              30.4 ns         30.3 ns     23242475
filter_VC<double>                    32.4 ns         32.3 ns     21662432

Here is my summary:

Two methods are slightly slower (2-5% level): similarity_5_1 and filter, the latter only on AVX2 and AVX512. Note that the SSE4.2 table shows a larger gap for similarity_5_1 (4.52 ns vs 5.86 ns)
similarity_5_5 shows no change in SSE, a gain of 15% in AVX2 and a 10% loss in AVX512
similarity_5_7 did not change speed
average is substantially faster on all platforms, with a gain of 25-30%

Last point: all methods, except maybe similarity_5_1, could be made much faster than they are with a bit of low-level optimization. I have not spent much time on this so far. Just tell me whether it's worth it.

And here are the results for the only method in MagneticFieldGrid.cpp.

SSE4.2 (nehalem)
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
interpolation_original<float>       3550 ns         3542 ns       196992
interpolation_VC<float>             3572 ns         3563 ns       196354

AVX2 (haswell)
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
interpolation_original<float>       3315 ns         3307 ns       211537
interpolation_VC<float>             3636 ns         3627 ns       192809

AVX512 (cascadelake)
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
interpolation_original<float>       3284 ns         3276 ns       213556
interpolation_VC<float>             3650 ns         3641 ns       191710

Basically no change on nehalem, but quite a loss on the AVX platforms (about 10%). Same question as for the previous part: is it worth investigating further? It is definitely possible to get back to the previous speed, if not exceed it, but it will take some time.

Edited Aug 14, 2020 by Marco Cattaneo
