Dropped usage of VectorClass (!2634) · Merge requests · LHCb / LHCb

Merged Sebastien Ponce requested to merge sponce_VCLdrop into master 4 years ago

This goes together with Rec!2128 (merged).

Besides cleanup of commented code, the major part is a replacement of VCL by Vc in the Similarity.cpp and MagneticFieldGrid.cpp files. This is not necessarily ideal, as more generic code would have been better, but the investment would have been large, so I stopped there for the moment. The goal was to avoid porting the code to VCL 2.0.
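For readers unfamiliar with the methods involved, here is a minimal scalar sketch of what a generic similarity transform could look like. It is not the code of this merge request. It assumes that similarity_5_N computes T = F * C * F^T, with C a symmetric 5x5 matrix stored as a packed lower triangle and F an N x 5 row-major matrix. The names `packed_index` and `similarity` are hypothetical and chosen for illustration only.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: index of element (i, j) of a symmetric matrix stored
// as a packed lower triangle, row by row: (0,0), (1,0), (1,1), (2,0), ...
constexpr std::size_t packed_index(std::size_t i, std::size_t j) {
  return i >= j ? i * (i + 1) / 2 + j : j * (j + 1) / 2 + i;
}

// Scalar similarity transform T = F * C * F^T (assumed meaning of "similarity").
//   C : N x N symmetric, packed lower triangle
//   F : M x N, row-major
//   T : M x M symmetric, packed lower triangle (returned)
template <std::size_t N, std::size_t M, typename T>
std::array<T, M * (M + 1) / 2> similarity(const std::array<T, N * (N + 1) / 2>& C,
                                          const std::array<T, M * N>& F) {
  std::array<T, M * (M + 1) / 2> out{};
  for (std::size_t a = 0; a < M; ++a) {
    // tmp = C * (row a of F)^T
    std::array<T, N> tmp{};
    for (std::size_t i = 0; i < N; ++i)
      for (std::size_t j = 0; j < N; ++j)
        tmp[i] += C[packed_index(i, j)] * F[a * N + j];
    // Only the lower triangle of the symmetric result is computed.
    for (std::size_t b = 0; b <= a; ++b) {
      T sum{};
      for (std::size_t i = 0; i < N; ++i) sum += F[b * N + i] * tmp[i];
      out[packed_index(a, b)] = sum;
    }
  }
  return out;
}
```

Under these assumptions, `similarity<5, 1>(C, F)` would play the role of similarity_5_1, `similarity<5, 5>` that of similarity_5_5, and so on. A vectorized version would replace the inner loops with SIMD operations.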

Concerning speed, all the new code has been benchmarked using Google Benchmark and compared to the previous version for the SSE, AVX2 and AVX512 versions. Note that there is no change for the scalar code and that there is no specific AVX512 implementation: the AVX2 code is used but on a different processor, leading to different results. Moreover, for the MagneticFieldGrid case, there is a single piece of code (using SSE) for all platforms.

Here are the results for the different methods in Similarity.cpp.

SSE4.2 (nehalem)
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
similarity_51_original<double>       4.52 ns         4.51 ns    154175915
similarity_51_VC<double>             5.86 ns         5.84 ns    120056749
similarity_55_original<double>       37.8 ns         37.7 ns     18550952
similarity_55_VC<double>             38.4 ns         38.3 ns     18344052
similarity_57_original<double>       69.3 ns         69.1 ns     10761154
similarity_57_VC<double>             69.2 ns         69.0 ns      9990288
average_original<double>              110 ns          110 ns      8600122
average_VC<double>                   83.1 ns         82.9 ns      7329628
filter_original<double>              26.3 ns         26.3 ns     26581108
filter_VC<double>                    26.1 ns         26.0 ns     26588054

AVX2 (haswell)
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
similarity_51_original<double>       5.03 ns         5.01 ns    133549995
similarity_51_VC<double>             5.09 ns         5.08 ns    138580997
similarity_55_original<double>       34.2 ns         34.1 ns     20485924
similarity_55_VC<double>             29.3 ns         29.2 ns     23953218
similarity_57_original<double>       53.2 ns         53.1 ns     13259869
similarity_57_VC<double>             53.1 ns         53.0 ns     13188235
average_original<double>             76.7 ns         76.5 ns     13418446
average_VC<double>                   52.2 ns         52.1 ns     10991279
filter_original<double>              29.7 ns         29.7 ns     23724897
filter_VC<double>                    30.8 ns         30.7 ns     22851429

AVX512 (cascadelake)
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
similarity_51_original<double>       4.88 ns         4.87 ns    127101152
similarity_51_VC<double>             5.11 ns         5.10 ns    134839807
similarity_55_original<double>       34.5 ns         34.4 ns     21153696
similarity_55_VC<double>             38.1 ns         38.0 ns     13325311
similarity_57_original<double>       49.9 ns         49.8 ns     13746851
similarity_57_VC<double>             50.1 ns         50.0 ns     14028532
average_original<double>             69.2 ns         69.1 ns     10000000
average_VC<double>                   49.0 ns         48.9 ns     14350849
filter_original<double>              30.4 ns         30.3 ns     23242475
filter_VC<double>                    32.4 ns         32.3 ns     21662432

Here is my summary:

2 methods are slightly slower (2-5% level): similarity_5_1 and filter, filter only on AVX2 and AVX512. Note that the SSE table above shows a larger gap for similarity_5_1 (4.52 ns vs 5.86 ns).
similarity_5_5 shows no significant change in SSE, a gain of 15% in AVX2 and a 10% loss in AVX512
similarity_5_7 did not change speed
average is substantially faster on all platforms with a gain of 25-30%
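The percentages in this summary can be recomputed from the tables. Below is a small sketch that does so for the AVX2 (haswell) rows. The timings are copied from the table above. The names `relative_change_percent`, `Row` and `print_relative_changes` are hypothetical helpers, not part of the benchmark code.

```cpp
#include <cstddef>
#include <cstdio>

// Relative change of the Vc timing with respect to the original (VCL) timing,
// in percent. Positive means the Vc version is slower, negative means faster.
constexpr double relative_change_percent(double original_ns, double vc_ns) {
  return 100.0 * (vc_ns - original_ns) / original_ns;
}

// One benchmark pair (hypothetical struct for illustration).
struct Row {
  const char* name;
  double original_ns;
  double vc_ns;
};

// "Time" column of the AVX2 (haswell) table, copied from the results above.
constexpr Row avx2_rows[] = {
    {"similarity_51", 5.03, 5.09},
    {"similarity_55", 34.2, 29.3},
    {"similarity_57", 53.2, 53.1},
    {"average", 76.7, 52.2},
    {"filter", 29.7, 30.8},
};

// Print one line per method with its relative change.
inline void print_relative_changes() {
  for (std::size_t i = 0; i < sizeof(avx2_rows) / sizeof(avx2_rows[0]); ++i) {
    const Row& r = avx2_rows[i];
    std::printf("%-16s %+6.1f %%\n", r.name,
                relative_change_percent(r.original_ns, r.vc_ns));
  }
}
```

For example, similarity_55 on AVX2 goes from 34.2 ns to 29.3 ns, a change of about -14%, which matches the "gain of 15%" quoted above to within rounding.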

Last point: all methods, except perhaps similarity_5_1, can be made much faster than they are with a bit of low-level optimization. I have not spent much time on this for now. Just tell me whether it's worth it.

And here are the results for the only method in MagneticFieldGrid.cpp.

SSE4.2 (nehalem)
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
interpolation_original<float>       3550 ns         3542 ns       196992
interpolation_VC<float>             3572 ns         3563 ns       196354

AVX2 (haswell)
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
interpolation_original<float>       3315 ns         3307 ns       211537
interpolation_VC<float>             3636 ns         3627 ns       192809

AVX512 (cascadelake)
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
interpolation_original<float>       3284 ns         3276 ns       213556
interpolation_VC<float>             3650 ns         3641 ns       191710

Basically no change on nehalem, but quite a loss on the AVX platforms (about 10%). Same question as for the previous part: is it worth investigating further? It is definitely possible to get back to the previous speed, if not exceed it, but it would take some time.

Edited 4 years ago by Marco Cattaneo

Activity

Sebastien Ponce mentioned in issue #91 (closed) 4 years ago


Sebastien Ponce @sponce · 4 years ago

Author Owner

\ci-test
Christopher Rob Jones @jonrob · 4 years ago

Maintainer

@graven FYI. @sponce Thanks for this. I think you used the wrong slash in the above, so it won't start. I'll do one myself. I'll also assign this to a nightlies slot so we can start to check throughput.
Christopher Rob Jones @jonrob · 4 years ago

Maintainer

Resolved 4 years ago by Sebastien Ponce

/ci-test --merge

Last reply by Sebastien Ponce 4 years ago
Christopher Rob Jones added lhcb-gaudi-head label 4 years ago

Software for LHCb @lhcbsoft · 4 years ago

Developer
[2020-06-25 12:31] Validation started with lhcb-master-mr#987

[2020-06-25 16:46] Validation started with lhcb-master-mr#988

[2020-06-27 00:12] Validation started with lhcb-sanitizers#608

[2020-06-27 18:51] Validation started with lhcb-master-mr#998

[2020-06-28 00:05] Validation started with lhcb-sanitizers#609

[2020-06-29 00:07] Validation started with lhcb-sanitizers#610

Edited 4 years ago by Software for LHCb
Marco Cattaneo added backport run2 label 4 years ago

Marco Cattaneo @cattanem · 4 years ago

Maintainer

I have added the backport run2 label because we will definitely need this in run2-patches as well.
Sebastien Ponce resolved all threads 4 years ago

Sebastien Ponce added 2 commits 4 years ago

25f07eb0 - Full removal of vectorclass from LHCbMath and Kernel

cbee7ca5 - Fixed formatting


Sebastien Ponce @sponce · 4 years ago

Author Owner

Resolved 4 years ago by Sebastien Ponce

/ci-test --merge

Last reply by Sebastien Ponce 4 years ago
Christopher Rob Jones removed lhcb-gaudi-head label 4 years ago

Christopher Rob Jones assigned to @sponce 4 years ago

Christopher Rob Jones added lhcb-sanitizers label 4 years ago

Sebastien Ponce added 4 commits 4 years ago

94bda5ca - Started replacement of vectorclass by Vc in LHCbMath

5f4125cf - Full removal of vectorclass from LHCbMath and Kernel

5c47720b - Removed vectorclass from MagneticFieldGrid

7f963ddb - Final removal of vectorclass from LHCb

Sebastien Ponce changed title from Dropped usage of VCL in LHCbKernel to Dropped usage of VCL in LHCb 4 years ago

Sebastien Ponce changed the description 4 years ago

Sebastien Ponce added 1 commit 4 years ago

b416d297 - Fixed formatting

Christopher Rob Jones @jonrob · 4 years ago

Maintainer

Resolved 4 years ago by Sebastien Ponce

/ci-test --merge

Last reply by Software for LHCb 4 years ago
Christopher Rob Jones removed lhcb-sanitizers label 4 years ago
