Dropped usage of VectorClass
This goes together with Rec!2128 (merged).
Besides cleanup of commented code, the major part is the replacement of VCL by Vc in the Similarity.cpp and MagneticFieldGrid.cpp files. This is not necessarily ideal, as more generic code would have been better, but the investment would have been large, so I stopped there for the moment. The immediate goal was to avoid porting the code to VCL2.0.
Concerning speed, all the new code has been benchmarked with Google Benchmark and compared to the previous version for the SSE, AVX2 and AVX512 builds. Note that there is no change for the scalar code, and that there is no specific AVX512 implementation: the AVX2 code is used, but it runs on a different processor, leading to different results. Moreover, for the MagneticFieldGrid case, there is a single piece of code (using SSE) for all platforms.
Here are the results for the different methods in Similarity.cpp.
SSE4.2 (nehalem)
-------------------------------------------------------------------------
Benchmark                                Time           CPU    Iterations
-------------------------------------------------------------------------
similarity_51_original<double>        4.52 ns       4.51 ns     154175915
similarity_51_VC<double>              5.86 ns       5.84 ns     120056749
similarity_55_original<double>        37.8 ns       37.7 ns      18550952
similarity_55_VC<double>              38.4 ns       38.3 ns      18344052
similarity_57_original<double>        69.3 ns       69.1 ns      10761154
similarity_57_VC<double>              69.2 ns       69.0 ns       9990288
average_original<double>               110 ns        110 ns       8600122
average_VC<double>                    83.1 ns       82.9 ns       7329628
filter_original<double>               26.3 ns       26.3 ns      26581108
filter_VC<double>                     26.1 ns       26.0 ns      26588054
AVX2 (haswell)
-------------------------------------------------------------------------
Benchmark                                Time           CPU    Iterations
-------------------------------------------------------------------------
similarity_51_original<double>        5.03 ns       5.01 ns     133549995
similarity_51_VC<double>              5.09 ns       5.08 ns     138580997
similarity_55_original<double>        34.2 ns       34.1 ns      20485924
similarity_55_VC<double>              29.3 ns       29.2 ns      23953218
similarity_57_original<double>        53.2 ns       53.1 ns      13259869
similarity_57_VC<double>              53.1 ns       53.0 ns      13188235
average_original<double>              76.7 ns       76.5 ns      13418446
average_VC<double>                    52.2 ns       52.1 ns      10991279
filter_original<double>               29.7 ns       29.7 ns      23724897
filter_VC<double>                     30.8 ns       30.7 ns      22851429
AVX512 (cascadelake)
-------------------------------------------------------------------------
Benchmark                                Time           CPU    Iterations
-------------------------------------------------------------------------
similarity_51_original<double>        4.88 ns       4.87 ns     127101152
similarity_51_VC<double>              5.11 ns       5.10 ns     134839807
similarity_55_original<double>        34.5 ns       34.4 ns      21153696
similarity_55_VC<double>              38.1 ns       38.0 ns      13325311
similarity_57_original<double>        49.9 ns       49.8 ns      13746851
similarity_57_VC<double>              50.1 ns       50.0 ns      14028532
average_original<double>              69.2 ns       69.1 ns      10000000
average_VC<double>                    49.0 ns       48.9 ns      14350849
filter_original<double>               30.4 ns       30.3 ns      23242475
filter_VC<double>                     32.4 ns       32.3 ns      21662432
Here is my summary:
- 2 methods are slightly slower: similarity_5_1 (1-5% on AVX2 and AVX512, but about 30% on SSE4.2, 4.52 ns vs 5.86 ns) and filter (4-7%, only on AVX2 and AVX512)
- similarity_5_5 has no change in SSE, a gain of 15% in AVX2 and a 10% loss in AVX512
- similarity_5_7 did not change speed
- average is substantially faster on all platforms, with a gain of 25-30%
Last point: all methods, except maybe similarity_5_1, can be made much faster than they are now with a bit of low-level optimization. I have not spent much time on this so far. Just tell me whether it's worth it.
And here are the results for the only method in MagneticFieldGrid.cpp:
SSE4.2 (nehalem)
------------------------------------------------------------------------
Benchmark                               Time           CPU    Iterations
------------------------------------------------------------------------
interpolation_original<float>        3550 ns       3542 ns        196992
interpolation_VC<float>              3572 ns       3563 ns        196354
AVX2 (haswell)
------------------------------------------------------------------------
Benchmark                               Time           CPU    Iterations
------------------------------------------------------------------------
interpolation_original<float>        3315 ns       3307 ns        211537
interpolation_VC<float>              3636 ns       3627 ns        192809
AVX512 (cascadelake)
------------------------------------------------------------------------
Benchmark                               Time           CPU    Iterations
------------------------------------------------------------------------
interpolation_original<float>        3284 ns       3276 ns        213556
interpolation_VC<float>              3650 ns       3641 ns        191710
Basically no change on nehalem, but quite a loss on the AVX platforms (about 10%). Same question as for the previous part: is it worth investigating further? It is definitely possible to get back to the previous speed, if not exceed it, but that will take some time.