KLGaussianMixtureReduction: use the 16-elements-at-a-time implementation for findMinimumIndex
KLGaussianMixtureReduction: use the implementation of findMinimumIndex that processes 16 (instead of 8) elements at a time. We seem to be able to gain a bit more here.
Beyond that, the gains seem to flatten on the 3 machines I benchmarked (also checked with callgrind), at least up to SSE4.2.
For documentation and posterity, as I think I never wrote this down:
The first implementation committed used intrinsics back then.
E.g. see: https://gitlab.cern.ch/atlas/athena/-/blob/606241a2519aed190aa21effcdd7d91094d8a616/Tracking/TrkFitter/TrkGaussianSumFilter/src/KLGaussianMixtureReduction.cxx#L179 What we have now is much cleaner due to
For the code inside the loop, we expect it to compile (gcc11) to the following on x86-64 SSE4.1 (this is what currently runs on 99%+ of ATLAS machines):
    movaps xmm0, XMMWORD PTR [rax]
    paddd xmm7, xmm2
    paddd xmm8, xmm2
    add rax, 64
    paddd xmm9, xmm2
    paddd xmm10, xmm2
    cmpltps xmm0, xmm6
    blendvps xmm6, XMMWORD PTR [rax-64], xmm0
    pblendvb xmm13, xmm7, xmm0
    movaps xmm0, XMMWORD PTR [rax-48]
    cmpltps xmm0, xmm3
    blendvps xmm3, XMMWORD PTR [rax-48], xmm0
    pblendvb xmm12, xmm8, xmm0
    movaps xmm0, XMMWORD PTR [rax-32]
    cmpltps xmm0, xmm5
    blendvps xmm5, XMMWORD PTR [rax-32], xmm0
    pblendvb xmm11, xmm9, xmm0
    movaps xmm0, XMMWORD PTR [rax-16]
    cmpltps xmm0, xmm1
    blendvps xmm1, XMMWORD PTR [rax-16], xmm0
    pblendvb xmm4, xmm10, xmm0
    cmp rdx, rax
This maps exactly to the CxxUtils::vec methods in the code, which is good.
For ARM64 (since we use CxxUtils::vec and not intrinsics, we can readily get it) this looks like:
    ldp q28, q26, [x1]
    add v17.4s, v17.4s, v4.4s
    add v18.4s, v18.4s, v4.4s
    add v19.4s, v19.4s, v4.4s
    ldp q24, q22, [x1, 32]
    add v20.4s, v20.4s, v4.4s
    add x1, x1, 64
    fcmgt v27.4s, v0.4s, v28.4s
    fcmgt v25.4s, v3.4s, v26.4s
    fcmgt v23.4s, v1.4s, v24.4s
    fcmgt v21.4s, v2.4s, v22.4s
    bit v16.16b, v17.16b, v27.16b
    bit v0.16b, v28.16b, v27.16b
    bit v6.16b, v18.16b, v25.16b
    bit v3.16b, v26.16b, v25.16b
    bit v7.16b, v19.16b, v23.16b
    bit v1.16b, v24.16b, v23.16b
    bit v5.16b, v20.16b, v21.16b
    bit v2.16b, v22.16b, v21.16b
    cmp x0, x1
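
Both listings show the same per-iteration pattern: load four vectors of four floats, advance four index vectors, compare against four running minima, then blend in the new values and indices where the comparison holds. To make that structure easier to follow, here is a minimal portable sketch in plain C++. It is not the code in KLGaussianMixtureReduction.cxx and does not use CxxUtils::vec. The name findMinimumIndexSketch, the std::array lanes, the index stride of 16, and the tie-breaking in the final reduction are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only, not the ATLAS implementation.
// Returns the index of the smallest element of data[0..n).
// Assumes n is a positive multiple of 16.
inline std::int32_t findMinimumIndexSketch(const float* data, std::size_t n) {
  assert(n >= 16 && n % 16 == 0);
  constexpr int kLanes = 4;  // floats per 128-bit vector
  constexpr int kAccs = 4;   // independent accumulators: 4 x 4 = 16 per iteration
  using VecF = std::array<float, kLanes>;
  using VecI = std::array<std::int32_t, kLanes>;

  VecF minValues[kAccs];
  VecI minIndices[kAccs];
  VecI indices[kAccs];

  // Seed the running minima with the first 16 elements.
  for (int a = 0; a < kAccs; ++a) {
    for (int l = 0; l < kLanes; ++l) {
      const int i = a * kLanes + l;
      minValues[a][l] = data[i];
      minIndices[a][l] = i;
      indices[a][l] = i;
    }
  }

  // Main loop: 16 elements per iteration.
  for (std::size_t i = 16; i < n; i += 16) {
    for (int a = 0; a < kAccs; ++a) {
      for (int l = 0; l < kLanes; ++l) {
        indices[a][l] += 16;  // index update (paddd / add), stride assumed
        const float v = data[i + static_cast<std::size_t>(a * kLanes + l)];
        const bool lt = v < minValues[a][l];  // compare (cmpltps / fcmgt)
        // Select new value and index where lt holds (blendvps, pblendvb / bit).
        minValues[a][l] = lt ? v : minValues[a][l];
        minIndices[a][l] = lt ? indices[a][l] : minIndices[a][l];
      }
    }
  }

  // Horizontal reduction over the 16 running minima.
  // Ties are resolved towards the lowest index (a choice made for this sketch).
  float best = minValues[0][0];
  std::int32_t bestIdx = minIndices[0][0];
  for (int a = 0; a < kAccs; ++a) {
    for (int l = 0; l < kLanes; ++l) {
      const float v = minValues[a][l];
      const std::int32_t idx = minIndices[a][l];
      if (v < best || (v == best && idx < bestIdx)) {
        best = v;
        bestIdx = idx;
      }
    }
  }
  return bestIdx;
}
```

Keeping four independent accumulators means the compare and blend chains of one accumulator do not depend on those of the others, which is consistent with the four separate register groups visible in both listings.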