KLGaussianMixtureReduction: use the 16-elements-at-a-time implementation for findMinimumIndex
KLGaussianMixtureReduction: use the findMinimumIndex implementation that processes 16 (instead of 8) elements at a time. We seem to be able to gain a bit more here.
Beyond that, things seem to flatten out, at least on the 3 machines I benchmarked this on (plus callgrind), and at least up to SSE4.2.
For documentation and posterity, as I think I never wrote this down:
The first implementation committed used intrinsics.
See e.g. https://gitlab.cern.ch/atlas/athena/-/blob/606241a2519aed190aa21effcdd7d91094d8a616/Tracking/TrkFitter/TrkGaussianSumFilter/src/KLGaussianMixtureReduction.cxx#L179
What we have now is much cleaner thanks to CxxUtils::vec.
We expect the code inside the loop to compile to the following (gcc11) on x86-64 with SSE4.1 (this is what will run on 99%+ of ATLAS machines currently):
movaps xmm0, XMMWORD PTR [rax]
paddd xmm7, xmm2
paddd xmm8, xmm2
add rax, 64
paddd xmm9, xmm2
paddd xmm10, xmm2
cmpltps xmm0, xmm6
blendvps xmm6, XMMWORD PTR [rax-64], xmm0
pblendvb xmm13, xmm7, xmm0
movaps xmm0, XMMWORD PTR [rax-48]
cmpltps xmm0, xmm3
blendvps xmm3, XMMWORD PTR [rax-48], xmm0
pblendvb xmm12, xmm8, xmm0
movaps xmm0, XMMWORD PTR [rax-32]
cmpltps xmm0, xmm5
blendvps xmm5, XMMWORD PTR [rax-32], xmm0
pblendvb xmm11, xmm9, xmm0
movaps xmm0, XMMWORD PTR [rax-16]
cmpltps xmm0, xmm1
blendvps xmm1, XMMWORD PTR [rax-16], xmm0
pblendvb xmm4, xmm10, xmm0
cmp rdx, rax
This maps exactly to the CxxUtils::vec methods in the code, which is good.
For ARM64 (since we use CxxUtils::vec and not intrinsics, we get it readily) this looks like:
ldp q28, q26, [x1]
add v17.4s, v17.4s, v4.4s
add v18.4s, v18.4s, v4.4s
add v19.4s, v19.4s, v4.4s
ldp q24, q22, [x1, 32]
add v20.4s, v20.4s, v4.4s
add x1, x1, 64
fcmgt v27.4s, v0.4s, v28.4s
fcmgt v25.4s, v3.4s, v26.4s
fcmgt v23.4s, v1.4s, v24.4s
fcmgt v21.4s, v2.4s, v22.4s
bit v16.16b, v17.16b, v27.16b
bit v0.16b, v28.16b, v27.16b
bit v6.16b, v18.16b, v25.16b
bit v3.16b, v26.16b, v25.16b
bit v7.16b, v19.16b, v23.16b
bit v1.16b, v24.16b, v23.16b
bit v5.16b, v20.16b, v21.16b
bit v2.16b, v22.16b, v21.16b
cmp x0, x1
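
For readers without the source at hand, the loop structure behind both listings can be sketched roughly as below. This is an illustration only, not the code in KLGaussianMixtureReduction.cxx: the name findMinimumIndexSketch and the Lanes alias are hypothetical, std::array stands in for a 4-lane CxxUtils::vec so that the snippet compiles standalone, and the input length is assumed to be a positive multiple of 16.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one 4-lane SIMD register
// (the real code uses CxxUtils::vec for this).
template <typename T>
using Lanes = std::array<T, 4>;

// Returns the index of the smallest element of data[0..n).
// Assumption: n is a positive multiple of 16.
// On ties the returned index is not necessarily the lowest one.
inline std::int32_t findMinimumIndexSketch(const float* data, std::int32_t n) {
  assert(n > 0 && n % 16 == 0);
  constexpr int kVecs = 4;  // 4 registers x 4 lanes = 16 elements per iteration

  Lanes<float> minValues[kVecs];
  Lanes<std::int32_t> minIndices[kVecs];
  Lanes<std::int32_t> indices[kVecs];

  // Seed the running minima with the first 16 elements.
  for (int v = 0; v < kVecs; ++v) {
    for (int l = 0; l < 4; ++l) {
      const std::int32_t idx = 4 * v + l;
      minValues[v][l] = data[idx];
      minIndices[v][l] = idx;
      indices[v][l] = idx;
    }
  }

  // Main loop: four independent compare-and-select chains per iteration.
  for (std::int32_t i = 16; i < n; i += 16) {
    for (int v = 0; v < kVecs; ++v) {
      for (int l = 0; l < 4; ++l) {
        indices[v][l] += 16;                      // paddd / add
        const float value = data[i + 4 * v + l];  // movaps / ldp
        const bool lt = value < minValues[v][l];  // cmpltps / fcmgt
        // Branch-free selects: blendvps + pblendvb on x86-64, bit on ARM64.
        minValues[v][l] = lt ? value : minValues[v][l];
        minIndices[v][l] = lt ? indices[v][l] : minIndices[v][l];
      }
    }
  }

  // Horizontal reduction of the 16 per-lane minima to a single index.
  float best = minValues[0][0];
  std::int32_t bestIndex = minIndices[0][0];
  for (int v = 0; v < kVecs; ++v) {
    for (int l = 0; l < 4; ++l) {
      if (minValues[v][l] < best) {
        best = minValues[v][l];
        bestIndex = minIndices[v][l];
      }
    }
  }
  return bestIndex;
}
```

Keeping four separate (value, index) accumulators means the four compare-and-select chains do not depend on each other inside the loop, which is consistent with the four cmpltps/blendvps/pblendvb groups (and the four fcmgt plus eight bit instructions) seen above.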