KLGaussianMixtureReduction: use the 16-elements-at-a-time implementation for findMinimumIndex (!49696) · Merge requests · atlas / athena

Christos Anastopoulos requested to merge ATLAS-EGamma/athena:Component1D_add_setter into master Jan 15, 2022

KLGaussianMixtureReduction: use the implementation of findMinimumIndex that processes 16 (instead of 8) elements at a time. We seem to be able to gain a bit more here.

Beyond that, the gains seem to flatten, at least on the 3 machines where I benchmarked this (plus callgrind) and at least up to SSE4.2.

For documentation and posterity, as I think I never wrote this down:

The first implementation committed used intrinsics; see e.g. https://gitlab.cern.ch/atlas/athena/-/blob/606241a2519aed190aa21effcdd7d91094d8a616/Tracking/TrkFitter/TrkGaussianSumFilter/src/KLGaussianMixtureReduction.cxx#L179. What we have now is much cleaner thanks to CxxUtils::vec.

With gcc 11, we expect the code inside the loop to compile to the following on x86-64 SSE4.1 (this is what will run on 99%+ of ATLAS machines currently):

  movaps xmm0, XMMWORD PTR [rax]
  paddd xmm7, xmm2
  paddd xmm8, xmm2
  add rax, 64
  paddd xmm9, xmm2
  paddd xmm10, xmm2
  cmpltps xmm0, xmm6
  blendvps xmm6, XMMWORD PTR [rax-64], xmm0
  pblendvb xmm13, xmm7, xmm0
  movaps xmm0, XMMWORD PTR [rax-48]
  cmpltps xmm0, xmm3
  blendvps xmm3, XMMWORD PTR [rax-48], xmm0
  pblendvb xmm12, xmm8, xmm0
  movaps xmm0, XMMWORD PTR [rax-32]
  cmpltps xmm0, xmm5
  blendvps xmm5, XMMWORD PTR [rax-32], xmm0
  pblendvb xmm11, xmm9, xmm0
  movaps xmm0, XMMWORD PTR [rax-16]
  cmpltps xmm0, xmm1
  blendvps xmm1, XMMWORD PTR [rax-16], xmm0
  pblendvb xmm4, xmm10, xmm0
  cmp rdx, rax

This maps exactly to the CxxUtils::vec methods in the code, which is good.

For ARM64 (which we get readily, since we use CxxUtils::vec and not intrinsics) this looks like:

  ldp q28, q26, [x1]
  add v17.4s, v17.4s, v4.4s
  add v18.4s, v18.4s, v4.4s
  add v19.4s, v19.4s, v4.4s
  ldp q24, q22, [x1, 32]
  add v20.4s, v20.4s, v4.4s
  add x1, x1, 64
  fcmgt v27.4s, v0.4s, v28.4s
  fcmgt v25.4s, v3.4s, v26.4s
  fcmgt v23.4s, v1.4s, v24.4s
  fcmgt v21.4s, v2.4s, v22.4s
  bit v16.16b, v17.16b, v27.16b
  bit v0.16b, v28.16b, v27.16b
  bit v6.16b, v18.16b, v25.16b
  bit v3.16b, v26.16b, v25.16b
  bit v7.16b, v19.16b, v23.16b
  bit v1.16b, v24.16b, v23.16b
  bit v5.16b, v20.16b, v21.16b
  bit v2.16b, v22.16b, v21.16b
  cmp x0, x1
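
To make the correspondence with the two listings above easier to follow, here is a minimal scalar sketch of the same per-lane logic. It is not the code in this MR: it uses plain arrays instead of CxxUtils::vec, the name findMinimumIndexSketch is hypothetical, and it assumes that n is positive and a multiple of 16. The 16 "lanes" stand for the 4 accumulators of 4 floats each that appear in the assembly.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not the ATLAS implementation.
// Assumption: n > 0 and n is a multiple of 16.
inline std::int32_t findMinimumIndexSketch(const float* data, int n)
{
  assert(n > 0 && n % 16 == 0);
  constexpr int kLanes = 16; // 4 accumulators x 4 float lanes

  std::array<float, kLanes> minValues{};
  std::array<std::int32_t, kLanes> minIndices{};
  std::array<std::int32_t, kLanes> indices{};

  // Seed every lane with the first 16 elements.
  for (int l = 0; l < kLanes; ++l) {
    minValues[l] = data[l];
    minIndices[l] = l;
    indices[l] = l;
  }

  // Main loop: 16 elements per iteration.
  for (int i = kLanes; i < n; i += kLanes) {
    for (int l = 0; l < kLanes; ++l) {
      indices[l] += kLanes;                            // paddd    / add
      const float value = data[i + l];                 // movaps   / ldp
      const bool lt = value < minValues[l];            // cmpltps  / fcmgt
      minValues[l] = lt ? value : minValues[l];        // blendvps / bit
      minIndices[l] = lt ? indices[l] : minIndices[l]; // pblendvb / bit
    }
  }

  // Final scalar reduction over the 16 per-lane minima.
  int best = 0;
  for (int l = 1; l < kLanes; ++l) {
    if (minValues[l] < minValues[best]) {
      best = l;
    }
  }
  return minIndices[best];
}
```

Each lane keeps its own running minimum and the index where it occurred, so the loop body has no dependency between lanes; only the short reduction at the end compares across lanes. With equal minimum values in different lanes, this sketch returns the index held by the lowest lane, which is not necessarily the smallest index.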

Edited Jan 15, 2022 by Christos Anastopoulos
