ATLASRECTS-5244: GSF, try to use SSE for the "find minimum *index*"

Christos Anastopoulos requested to merge ATLAS-EGamma/athena:SSE_For_GSF into master

This relates to

I would like @ssnyder and @amorley to take a very close look. The short story is that this method seems to be the slowest method in GSF, at ~36% of the total time. GSF refits are the leading CPU consumer in egamma.

In principle one would expect that at least certain versions of finding the index of the minimum could be autovectorized, or at least optimised in a reasonable and consistent manner.

This seems not to be the case, or at least clang and gcc do different things for this code (even at -O3):

int32_t
GSFUtils::findMinimumIndexSTL(const floatPtrRestrict distancesIn, const int n)
{
  // Tell the compiler that the input is aligned to `alignment` bytes
  float* array = (float*)__builtin_assume_aligned(distancesIn, alignment);
  // Index of the (first) smallest element
  return std::distance(array, std::min_element(array, array + n));
}

@ssnyder is looking at this part and thinks that we might need to file a bug against gcc. Long story short, if you compare the STL version with the scalar version of this MR, you can get the opposite picture of which is faster/slower on clang vs gcc. Obviously having this done by the compiler in a consistent manner is much better long term...

Anyhow, on the same micro-benchmarks where we discovered the above difference between gcc and clang, we have tried out certain SSE solutions (which so far seem to behave the same for gcc and clang, as they are written via intrinsics). This MR brings in one of them as a first try.

I have tried to hide everything in the implementation file. If SSE2 is not available we can fall back to the scalar implementation. I have also added the STL way.
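For readers who want a concrete picture of the "SSE2 with scalar fallback" idea, here is a minimal sketch. It is not the code in this MR: the names `findMinimumIndexScalar`, `findMinimumIndexSSE2` and the dispatching `findMinimumIndex` are hypothetical, it uses unaligned loads rather than assuming alignment, and it assumes `n > 0` and no NaNs in the input. It tracks four running minima and their indices, then reduces across the lanes so that ties resolve to the first occurrence, as `std::min_element` does.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical scalar fallback: index of the first smallest element.
inline int32_t findMinimumIndexScalar(const float* distances, int n) {
  assert(n > 0);
  int32_t minIndex = 0;
  float minValue = distances[0];
  for (int i = 1; i < n; ++i) {
    if (distances[i] < minValue) {
      minValue = distances[i];
      minIndex = i;
    }
  }
  return minIndex;
}

#if defined(__SSE2__)
// Hypothetical SSE2 variant (illustrative only, not the MR implementation).
inline int32_t findMinimumIndexSSE2(const float* distances, int n) {
  assert(n > 0);
  if (n < 4) {
    return findMinimumIndexScalar(distances, n);
  }
  const int nVec = n & ~3; // largest multiple of 4 not exceeding n
  __m128 minValues = _mm_loadu_ps(distances);
  __m128i indices = _mm_setr_epi32(0, 1, 2, 3);
  __m128i minIndices = indices;
  const __m128i four = _mm_set1_epi32(4);
  for (int i = 4; i < nVec; i += 4) {
    indices = _mm_add_epi32(indices, four);
    const __m128 values = _mm_loadu_ps(distances + i);
    // Lane mask: all ones where the new value is strictly smaller.
    const __m128i lt = _mm_castps_si128(_mm_cmplt_ps(values, minValues));
    minValues = _mm_min_ps(values, minValues);
    // SSE2 has no blend instruction, so select with and/andnot/or.
    minIndices = _mm_or_si128(_mm_and_si128(lt, indices),
                              _mm_andnot_si128(lt, minIndices));
  }
  // Horizontal reduction over the 4 lanes; ties go to the lower index.
  alignas(16) float vals[4];
  alignas(16) int32_t idxs[4];
  _mm_store_ps(vals, minValues);
  _mm_store_si128(reinterpret_cast<__m128i*>(idxs), minIndices);
  float minValue = vals[0];
  int32_t minIndex = idxs[0];
  for (int lane = 1; lane < 4; ++lane) {
    if (vals[lane] < minValue ||
        (vals[lane] == minValue && idxs[lane] < minIndex)) {
      minValue = vals[lane];
      minIndex = idxs[lane];
    }
  }
  // Scalar tail for the remaining n % 4 elements.
  for (int i = nVec; i < n; ++i) {
    if (distances[i] < minValue) {
      minValue = distances[i];
      minIndex = i;
    }
  }
  return minIndex;
}
#endif

// Hypothetical dispatcher: use SSE2 when available, otherwise scalar.
inline int32_t findMinimumIndex(const float* distances, int n) {
#if defined(__SSE2__)
  return findMinimumIndexSSE2(distances, n);
#else
  return findMinimumIndexScalar(distances, n);
#endif
}
```

Keeping the dispatch behind a single function is one way to hide the intrinsics from callers; whether a variant like this beats the compiler-generated code would need to be checked with the actual benchmarks and compiler options.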

Tests:

Profiles:

  • Reference Ref

  • With this MR New

  • It seems that even finding a single minimum, rather than a pair, should make quickCloseComponents faster.

  • There might be some further things to do here.

  • @amorley note that we do indeed call findMinimumIndex ~1.8x more often, but it still ends up costing ~2/3 of the initial time. That is (2/3)/1.8 ≈ 0.37 of the original cost per call, so it is consistent with being ~3x faster per call, which is what I see in my benchmarks with the compiler options we use in ATLAS.

  • @ssnyder as you know, I have tried a couple of variations. Perhaps we could try to improve on the 3x here in the future, as what I have done is not the most involved/clever way possible...

Edited by Christos Anastopoulos