ATLASRECTS-5244: GSF, try to use SSE for the "find minimum *index*"
This relates to
I would like @ssnyder and @amorley to take a very close look. The short story is that this method seems to be the slowest method in GSF, at ~36% of the total time. GSF refits are the leading CPU consumer in egamma.
In reality one would expect that at least certain versions of finding the index of the minimum could be autovectorized, or at least optimised in a reasonable and consistent manner.
This seems not to be the case, or at least clang/gcc do different things for this code (even at -O3):
int32_t
GSFUtils::findMinimumIndexSTL(const floatPtrRestrict distancesIn, const int n)
{
float* array = (float*)__builtin_assume_aligned(distancesIn, alignment);
return std::distance(array, std::min_element(array, array + n));
}
@ssnyder is looking at this part and thinks that we might need to file a bug against gcc. Long story short, if you try the STL version vis-a-vis the scalar version of this MR, you can get the opposite picture of which is faster/slower on clang/gcc. Obviously, having this done by the compiler in a consistent manner is much better in the long term...
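For readers who want to reproduce the comparison, a plain scalar loop of the kind being contrasted with `std::min_element` could look like the sketch below. The name `findMinimumIndexScalar` and its signature are illustrative assumptions, not necessarily the code in this MR, and the sketch assumes `n >= 1`.

```cpp
#include <cstdint>

// Hypothetical scalar reference (name and signature are assumptions).
// Returns the index of the first occurrence of the smallest value.
// Assumes n >= 1 and that the input contains no NaN.
inline int32_t findMinimumIndexScalar(const float* distances, int n)
{
  int32_t minIndex = 0;
  float minValue = distances[0];
  for (int i = 1; i < n; ++i) {
    // Strict comparison keeps the earliest index on ties,
    // matching what std::min_element returns.
    if (distances[i] < minValue) {
      minValue = distances[i];
      minIndex = i;
    }
  }
  return minIndex;
}
```

Compiling both this and the STL variant with gcc and clang at the same optimisation level, and inspecting the generated assembly, is one way to see where the two compilers diverge.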
Anyhow, on the same micro-benchmarks where we discovered the above gcc vs clang difference, we have tried out certain SSE solutions (which so far seem to behave the same for gcc and clang, as they are written via intrinsics). This brings in one of them as a first try.
I have tried to hide everything in the implementation file. When SSE2 is not available we can fall back to the scalar implementation. I also added the STL way.
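To illustrate the general idea (not the exact code of this MR), a minimal SSE2 sketch with a scalar fallback is shown below. All names (`findMinimumIndexSketch`, `scalarFallback`) are hypothetical, and the sketch uses unaligned loads so it makes no assumption about the alignment of the input; it assumes `n >= 1` and no NaN values.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical scalar fallback, also used for very short inputs.
inline int32_t scalarFallback(const float* d, int n)
{
  int32_t best = 0;
  for (int i = 1; i < n; ++i) {
    if (d[i] < d[best]) best = i;
  }
  return best;
}

// Hypothetical sketch: 4 running minima and their indices, one per lane.
inline int32_t findMinimumIndexSketch(const float* d, int n)
{
#if defined(__SSE2__)
  if (n < 4) return scalarFallback(d, n);
  __m128 minValues = _mm_loadu_ps(d);
  __m128i minIndices = _mm_setr_epi32(0, 1, 2, 3);
  __m128i indices = minIndices;
  const __m128i four = _mm_set1_epi32(4);
  int i = 4;
  for (; i + 4 <= n; i += 4) {
    indices = _mm_add_epi32(indices, four);
    const __m128 values = _mm_loadu_ps(d + i);
    // Per-lane mask: all ones where the new value is strictly smaller.
    const __m128i lt = _mm_castps_si128(_mm_cmplt_ps(values, minValues));
    // SSE2 has no blend instruction: (mask & new) | (~mask & old).
    minIndices = _mm_or_si128(_mm_and_si128(lt, indices),
                              _mm_andnot_si128(lt, minIndices));
    minValues = _mm_min_ps(values, minValues);
  }
  // Horizontal reduction over the 4 lanes; ties go to the smaller index.
  alignas(16) float vals[4];
  alignas(16) int32_t idx[4];
  _mm_store_ps(vals, minValues);
  _mm_store_si128(reinterpret_cast<__m128i*>(idx), minIndices);
  int32_t best = idx[0];
  float bestVal = vals[0];
  for (int lane = 1; lane < 4; ++lane) {
    if (vals[lane] < bestVal || (vals[lane] == bestVal && idx[lane] < best)) {
      bestVal = vals[lane];
      best = idx[lane];
    }
  }
  // Scalar tail for the remaining n % 4 elements.
  for (; i < n; ++i) {
    if (d[i] < bestVal) {
      bestVal = d[i];
      best = i;
    }
  }
  return best;
#else
  return scalarFallback(d, n);
#endif
}
```

Because the whole body sits behind the `__SSE2__` check, callers see a single function regardless of the target, which is the same "hide it in the implementation file" approach described above.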
Tests:
- RunTier0 log : RunTier0Tests.log
Profiles:
- It seems that even finding one minimum, rather than a pair, should make quickCloseComponents faster.
- There might be some further things to do here.
- @amorley note that indeed we call findMinimumindex ~1.8x more often, but it still ends up costing ~2/3 of the initial time. So this is consistent with being ~3x faster per call, which is what I see in my benchmarks with the compiler options we use in ATLAS.
- @ssnyder as you know, I have tried a couple of variations. Perhaps we could try to improve on the 3x here in the future, as what I have done is not the most involved/clever way possible...
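
As a rough cross-check of the per-call estimate quoted for @amorley above, taking the approximate figures at face value (total time ratio of about 2/3, call count ratio of about 1.8; the symbols T and N are introduced here only for illustration):

```latex
\frac{t_{\mathrm{new}}}{t_{\mathrm{old}}}
  = \frac{T_{\mathrm{new}}/N_{\mathrm{new}}}{T_{\mathrm{old}}/N_{\mathrm{old}}}
  = \frac{T_{\mathrm{new}}/T_{\mathrm{old}}}{N_{\mathrm{new}}/N_{\mathrm{old}}}
  \approx \frac{2/3}{1.8} \approx 0.37
```

That is, each call costs roughly 0.37 of what it did before, i.e. about 2.7x faster per call, which is in line with the ~3x seen in the micro-benchmarks.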