Document/Test/Fix the "stability" of find index of minimum impl
Intro
We are using a vectorized
implementation .
This is quite faster than the STL.
Micro-Benchmarking can be found in the github repo mentioned in !67937 (merged).
But things can be "really faster".
The issue
There is one issue that in theory if we have the same entry 2 times .
The "STL"/"C" serial will return the 1st index.
The Vec
implementations as we have it will return the index of any of the 2 depending on how they "fall".
To be clear . This is not due to !67937 (merged). But a pre-existing issue (from the 1st implementation)I figured is still there when I was transfering code from my private repo to ATLAS.
For the !67937 (merged). I left it alone. This is the fix I mentioned to the above MR (but was not part of it)
Plan
The plan is to do commits with the STL
C
stable vec
and a fixed version and establish they change the same exact 1 entry. All of them will do .
Let me ping @nstyles @jcatmore @jchapman @jdandoy . There will be one entry ouput changes that we might need to have. As is much better to have "stable output"
Timings
InputRDOFile="/cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/CampaignInputs/mc20/RDO/mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.recon.AOD.e6337_s3681_r13145/100events.RDO.pool.root"
export ATHENA_CORE_NUMBER=1
Reco_tf.py \
--CA \
--inputRDOFile $InputRDOFile \
--outputAODFile myAOD.pool.root \
--preInclude 'egammaConfig.egammaOnlyFromRawFlags.egammaOnlyFromRaw' \
--autoConfiguration 'everything' \
--maxEvents '100' \
--perfmon 'fullmonmt'
For context this part uses to cost a significant amount . Currenlty using the optimised implementation is 8% of the total time or lets say 5440 ms out of the ~ 6800 ms it takes for the 100 ttbar
The micro - benchmarking (https://github.com/AnChristos/FindIdxOfMinimum/) gives
Benchmark | Time | CPU | Iterations |
---|---|---|---|
findMinimumIndexC/2048. | 2687 ns | 2678 ns | 260374 |
findMinimumIndexSTL/2048 | 6111 ns | 6093 ns | 114311 |
findMinimumIndexVecBlend/2048 | 266 ns | 265 ns | 2634353 |
findMinimumIndexVecUnordered/2048 | 111 ns | 111 ns | 6320537. |
findMinimumIndexVecBlend/2048 (stabilized) | 296 ns | 296 ns | 2634353 |
There are some subtle details as we benchamark "random" . The unordered get a bit of extra boost here and actually is a bit slower than Blend < 256 elements. In production the number of elements decreases as we merge (we start from 2556 but we gradually after 60 iterations we have 78).
So what matter more is the numbers below which are total times for the EMBremCollection Builder (slowest egamma one)
Impl. | Total times for EMBremcollectionBuilder Time 99 ttbar ms | Ratio | Notes |
---|---|---|---|
VecBlend (current) | 6856.06 | 1 | Will not always return the index of 1st minima if 2 same values |
STL | 13299.11 | 1.9 | Will return the 1st. Changes 1 entry in the CI |
C | 9518.07 | 1.45 | Will return the 1st. Changes 1 entry in the CI |
VecUnordered | 6851.68 | 1 | Will return the 1st. Changes 1 entry in the CI |
VBlend stabilized | 6855.76 | 1 | Will return the 1st. Changes 1 entry in the CI |
Suggestion
I would consider the "instability" a bug in the sense it will not allows us to change impl in case things improve in the STL side in the feature. So I would go with the stabilizing the vecBlend. And also make the test being able to catch potentially this kind of "issue".