Observed in LHCb!4129 (comment 6779950). The throughput was halved because one of the two jobs crashed. The logs are here and the relevant stack trace is in crash.txt.
diElectrons is reserved with electrons.size() * electrons.size() entries, which should be more than the possible number of combinations, since we loop from 0 to n for the first electron (e1) and from e1+1 to n for the second one.
Then vtx is filled by auto combSc = m_combiner->combine( {e1.get(), e2.get()}, *diElec, *vtx, *lhcb.geometry() ); and it is explicitly checked that the combiner succeeds before adding vtx to vertices.
Any idea what could be going wrong? @graven @mveghel @mramospe I am adding a simplified version of the full code logic below in case it helps:
```cpp
std::tuple<LHCb::Particles, LHCb::Vertices, LHCb::Particles>
FunctionalDiElectronMaker::operator()( LHCb::Particle::Range const& electrons,
                                       DetectorElement const&       lhcb ) const {
  // output containers
  auto result = std::tuple<LHCb::Particles, LHCb::Vertices, LHCb::Particles>{};
  auto& [diElectrons, vertices, children] = result;
  diElectrons.reserve( electrons.size() * electrons.size() ); // upper limit
  vertices.reserve( diElectrons.size() );
  children.reserve( diElectrons.size() * 2 );
  // (... debug messages ...)
  for ( auto i1 = electrons.begin(); i1 != electrons.end(); ++i1 ) {
    // (... check i1 ...)
    for ( auto i2 = i1 + 1; i2 != electrons.end(); ++i2 ) {
      // (... check i2 ...)
      // clone electrons so we only modify the clone
      std::unique_ptr<LHCb::Particle> e1( ( *i1 )->clone() );
      std::unique_ptr<LHCb::Particle> e2( ( *i2 )->clone() );
      // (... apply brem correction and some cuts ...)
      // combine electrons
      auto diElec = std::make_unique<LHCb::Particle>( m_particle_prop->particleID() );
      auto vtx    = std::make_unique<LHCb::Vertex>();
      auto combSc = m_combiner->combine( {e1.get(), e2.get()}, *diElec, *vtx, *lhcb.geometry() );
      if ( combSc.isFailure() || ( diElec->p() < 0. ) ) {
        // (... try alternative combinations and call continue if they fail ...)
      }
      diElectrons.add( diElec.release() );
      vertices.add( vtx.release() );
      children.add( e1.release() );
      children.add( e2.release() );
    }
  }
  return result;
}
```
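As a sanity check of the reservation argument, here is a minimal standalone sketch (plain std containers, not the LHCb code) showing that the i1/i2-style double loop visits exactly n*(n-1)/2 pairs, so reserving n*n entries is indeed an over-estimate:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int main() {
  for ( std::size_t n : {0u, 1u, 2u, 5u, 100u} ) {
    std::vector<int> electrons( n );
    std::size_t      pairs = 0;
    // same loop structure as in the algorithm above
    for ( auto i1 = electrons.begin(); i1 != electrons.end(); ++i1 )
      for ( auto i2 = i1 + 1; i2 != electrons.end(); ++i2 ) ++pairs;
    assert( pairs == n * ( n - 1 ) / 2 ); // number of unordered pairs
    assert( pairs <= n * n );             // so reserve( n * n ) is a safe upper limit
  }
}
```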
I had a look out of curiosity and I agree there is nothing obvious in your code. However, you're still using KeyedContainer there, and a quick look at the code behind it (the last levels of the stack after your code) makes a crash much less surprising! We really have to drop this class, it's evil.
Having said that, to help further I would need to reproduce this locally. Any clue how to achieve that? I suppose it's not happening systematically? Or does it?
Thanks @sponce! I am afraid it's not reproducible, no. The test has run successfully 15 times since this last crash was reported. Also, the previous crash observed for this test, the original one reported in the description of this issue, points to a different part of the code. So it's not straightforward.
Then it's about thread safety and memory corrupted by data races. The problem is most probably not in this code, but in something completely unrelated that just happened to sit close by in memory on that execution and corrupted our memory. Bottom line: we can only give up here and continue the cleanup of thread-unsafe code (i.e. non-functional algorithms, to start with). There is a first list to tackle here: Rec#411
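To illustrate the kind of problem being described, here is a minimal, deliberately racy sketch (a hypothetical ScratchTool, not any actual LHCb class): a tool that keeps per-call state in a mutable member works fine single-threaded, but once the same instance is shared between concurrently processed events, the interleaved clear/push_back calls are a data race that can corrupt the heap, so the eventual crash typically shows up in completely unrelated code.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical non-reentrant tool (NOT the LHCb code): it keeps per-call
// scratch data in a mutable member of a const method.
struct ScratchTool {
  mutable std::vector<double> m_scratch; // shared between all callers
  double distance( double a, double b ) const {
    m_scratch.clear();          // another thread may be touching m_scratch right now
    m_scratch.push_back( a );
    m_scratch.push_back( b );
    return m_scratch[1] - m_scratch[0]; // may mix values from two different calls
  }
};

int main() {
  ScratchTool tool;
  auto event = [&tool]( double offset ) {
    for ( int i = 0; i < 100000; ++i ) tool.distance( offset, offset + i );
  };
  std::thread t1( event, 0. );   // two "events" in flight share one tool instance
  std::thread t2( event, 1000. );
  t1.join();
  t2.join();
  std::printf( "finished: behaviour above is undefined (data race on m_scratch)\n" );
}
```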
I was hoping for a more focused approach, as Rec#411 includes many things that we are not using in hlt2_pp_thor, and we see similar behaviour when running HLT2 at the pit. So focusing on fixing the issues with the current HLT2 configuration is quite a high priority at the moment.
Doing some archaeology, I found this: Phys!968 (merged), which concludes with "Further work will be necessary to actually make LoKi::DistanceCalculator thread-safe." No commits have been made to that code since then.
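For reference, the usual fix for the pattern sketched in the earlier comment, and presumably the kind of "further work" alluded to in Phys!968 (that is an assumption on my side, not a description of the actual LoKi::DistanceCalculator change), is to keep all per-call state local so the const method becomes re-entrant:

```cpp
#include <vector>

// Re-entrant variant of the hypothetical tool above: no mutable members,
// all scratch state is local to the call, so concurrent use is safe.
struct ScratchTool {
  double distance( double a, double b ) const {
    std::vector<double> scratch; // per-call state, nothing shared between threads
    scratch.push_back( a );
    scratch.push_back( b );
    return scratch[1] - scratch[0];
  }
};
```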