Observed in LHCb!4129 (comment 6779950). The throughput was halved because one of the two jobs crashed. The logs are here and the relevant stack trace is in crash.txt.
diElectrons is reserved with electrons.size() * electrons.size() entries, which should be more than the possible number of combinations, since we loop from 0 to n for the first electron (e1) and from e1+1 to n for the second one.
Then vtx is filled by auto combSc = m_combiner->combine( {e1.get(), e2.get()}, *diElec, *vtx, *lhcb.geometry() ); and it is explicitly checked that the combiner succeeds before adding vtx to vertices.
Any idea what could be going wrong? @graven @mveghel @mramospe I am adding a simplified version of the full code logic below in case it helps:
```cpp
std::tuple<LHCb::Particles, LHCb::Vertices, LHCb::Particles>
FunctionalDiElectronMaker::operator()( LHCb::Particle::Range const& electrons,
                                       DetectorElement const&       lhcb ) const {
  // output containers
  auto result = std::tuple<LHCb::Particles, LHCb::Vertices, LHCb::Particles>{};
  auto& [diElectrons, vertices, children] = result;
  diElectrons.reserve( electrons.size() * electrons.size() ); // upper limit
  vertices.reserve( diElectrons.size() );
  children.reserve( diElectrons.size() * 2 );
  // (... debug messages ...)
  for ( auto i1 = electrons.begin(); i1 != electrons.end(); ++i1 ) {
    // (... check i1 ...)
    for ( auto i2 = i1 + 1; i2 != electrons.end(); ++i2 ) {
      // (... check i2 ...)
      // clone electrons so we only modify the clone
      std::unique_ptr<LHCb::Particle> e1( ( *i1 )->clone() );
      std::unique_ptr<LHCb::Particle> e2( ( *i2 )->clone() );
      // (... apply brem correction and some cuts ...)
      // combine electrons
      auto diElec = std::make_unique<LHCb::Particle>( m_particle_prop->particleID() );
      auto vtx    = std::make_unique<LHCb::Vertex>();
      auto combSc = m_combiner->combine( {e1.get(), e2.get()}, *diElec, *vtx, *lhcb.geometry() );
      if ( combSc.isFailure() || ( diElec->p() < 0. ) ) {
        // (... try alternative combinations and call continue if they fail ...)
      }
      diElectrons.add( diElec.release() );
      vertices.add( vtx.release() );
      children.add( e1.release() );
      children.add( e2.release() );
    }
  }
  return result;
}
```
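As a sanity check of the reservation argument, here is a minimal standalone sketch (plain std containers, not the LHCb code) showing that the i1/i2-style double loop visits exactly n*(n-1)/2 pairs, so reserving n*n entries is indeed an over-estimate:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int main() {
  for ( std::size_t n : {0u, 1u, 2u, 5u, 100u} ) {
    std::vector<int> electrons( n );
    std::size_t      pairs = 0;
    // same loop structure as in the algorithm above
    for ( auto i1 = electrons.begin(); i1 != electrons.end(); ++i1 )
      for ( auto i2 = i1 + 1; i2 != electrons.end(); ++i2 ) ++pairs;
    assert( pairs == n * ( n - 1 ) / 2 ); // number of unordered pairs
    assert( pairs <= n * n );             // so reserve( n * n ) is a safe upper limit
  }
}
```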
I had a look out of curiosity and I agree there is nothing obvious in your code. However, you're still using KeyedContainer there, and a quick look at the code behind it (the last levels of the stack after your code) makes a crash much less surprising! We really have to drop this class, it's evil.
Having said that, to help further I would need to reproduce this locally. Any clue how to achieve that? I suppose it's not happening systematically? Or does it?
Thanks @sponce! I am afraid it's not reproducible, no. The test has run successfully 15 times since this last crash was reported. Also, the previous crash observed for this test, the original one reported in the description of this issue, points to a different part of the code. So it's not straightforward.
Then it's about thread safety and memory corrupted by data races. The problem is most probably not in this code, but in something completely unrelated that just happened to sit close by in memory on that execution and corrupted our memory. Bottom line: we can only give up here and continue the cleanup of thread-unsafe code (i.e. non-functional algorithms, to start with). There is a first list to tackle here: Rec#411
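To illustrate the kind of problem being described, here is a minimal, deliberately racy sketch (a hypothetical ScratchTool, not any actual LHCb class): a tool that keeps per-call state in a mutable member works fine single-threaded, but once the same instance is shared between concurrently processed events, the interleaved clear/push_back calls are a data race that can corrupt the heap, so the eventual crash typically shows up in completely unrelated code.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical non-reentrant tool (NOT the LHCb code): it keeps per-call
// scratch data in a mutable member of a const method.
struct ScratchTool {
  mutable std::vector<double> m_scratch; // shared between all callers
  double distance( double a, double b ) const {
    m_scratch.clear();          // another thread may be touching m_scratch right now
    m_scratch.push_back( a );
    m_scratch.push_back( b );
    return m_scratch[1] - m_scratch[0]; // may mix values from two different calls
  }
};

int main() {
  ScratchTool tool;
  auto event = [&tool]( double offset ) {
    for ( int i = 0; i < 100000; ++i ) tool.distance( offset, offset + i );
  };
  std::thread t1( event, 0. );   // two "events" in flight share one tool instance
  std::thread t2( event, 1000. );
  t1.join();
  t2.join();
  std::printf( "finished: behaviour above is undefined (data race on m_scratch)\n" );
}
```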
I was hoping for a more focused approach, as Rec#411 includes many things that we are not using in hlt2_pp_thor, and we see similar behaviour when running HLT2 at the pit. So focusing on fixing the issues with the current HLT2 configuration is quite a high priority at the moment.
Doing some archaeology, I found this: Phys!968 (merged), which concludes with "Further work will be necessary to actually make LoKi::DistanceCalculator thread-safe." No commits have been made to that code since then.
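For reference, the usual fix for the pattern sketched in the earlier comment, and presumably the kind of "further work" alluded to in Phys!968 (that is an assumption on my side, not a description of the actual LoKi::DistanceCalculator change), is to keep all per-call state local so the const method becomes re-entrant:

```cpp
#include <vector>

// Re-entrant variant of the hypothetical tool above: no mutable members,
// all scratch state is local to the call, so concurrent use is safe.
struct ScratchTool {
  double distance( double a, double b ) const {
    std::vector<double> scratch; // per-call state, nothing shared between threads
    scratch.push_back( a );
    scratch.push_back( b );
    return scratch[1] - scratch[0];
  }
};
```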