Segmentation fault or apparent double free in `RunAllen::operator()` in ~40-50% of chained HLT1 + HLT2 `MooreAnalysis` jobs over Upgrade MC
I first noticed this in some grid jobs that I had submitted to run HLT1 + HLT2 with MooreAnalysis. Either a segmentation fault or a double free seems to occur during event processing. There also seems to be a chance of a segfault after event processing has finished. I did my best to reproduce the problem on an interactive machine (results below).
I am running over the Upgrade MC at the bookkeeping path:
/MC/Upgrade/Beam7000GeV-Upgrade-MagDown-Nu7.6-25ns-Pythia8/Sim10aU1/25203000/XDIGI
How to reproduce it:
source /cvmfs/lhcb.cern.ch/lib/LbEnv
export PLATFORM="x86_64_v2-centos7-gcc11-opt"
export NIGHTLYSLOT="master"
lb-dev -c "${PLATFORM}" --name "MooreAnalysis_${NIGHTLYSLOT}" --nightly "lhcb-${NIGHTLYSLOT}/Latest" "MooreAnalysis/${NIGHTLYSLOT}"
(
cd "./MooreAnalysis_${NIGHTLYSLOT}" || exit 1
git lb-use Moore
git lb-checkout Moore/roneil/charm-lc2pkpi-xic2pkpi-xic2pkkpi Hlt
make
)
curl -L -O https://gitlab.cern.ch/lhcb-charm/charm-production-run-3/-/raw/roneil/lc2pkpi-etc/options/hlt2_eff_lc2pkpi_repr.py
# N.B. need a kerberos ticket or grid proxy to run this job
# rerun this command until the failure is caught; it appears in roughly 40-50% of runs
"./MooreAnalysis_${NIGHTLYSLOT}/run" gaudirun.py --gdb hlt2_eff_lc2pkpi_repr.py
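Since the failure is intermittent, rerunning by hand gets tedious. Below is a minimal sketch of a retry loop that stops at the first failing run. The function name `run_until_failure` and the `LOG_DIR` variable are my own hypothetical names, not part of MooreAnalysis. It assumes that a crash makes the command exit with a non-zero status (a shell reports 139 for SIGSEGV and 134 for SIGABRT), which I have not verified for the `run` wrapper.

```shell
# Rerun a command until it fails or a maximum number of attempts is reached.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
# Returns 0 if a failure was caught, 1 if every attempt succeeded.
# Logs go to $LOG_DIR (hypothetical variable, defaults to the current directory).
run_until_failure() {
  local max_attempts attempt status log_dir log_file
  max_attempts=$1
  shift
  log_dir=${LOG_DIR:-.}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    log_file="${log_dir}/attempt_${attempt}.log"
    # Capture stdout and stderr so the output of the failing run is kept.
    "$@" > "$log_file" 2>&1 && status=0 || status=$?
    if [ "$status" -ne 0 ]; then
      echo "attempt ${attempt} failed with exit status ${status} (log: ${log_file})"
      return 0
    fi
    # Successful runs are not interesting here, so drop their logs.
    rm -f "$log_file"
    attempt=$((attempt + 1))
  done
  echo "no failure in ${max_attempts} attempts"
  return 1
}
```

For example, `run_until_failure 10 "./MooreAnalysis_${NIGHTLYSLOT}/run" gaudirun.py hlt2_eff_lc2pkpi_repr.py` would try up to ten times. I leave out `--gdb` there because gdb is interactive; once a failing attempt is caught, the job can be rerun under gdb as above.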
Here are the last few lines printed by MooreAnalysis before the double free
Here is the stack trace from gdb:
#0 0x00007f75402f63d7 in raise () from /lib64/libc.so.6
#1 0x00007f75402f7ac8 in abort () from /lib64/libc.so.6
#2 0x00007f7540338f67 in __libc_message () from /lib64/libc.so.6
#3 0x00007f7540341329 in _int_free () from /lib64/libc.so.6
#4 0x00007f75110deaa8 in RunAllen::operator()(std::array<std::tuple<std::vector<char, std::allocator<char> >, int>, 87ul> const&, LHCb::ODINImplementation::v7::ODIN const&) const ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/Allen_master/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libAllenWrapper.so
#5 0x00007f75110f0af1 in _ZZN5Gaudi10Functional7details22MultiTransformerFilterIFSt5tupleIJ11HostBuffersEERKSt5arrayIS3_IJSt6vectorIcSaIcEEiEELm87EERKN4LHCb18ODINImplementation2v74ODINEENS0_6Traits4use_IJEEELb1EE7executeEvENKUlDpRT_E_clIJ21DataObjectWriteHandleIS4_S4_EEEEDaSR_ ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/Allen_master/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libAllenWrapper.so
#6 0x00007f75110f1331 in Gaudi::Functional::details::MultiTransformerFilter<std::tuple<HostBuffers> (std::array<std::tuple<std::vector<char, std::allocator<char> >, int>, 87ul> const&, LHCb::ODINImplementation::v7::ODIN const&), Gaudi::Functional::Traits::use_<>, true>::execute() ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/Allen_master/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libAllenWrapper.so
#7 0x00007f751ba91220 in Gaudi::Algorithm::sysExecute(EventContext const&) ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/Gaudi_v36r2/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libGaudiKernel.so
#8 0x00007f751a42ddc0 in GaudiAlgorithm::sysExecute(EventContext const&) ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/Gaudi_v36r2/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libGaudiAlgLib.so
#9 0x00007f751577538c in AlgWrapper::execute(EventContext&, gsl::span<LHCb::Interfaces::ISchedulerConfiguration::State::AlgState, 18446744073709551615ul>) const ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/LHCb_v53r4/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libHLTScheduler.so
#10 0x00007f751576710c in HLTControlFlowMgr::push(EventContext&&)::{lambda(EventContext&)#1}::operator()(EventContext&) const ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/LHCb_v53r4/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libHLTScheduler.so
#11 0x00007f7515767981 in tbb::internal::function_task<(anonymous namespace)::EventTask<HLTControlFlowMgr::push(EventContext&&)::{lambda(EventContext&)#1}> >::execute() ()
from /cvmfs/lhcbdev.cern.ch/nightlies/lhcb-master/latest/LHCb_v53r4/InstallArea/x86_64_v2-centos7-gcc11-opt/lib/libHLTScheduler.so
#12 0x00007f752e29ba45 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f751a0a3e00, context_guard=..., t=0x7f751a0abc40, isolation=isolation@entry=0)
at ../../src/tbb/custom_scheduler.h:474
#13 0x00007f752e29bd78 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f751a0a3e00, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:636
#14 0x00007f752e295857 in tbb::internal::arena::process (this=0x7f751a0b3a00, s=...)
at ../../src/tbb/arena.cpp:196
#15 0x00007f752e294060 in tbb::internal::market::process (this=0x7f751a0bb580, j=...)
at ../../src/tbb/market.cpp:667
#16 0x00007f752e2907ac in tbb::internal::rml::private_worker::run (this=0x7f751240e180)
at ../../src/tbb/private_server.cpp:266
#17 0x00007f752e2909e9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>)
at ../../src/tbb/private_server.cpp:219
#18 0x00007f7540d9eea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f75403be9fd in clone () from /lib64/libc.so.6
So, is the problem around this line? https://gitlab.cern.ch/lhcb/Allen/-/blob/453f67acfd1f69cf965f0c3989c9132320879249/Rec/Allen/src/RunAllen.cpp#L193 How can I debug this in more detail?