The benchmark test Moore_spruce_rate_and_test crashed since June 1st, some of the previous tests are fine (e.g. test for lhcb-master-mr.4580).
Dataset
All files of upgrade-minbias-hlt2-persistreco-output in TestFileDB, which are output MDF files from a series of Hlt2 persistreco lines.
Moore version used (or nightly, as appropriate)
Any nightly build since June 1st. An example is lhcb-master.1701. The platform is x86_64_v3-centos7-gcc11-opt+g. Its log file of the Moore run is under
Hi @sesen , I reproduced the input files yesterday with latest Moore master (which should be lhcb-master.1701), and the issue still exists, the latest log file can be found under
Hi @shunan, I'm hoping to show some nice moni plots of Sprucing next Tuesday in the DPA report, hence it would be ideal not to wait long. I did not follow on this here; how far off are you from getting the issue sorted?
Unfortunately this issue is unknown to us, and I hope lhcb-datapkg/PRConfig!233 (merged) could fix this. So let's try to merge it first, and if that doesn't work, some further input from @sesen would be appreciated.
BTW, the latest test for lhcb-master.1705 is successful, so it's still unclear why and how does the crash happen. I think you can refer to its PR url if we didn't get this fixed by next Tuesday.
@chasse could you take a look into this issue? A recent working example is lhcb-master.1705, and all other tests since June 1st failed. They produced similar errors related to several unpackers.
There were some permission issue on the test machine yesterday, I'll tag Maciej for it.
As discussed, I've reverted the PRConfig MR, as that will just break the test even more.
I've just had a look at the logs from lhcb-master 1704, which are here: /eos/lhcb/storage/lhcbpr/www/UpgradeRateTest/RateTest_lhcb-master.1704_Moore_spruce_rate_and_size_x86_64_v3-centos7-gcc11-opt+g_2022-06-05_10
The job logs show:
HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck... *** Break *** segmentation violation *** Break *** HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...segmentation violation
followed by many more lines of that warning.
Unfortunately, I'm not too familiar with the recently merged rewrite of the decoding... @sesen any idea what could be going wrong?
edit: adding the stack trace too:
The lines below might hint at the cause of the crash.You may get help by asking at the ROOT forum https://root.cern.ch/forumOnly if you are really convinced it is a bug in ROOT then please submit areport at https://root.cern.ch/bugs Please post the ENTIRE stack tracefrom above as an attachment in addition to anything elsethat might help us fixing this issue.===========================================================#7 std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > >::pair (this=0xfffffffff760b8a0) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/11.1.0-e80bf/x86_64-centos7/include/c++/11.1.0/bits/stl_pair.h:314#8 __gnu_cxx::new_allocator<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > >::construct<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > >, std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > const&> (__p=0xfffffffff760b8a0, this=0x11ecd780) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/11.1.0-e80bf/x86_64-centos7/include/c++/11.1.0/ext/new_allocator.h:156#9 std::allocator_traits<std::allocator<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > > >::construct<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > >, std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > const&> (__p=0xfffffffff760b8a0, __a=...) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/11.1.0-e80bf/x86_64-centos7/include/c++/11.1.0/bits/alloc_traits.h:512#10 std::vector<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > >, std::allocator<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > > >::_M_realloc_insert<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > const&> (this=thisentry=0x11ecd780, __position=__positionentry=...) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/11.1.0-e80bf/x86_64-centos7/include/c++/11.1.0/bits/vector.tcc:449#11 0x00007f511fe1f972 in std::vector<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > >, std::allocator<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > > >::insert (__x=..., __position=..., this=0x11ecd780) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/11.1.0-e80bf/x86_64-centos7/include/c++/11.1.0/bits/stl_iterator.h:1007#12 GaudiUtils::VectorMap<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > >, std::less<unsigned int const>, std::allocator<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > > >::insert (value=..., this=0x11ecd778) at /workspace/build/Gaudi/InstallArea/x86_64_v3-centos7-gcc11-opt+g/include/GaudiKernel/VectorMap.h:358#13 GaudiUtils::VectorMap<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > >, std::less<unsigned int const>, std::allocator<std::pair<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t, std::less<int const>, std::allocator<std::pair<int, HltRawBankDecoderBase::element_t> > > > > >::insert (key=0x7f511309b57c: 0, key=0x7f511309b57c: 0, mapped=..., this=0x11ecd778) at /workspace/build/Gaudi/InstallArea/x86_64_v3-centos7-gcc11-opt+g/include/GaudiKernel/VectorMap.h:312#14 HltRawBankDecoderBase::fetch_id2string (this=thisentry=0x11ecc000, tck=<optimized out>) at ../Hlt/HltDAQ/src/component/HltRawBankDecoderBase.cpp:128#15 0x00007f511fdb66e8 in HltRawBankDecoderBase::id2string (tck=<optimized out>, this=0x11ecc000) at ../Hlt/HltDAQ/src/component/HltRawBankDecoderBase.h:51#16 HltDecReportsDecoder::operator() (this=this
These warnings are expected and should not cause any crash but the stack trace does seem to indicate the decoder is at fault. Maybe it cant find the json?
@shunan can you recreate this error locally? Or does in ONLY happen in the PR tests?
Unfortunately this seems to be a crash only happens in the PR tests. I tried several local runs on 1e4 events (the crash happens at O(1000) event), either single-threaded (-t 1) or multi-threaded (-t 2), run with my old stack (was updated last week) or a newly built stack (from lb-stack-setup today), all tests finished successfully. The tests were run on some random lxplus node.
I just rebuild my stack and just running ./Moore/run DBASE/PRConfig/scripts/benchmark-scripts/Moore_spruce_rate_and_size.sh I can reproduce the error.
Because it's a dbg build I also get some assert that fires:
HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck... *** Break *** HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck... *** Break *** python: ../Hlt/HltDAQ/src/component/HltRawBankDecoderBase.cpp:129: GaudiUtils::VectorMap<unsigned int, GaudiUtils::VectorMap<int, HltRawBankDecoderBase::element_t> >::const_iterator HltRawBankDecoderBase::fetch_id2string(unsigned int) const: Assertion `res.second' failed.HltDecReportsDecoder WARNING TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...
@shunan I just tested to only run on a single thread and I don't see the crash. So I think it's a thread safety issue.
The problem probably doesn't show on LXPLUS because you only have a few slow threads, so it's harder to trigger the race condition.
Looking at HltRawBankDecoderBase.cpp I do see that the algorithm has multiple member variables that are mutable and get modified during the execution of the operator()
cc @graven
Thanks @chasse for figuring this out! Apparently I can't open too much threads on lxplus, and the private farm of our unveristy has slow communication with eos, that's probably also why the CPU usage is always low even if I open O(100) threads.
the thead-safety in HltRawBankDecoderBase.cpp will be fixed as part of LHCb!3528 (merged) (actually, HltRawBankDecoderBase will cease to exist with that MR)
As this is a thread safety issue, I think single-threaded should be a workaround. The problem of single-threaded is timing. Now it takes ~1h to process 1e6 events on a 16-core machine, so we have to reduce the n_events to ~6000 to make it finish in time. Let me check if 6k events can provide us some reasonable output of rates and event sizes, and I will make the change.
I don't know if Hlt2 is affected by this issue, as a clear difference is Hlt2 does not decode raw banks from a tck json file.
Many thanks @chasse , although it's a temporary fix I think we can now close this issue. Things will eventrally get done in LHCb!3528 (merged) but for this test no more action is needed.