TrackMasterFItter: Add mutex locks to operator()
Temporary work around to ensure TrackMasterFitter is thread safe by adding lock in operator(). Thread-safety issues noticed when running alignment where it crashes randomly (although rarely). These issues also detected by helgrind
Merge request reports
Activity
added RTA label
25 25 #include "TrackInterfaces/ITrackFitter.h" 26 26 #include "TrackInterfaces/ITrackKalmanFilter.h" 27 27 28 // C++ stuff Can you please also add additional information on what the actual problem/symptom is which this 'sledge hammer' approach would solve? As it is clear that this is going to be a major bottleneck, and it would be better if any synchronization could be pushed down to a more fine-grained level -- so there should be an accompanying issue that documents what needs to happen so that these locks can be removed again.
(and if the problem is that hellgrind complains about non-threadsafe
RK::ErrorCounters
, then it isn't really an observable problem)@jcobbled I'd also prefer if you could provide more info as to what actually crashes and maybe even a stack trace.
Please note, that up to the beginning of this month we were still running the
hlt2_reco_baseline
configuration in the throughput tests. See for example this result.
This config runs theTrackMasterFitter
, thus I'm skeptical that the Alignment sees a problem that we didn't see for years in the throughput tests.I guess it's more a problem of misconfiguration or a threading issue in a different component.
I suspect (as mentioned above) that hellgrind is complaining about data races involving these mutable data members -- none of which impact the results, just some (not?) reported statistics which no-one actually looks at which may be slightly wrong... Hence I do not think that these locks actually do anything useful other than silencing hellgrind.
I think this MR can be dropped for now. I will do some more tests with alignment without this lock to see if I can produce an error caused by TrackMasterFit. Helgrind reported data races such as the below which I guess are not too important.
==32663== Possible data race during read of size 8 at 0xD2AD6210 by thread #18 ==32663== Locks held: none ==32663== at 0xE8C53A55: RK::ErrorCode TrackRungeKuttaExtrapolator::extrapolate<RK::Scheme::ButcherTableau<6u, false>, TrackRungeKuttaExtrapolator::doNothing>(RK::Scheme::ButcherTableau<6u, false> const&, RK::details::State<double>&, double, RK::details::Jacobian<double>*, TrackRungeKuttaExtrapolator::doNothing) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackExtrapolators.so) ==32663== by 0xE8C370F6: auto RK::Scheme::invoke_with_tableau<TrackRungeKuttaExtrapolator::propagate(RK::details::State<double>&, double, ROOT::Math::SMatrix<double, 5u, 5u, ROOT::Math::MatRepStd<double, 5u, 5u> >*, IGeometryInfo const&, LHCb::Tr::PID) const::{lambda(auto:1 const&)#1}>(TrackRungeKuttaExtrapolator::propagate(RK::details::State<double>&, double, ROOT::Math::SMatrix<double, 5u, 5u, ROOT::Math::MatRepStd<double, 5u, 5u> >*, IGeometryInfo const&, LHCb::Tr::PID) const::{lambda(auto:1 const&)#1}, RK::Scheme::scheme_t) (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackExtrapolators.so) ==32663== by 0xE8C37892: TrackRungeKuttaExtrapolator::propagate(RK::details::State<double>&, double, ROOT::Math::SMatrix<double, 5u, 5u, ROOT::Math::MatRepStd<double, 5u, 5u> >*, IGeometryInfo const&, LHCb::Tr::PID) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackExtrapolators.so) ==32663== by 0xE8C37E02: TrackRungeKuttaExtrapolator::propagate(ROOT::Math::SVector<double, 5u>&, double, double, ROOT::Math::SMatrix<double, 5u, 5u, ROOT::Math::MatRepStd<double, 5u, 5u> >*, IGeometryInfo const&, LHCb::Tr::PID) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackExtrapolators.so) ==32663== by 0xE8C16F45: TrackMasterExtrapolator::propagate(ROOT::Math::SVector<double, 5u>&, double, double, ROOT::Math::SMatrix<double, 5u, 5u, ROOT::Math::MatRepStd<double, 5u, 5u> >*, IGeometryInfo const&, LHCb::Tr::PID) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackExtrapolators.so) ==32663== by 0xE98518F4: TrackMasterFitter::initializeRefStates(LHCb::Event::v1::Track&, IGeometryInfo const&, LHCb::Tr::PID) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackFitter.so) ==32663== by 0xE9851B23: TrackMasterFitter::makeNodes(LHCb::Event::v1::Track&, LHCb::Tr::PID, std::any&, IGeometryInfo const&) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackFitter.so) ==32663== by 0xE985258F: TrackMasterFitter::fit_r(LHCb::Event::v1::Track&, std::any&, IGeometryInfo const&, LHCb::Tr::PID) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackFitter.so) ==32663== by 0xE9853884: TrackMasterFitter::operator()(LHCb::Event::v1::Track&, IGeometryInfo const&, LHCb::Tr::PID const&) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackFitter.so) ==32663== by 0x4BC17DDF: TrackBestTrackCreator::fitAndUpdateCounters((anonymous namespace)::TrackData&, IGeometryInfo const&) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackUtils.so) ==32663== by 0x4BC185FD: TrackBestTrackCreator::fitAndSelect((anonymous namespace)::TrackData&, IGeometryInfo const&) const (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackUtils.so) ==32663== by 0x4BC18EB7: TrackBestTrackCreator::operator()(Gaudi::Functional::details::vector_of_const_<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > const&) const::{lambda((anonymous namespace)::TrackData&)#4} std::for_each<__gnu_cxx::__normal_iterator<std::reference_wrapper<(anonymous namespace)::TrackData>*, std::vector<std::reference_wrapper<(anonymous namespace)::TrackData>, std::allocator<std::reference_wrapper<(anonymous namespace)::TrackData> > > >, TrackBestTrackCreator::operator()(Gaudi::Functional::details::vector_of_const_<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > const&) const::{lambda((anonymous namespace)::TrackData&)#4}>(__gnu_cxx::__normal_iterator<std::reference_wrapper<(anonymous namespace)::TrackData>*, std::vector<std::reference_wrapper<(anonymous namespace)::TrackData>, std::allocator<std::reference_wrapper<(anonymous namespace)::TrackData> > > >, TrackBestTrackCreator::operator()(Gaudi::Functional::details::vector_of_const_<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > const&) const::{lambda((anonymous namespace)::TrackData&)#4}, TrackBestTrackCreator::operator()(Gaudi::Functional::details::vector_of_const_<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > const&) const::{lambda((anonymous namespace)::TrackData&)#4}) (in /hepgpu10-data1/johncob/alignment/upgrade-hackathon-setup/Rec/InstallArea/x86_64_v2-centos7-gcc11-dbg/lib/libTrackUtils.so)
If you want to make sure, then just remove all code that touches any of those four mutable data members (
m_errors
,m_numcalls
,m_totalstats
, andm_stats
) fromTrackRungeKuttaExtrapolator
. That should be cleanly/trivially possible, and it will also show that this has, by construction, no effect on the results.actually, please just try with !2807 (merged)...
151 151 //========================================================================= 152 152 StatusCode TrackMasterFitter::operator()( LHCb::span<LHCb::Track> tracks, IGeometryInfo const& geometry, 153 153 const LHCb::Tr::PID& pid ) const { 154 auto cache = createCache(); 155 StatusCode sc( StatusCode::SUCCESS ); 154 // add lock to ensure thread-safety 155 std::lock_guard<std::mutex> lock( m_operatorLock ); assigned to @lpica
mentioned in merge request !2807 (merged)
mentioned in issue Moore#416 (closed)
unassigned @fsouzade
mentioned in issue Moore#421 (closed)