Segfault in ServiceManager
Discovered while debugging https://its.cern.ch/jira/browse/ATR-26084
Throwing an exception in an algorithm causes a stall (as it should), but this can then lead to a segfault in ServiceManager. The backtrace consistently has the following form:
#6 0x00007f50d1233978 in ServiceManager::ServiceItem::operator== (name=..., this=0x7ffe7c6d3630) at /home/bwynne/gaudi/GaudiCoreSvc/src/ApplicationMgr/ServiceManager.h:53
#7 __gnu_cxx::__ops::_Iter_equals_val<std::basic_string_view<char, std::char_traits<char> > const>::operator()<std::_List_iterator<ServiceManager::ServiceItem> > (this=<synthetic pointer>, __it=...) at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/predefined_ops.h:241
#8 std::__find_if<std::_List_iterator<ServiceManager::ServiceItem>, __gnu_cxx::__ops::_Iter_equals_val<std::basic_string_view<char, std::char_traits<char> > const> > (__pred=..., __last=..., __first=...) at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_algo.h:104
#9 std::__find_if<std::_List_iterator<ServiceManager::ServiceItem>, __gnu_cxx::__ops::_Iter_equals_val<std::basic_string_view<char, std::char_traits<char> > const> > (__pred=..., __last=..., __first=...) at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_algo.h:161
#10 std::find<std::_List_iterator<ServiceManager::ServiceItem>, std::basic_string_view<char, std::char_traits<char> > > (__val=<synthetic pointer>..., __last=..., __first=...) at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_algo.h:3905
#11 ServiceManager::find (name=..., this=0x1e372e0) at /home/bwynne/gaudi/GaudiCoreSvc/src/ApplicationMgr/ServiceManager.h:145
#12 ServiceManager::service (this=0x1e372e0, typeName=..., createIf=<optimized out>) at /home/bwynne/gaudi/GaudiCoreSvc/src/ApplicationMgr/ServiceManager.cpp:199
#13 0x00007f50d0a8918c in ISvcLocator::service<ITimelineSvc> (createIf=false, typeName=..., this=0x1e37448) at /home/bwynne/gaudi/GaudiKernel/include/GaudiKernel/ISvcLocator.h:118
#14 AvalancheSchedulerSvc::dumpSchedulerState (this=0x1f09e20, iSlot=2) at /home/bwynne/gaudi/GaudiHive/src/AvalancheSchedulerSvc.cpp:821
#15 0x00007f50d0a8a380 in AvalancheSchedulerSvc::eventFailed (this=this@entry=0x1f09e20, eventContext=<optimized out>) at /home/bwynne/gaudi/GaudiHive/src/AvalancheSchedulerSvc.cpp:789
The specific issue arises here: https://gitlab.cern.ch/gaudi/Gaudi/-/blob/master/GaudiCoreSvc/src/ApplicationMgr/ServiceManager.h#L53
bool operator==( std::string_view name ) const { return service->name() == name; }
This implies that service is an invalid pointer.
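To make the failure mode concrete, here is a minimal standalone sketch of the same lookup shape (std::find over a list, comparing each item against a std::string_view). FakeService, Item and findService are hypothetical stand-ins, not the Gaudi types. The guard in operator== only protects against an empty pointer. It would not help if the pointer were dangling, and the backtrace alone does not say which case applies, so this is an illustration rather than a proposed fix.

```cpp
#include <algorithm>
#include <list>
#include <memory>
#include <string>
#include <string_view>

// Hypothetical stand-in for a service; NOT the Gaudi IService interface.
struct FakeService {
  std::string        m_name;
  const std::string& name() const { return m_name; }
};

// Hypothetical stand-in for ServiceManager::ServiceItem.
struct Item {
  std::shared_ptr<FakeService> service;
  // Guarded comparison: an empty pointer never matches, instead of being dereferenced.
  bool operator==( std::string_view name ) const { return service && service->name() == name; }
};

// Same shape as the lookup in the backtrace: std::find over a list of items.
// Returns nullptr when nothing matches, analogous to passing createIf = false.
inline const FakeService* findService( const std::list<Item>& items, std::string_view name ) {
  auto it = std::find( items.begin(), items.end(), name );
  return it != items.end() ? it->service.get() : nullptr;
}
```

With a list that contains one item holding an empty pointer and one holding a service named "TimelineSvc", findService skips the empty item and returns the named service. Without the guard, the comparison would dereference the empty pointer on the first item.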
The call that causes the segfault is coming from here: https://gitlab.cern.ch/gaudi/Gaudi/-/blob/master/GaudiHive/src/AvalancheSchedulerSvc.cpp#L821
auto timelineSvc = serviceLocator()->service<ITimelineSvc>( "TimelineSvc", false );
In short, it is trying to retrieve a service that may not exist, and passes createIf = false so that the service is not created if it is not already there.
This occurs with the attached job options, roughly 4 times in 100 attempts. I've also attached a quick Python script that reruns the job until a segfault occurs.
I'm not sure what causes the problem, although I note that the test has a high degree of parallelism and can include simultaneous stalls in different event slots/subslots. The algorithm exception does not have to be within a subslot to trigger the segfault, and indeed there do not have to be any subslots active. The only requirement seems to be multiple events in flight, with more than one of them stalling (at a similar time?).
It can occur even when running with only a single thread.