TrackEventFitter - Thread unsafe counters
Whilst working on Panoptes!116 (merged) I have intermittently been running into a crash when running with multi-threading enabled.
[INFO] Process: '' Number of elements in backtrace: 21
/usera/jonesc/LHCbCMake/Feature/Gaudi/InstallArea/x86_64+avx2+fma-centos7-gcc9-opt/lib/libGaudiCoreSvc.so/usera/jonesc/LHCbCMake/Feature/Online/InstallArea/x86_64+avx2+fma-centos7-gcc9-opt/lib/libOnlineBase.so(_ZN3RTL17ExitSignalHandler10back_traceEv+0x27()[0x7f07aed2e307]
_ZN5Gaudi10Monitoring14MessageSvcSink14registerEntityENS0_3Hub6EntityE/usera/jonesc/LHCbCMake/Feature/Online/InstallArea/x86_64+avx2+fma-centos7-gcc9-opt/lib/libOnlineBase.so(_ZN3RTL17ExitSignalHandler7handlerEiP9siginfo_tPv+0x+0x20d)[0x7f07aed307ad]
/lib64/libpthread.so.0(+0xf630)[0x7f07e4d1d630]
61/usera/jonesc/LHCbCMake/Feature/Gaudi/InstallArea/x86_64+avx2+fma-centos7-gcc9-opt/lib/libGaudiCoreSvc.so)(_ZN5Gaudi10Monitoring14MessageSvcSink14registerEntityENS0_3Hub6EntityE+0x61)[0x7f07c3d33591]
[0x/usera/jonesc/LHCbCMake/Feature/Gaudi/InstallArea/x86_64+avx2+fma-centos7-gcc9-opt/lib/libGaudiAlgLib.so(7f07c3d33591_ZN11GaudiCommonI9AlgorithmE7counterESt17basic_string_viewIcSt11char_traitsIcEE+0x760)[0x7f07c1a32fe0]
/usera/jonesc/LHCbCMake/Feature/Rec/InstallArea/x86_64+avx2+fma-centos7-gcc9-opt/lib/libTrackFitter.so]
The above shows the issue is coming from the use of an old style GaudiAlg counter
from something in libTrackFitter.so
. Unfortunately I haven't been able to get the crash with a debug build, so lack code line numbers, but searching the code from that library there is really only one place it could be, which is in TrackEventFitter
there are 5 calls to counter(string) there, and putting in a debug print statement I know they are being called in test the above MR adds. I thinks its clear that sometimes two threads try and create the same counter at the same time, and boom. I also suspect this error is not see in other Moore tests as there the jobs tend not to run at info() level, and 4 of the 5 counters are skipped in this case. I think though we should not rely on this and fix it properly...
Solutions I see are
- Convert counters to new style thread safe ones. Complication here, likely why not done already, is that the counter name is dynamically created based on track type, so not really easy to know up front, which the new counters require. One question I would ask is would we loose much just having a single counter for all tracks a fitter gets to see, so without the type ? I guess the idea anyway is we run separate fitters for each type anyway, so the 'type' is implicit from the fitter instance ?
- Remove counters and move to monitoring/checking algorithms.
- Put a mutex lock in to protect the calls to
counter(string)
. - Protect the implementation of
counter(string)
to make it fully thread safe. - Probably others I have not thought of