Introduce a new way to check timeout on events
The mechanism provided by StalledEventMonitor and WatchdogThread is based on the BeginEvent incident, which cannot be used with the same guarantees in multi-threaded jobs.
This MR replaces WatchdogThread with an RAII object that can be stored in the TES and that periodically executes a callback: Gaudi::Utils::PeriodicAction.
Thanks to PeriodicAction it is possible to have an algorithm that adds a timeout checker to the TES for each event, so that multiple events in flight can be checked independently and reliably. Gaudi::EventWatchdogAlg does exactly this, with the same features available in StalledEventMonitor (log messages, stack trace, abort) and some improvements (like printing the EventContext of the hanging event).
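To illustrate the idea of an RAII object that periodically runs a callback, here is a minimal standalone sketch. It is not the code in this MR: `PeriodicActionSketch` and `count_timeout_checks` are hypothetical names, and the real Gaudi::Utils::PeriodicAction interface may differ. The sketch only assumes the behaviour described above: the callback fires at a fixed period for as long as the object lives, and destroying the object stops it.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for illustration only, NOT the Gaudi::Utils::PeriodicAction API.
class PeriodicActionSketch {
public:
  using clock = std::chrono::steady_clock;

  PeriodicActionSketch( std::function<void()> callback, std::chrono::milliseconds period )
      : m_callback( std::move( callback ) ), m_period( period ), m_thread( [this] { run(); } ) {
    assert( m_period.count() > 0 );
  }

  PeriodicActionSketch( const PeriodicActionSketch& )            = delete;
  PeriodicActionSketch& operator=( const PeriodicActionSketch& ) = delete;

  // RAII: destroying the object wakes the worker thread and waits for it to finish.
  ~PeriodicActionSketch() {
    {
      std::lock_guard<std::mutex> lock( m_mutex );
      m_stop = true;
    }
    m_cv.notify_one();
    m_thread.join();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock( m_mutex );
    auto                         next = clock::now() + m_period;
    // wait_until returns false when the deadline passes without a stop request.
    while ( !m_cv.wait_until( lock, next, [this] { return m_stop; } ) ) {
      lock.unlock();
      m_callback(); // run the check without holding the lock
      lock.lock();
      next += m_period;
    }
  }

  std::function<void()>     m_callback;
  std::chrono::milliseconds m_period;
  std::mutex                m_mutex;
  std::condition_variable   m_cv;
  bool                      m_stop = false;
  std::thread               m_thread; // declared last: started after all other members exist
};

// Hypothetical helper: keep one watchdog alive for `lifetime` (standing in for the
// time an event spends being processed) and report how many times the callback ran.
int count_timeout_checks( std::chrono::milliseconds period, std::chrono::milliseconds lifetime ) {
  std::atomic<int> ticks{ 0 };
  {
    PeriodicActionSketch watchdog( [&ticks] { ++ticks; }, period );
    std::this_thread::sleep_for( lifetime );
  } // watchdog destroyed here: no further callbacks after this point
  return ticks.load();
}
```

Because each instance owns its own thread and state, one such object per event (for example stored alongside the event data) would be checked independently of the others, and it disappears automatically when the event's data is released.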
I tried to preserve backward compatibility as much as possible:
- StalledEventMonitor and WatchdogThread are still available, but should not be used
- the option ApplicationMgr.StalledEventMonitoring now adds Gaudi::EventWatchdogAlg to the beginning of TopAlg, but it is better to use the algorithm directly
- Gaudi::EventWatchdogAlg uses the same property names and types as StalledEventMonitor (it also uses the configuration of StalledEventMonitor from the JobOptionsSvc)
Content:
- modernization of WatchdogThread (not really needed, but I do not want to throw it away as I did it during the development)
- make GaudiTesting::SleepyAlg re-entrant (to properly test the watchdog in multi-threading)
- add Gaudi::Utils::PeriodicAction as a helper to periodically invoke a callback
- add Gaudi::EventWatchdogAlg as a replacement for StalledEventMonitor
- add a test to validate Gaudi::EventWatchdogAlg works in a multi-threaded job
- add an example to explain how to properly use Gaudi::EventWatchdogAlg
Closes #287