Introduce a new way to check timeout on events
The mechanisms provided by `StalledEventMonitor` and `WatchdogThread` are based on the BeginEvent incident, which cannot be used with the same guarantees in multi-threaded jobs.
This MR replaces `WatchdogThread` with an RAII object that can be stored in the TES and that periodically executes a callback: `Gaudi::Utils::PeriodicAction`.
Thanks to `PeriodicAction`, an algorithm can add a timeout checker to the TES for each event, so that multiple events in flight can be checked independently and reliably. `Gaudi::EventWatchdogAlg` does exactly this, with the same features available in `StalledEventMonitor` (log messages, stack trace, abort) and some improvements (like printing the EventContext of the hanging event).
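
To illustrate the idea, here is a minimal standalone sketch of the pattern, not the code in this MR: `PeriodicActionSketch`, `runDemo` and the `inFlight` map (standing in for the per-event store) are hypothetical names, and the real `Gaudi::Utils::PeriodicAction` interface may differ. It assumes only the C++17 standard library.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in, NOT Gaudi::Utils::PeriodicAction: the constructor
// starts a thread that calls `callback` once per `period`; the destructor
// asks the thread to stop and joins it (RAII).
class PeriodicActionSketch {
public:
  PeriodicActionSketch( std::function<void()> callback, std::chrono::milliseconds period )
      : m_thread( [this, callback = std::move( callback ), period] {
          std::unique_lock<std::mutex> lock( m_mutex );
          // wait_for returns false when the period elapsed without a stop request
          while ( !m_cv.wait_for( lock, period, [this] { return m_stop; } ) ) callback();
        } ) {}

  ~PeriodicActionSketch() {
    {
      std::lock_guard<std::mutex> lock( m_mutex );
      m_stop = true;
    }
    m_cv.notify_one();
    m_thread.join();
  }

  PeriodicActionSketch( const PeriodicActionSketch& )            = delete;
  PeriodicActionSketch& operator=( const PeriodicActionSketch& ) = delete;

private:
  // Declared before m_thread so they are initialized before the thread starts.
  std::mutex              m_mutex;
  std::condition_variable m_cv;
  bool                    m_stop = false;
  std::thread             m_thread;
};

struct DemoResult {
  int finishedEventTicks; // callbacks seen for the event that completed at once
  int stalledEventTicks;  // callbacks seen for the event that kept running
};

DemoResult runDemo() {
  using namespace std::chrono_literals;
  std::atomic<int> finishedTicks{ 0 }, stalledTicks{ 0 };
  // Hypothetical stand-in for the per-event store: event number -> checker.
  std::map<int, std::unique_ptr<PeriodicActionSketch>> inFlight;

  inFlight[1] = std::make_unique<PeriodicActionSketch>( [&] { ++finishedTicks; }, 100ms );
  inFlight[2] = std::make_unique<PeriodicActionSketch>( [&] { ++stalledTicks; }, 100ms );

  inFlight.erase( 1 );              // event 1 completes: its checker is destroyed
  assert( inFlight.size() == 1 );   // only event 2 is still being watched
  std::this_thread::sleep_for( 350ms ); // event 2 "hangs" for several periods
  inFlight.clear();                 // event 2 is finally released
  return { finishedTicks.load(), stalledTicks.load() };
}
```

In this sketch each checker lives exactly as long as its entry in the map, so removing the entry for a finished event stops that event's checker without affecting the others. A real callback would log, print a stack trace or abort instead of incrementing a counter.
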
I tried to preserve backward compatibility as much as possible:
- `StalledEventMonitor` and `WatchdogThread` are still available, but should not be used
- the option `ApplicationMgr.StalledEventMonitoring` now adds `Gaudi::EventWatchdogAlg` to the beginning of `TopAlg`, but it is better to use the algorithm directly
- `Gaudi::EventWatchdogAlg` uses the same property names and types as `StalledEventMonitor` (it also uses the configuration of `StalledEventMonitor` from the `JobOptionsSvc`)
Content:
- modernization of `WatchdogThread` (not really needed, but I did it during the development and do not want to throw it away)
- make `GaudiTesting::SleepyAlg` re-entrant (to properly test the watchdog in multi-threading)
- add `Gaudi::Utils::PeriodicAction` as a helper to periodically invoke a callback
- add `Gaudi::EventWatchdogAlg` as a replacement for `StalledEventMonitor`
- add a test to validate that `Gaudi::EventWatchdogAlg` works in a multi-threaded job
- add an example to explain how to properly use `Gaudi::EventWatchdogAlg`
Closes #287