Introduce a new way to check timeout on events

Marco Clemencic requested to merge add-alg-timeout-monitor-to-scheduler into master

The mechanism provided by StalledEventMonitor and WatchdogThread is based on the BeginEvent incident, which cannot be used with the same guarantees in multi-threaded jobs.

This MR replaces WatchdogThread with an RAII object that can be stored in the TES and that periodically executes a callback: Gaudi::Utils::PeriodicAction.
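To illustrate the idea of an RAII object that periodically executes a callback, here is a minimal sketch. It is not the implementation of Gaudi::Utils::PeriodicAction: the class name `PeriodicActionSketch`, its constructor signature and its members are assumptions made for this example only. The sketch starts a worker thread on construction, and stops and joins it on destruction, so the callback can only fire while the object is alive.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical illustration only, NOT Gaudi::Utils::PeriodicAction.
// Invokes `callback` every `period` until the object is destroyed.
class PeriodicActionSketch {
public:
  PeriodicActionSketch(std::function<void()> callback, std::chrono::milliseconds period)
      : m_callback(std::move(callback)), m_period(period) {
    assert(m_callback && "a callable is required");
    // Start the worker only after all other members are initialized.
    m_thread = std::thread([this] { run(); });
  }

  ~PeriodicActionSketch() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_stop = true;
    }
    m_cv.notify_one(); // wake the worker so it does not wait for a full period
    if (m_thread.joinable()) m_thread.join();
  }

  // The worker captures `this`, so this sketch is neither copyable nor movable.
  PeriodicActionSketch(const PeriodicActionSketch&) = delete;
  PeriodicActionSketch& operator=(const PeriodicActionSketch&) = delete;

private:
  void run() {
    std::unique_lock<std::mutex> lock(m_mutex);
    // wait_for returns false when the period elapsed without a stop request.
    while (!m_cv.wait_for(lock, m_period, [this] { return m_stop; })) {
      lock.unlock(); // do not hold the lock while running user code
      m_callback();
      lock.lock();
    }
  }

  std::function<void()>     m_callback;
  std::chrono::milliseconds m_period;
  std::mutex                m_mutex;
  std::condition_variable   m_cv;
  bool                      m_stop = false;
  std::thread               m_thread;
};
```

Because this sketch is not movable, it would have to be held through something like a `std::unique_ptr` to be placed in a container; whether and how the real class handles moves is not covered here.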

Thanks to PeriodicAction, an algorithm can add a timeout checker to the TES for each event, so that multiple events in flight can be checked independently and reliably. Gaudi::EventWatchdogAlg does exactly this, with the same features available in StalledEventMonitor (log messages, stack trace, abort) and some improvements (like printing the EventContext of the hanging event).
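The per-event independence can be pictured with a small sketch of the callback such a checker might run. Everything here is hypothetical and not taken from Gaudi::EventWatchdogAlg: `EventIdSketch` stands in for the identification carried by an EventContext, and `makeTimeoutCallback`, `TimeoutReport` and the "0 means never abort" convention are assumptions for illustration. The point is that each event gets its own callback with its own counter, and that a call only happens if the event is still alive when the period elapses.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for the event identification in an EventContext.
struct EventIdSketch {
  unsigned long evt;
  unsigned int  slot;
};

// What one firing of the checker reports (assumed shape, for illustration).
struct TimeoutReport {
  int         count;   // how many periods elapsed for this event so far
  bool        abort;   // true once the configured maximum is reached
  std::string message; // text a real checker would send to the log
};

// Build the callback a periodic timer would invoke for ONE event.
// maxTimeouts == 0 is taken to mean "never request an abort".
std::function<TimeoutReport()> makeTimeoutCallback(EventIdSketch id, int maxTimeouts) {
  assert(maxTimeouts >= 0);
  // `count` lives inside the callback, so events do not share any state.
  return [id, maxTimeouts, count = 0]() mutable {
    ++count;
    const bool abort = maxTimeouts > 0 && count >= maxTimeouts;
    std::string msg = "event " + std::to_string(id.evt) + " (slot " + std::to_string(id.slot) +
                      ") still running after " + std::to_string(count) + " timeout period(s)";
    return TimeoutReport{count, abort, msg};
  };
}
```

Two events in flight would each get their own callback from `makeTimeoutCallback`, so a slow event in one slot never affects the count of another.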

I tried to preserve backward compatibility as much as possible:

  • StalledEventMonitor and WatchdogThread are still available, but should not be used
  • the option ApplicationMgr.StalledEventMonitoring now adds Gaudi::EventWatchdogAlg to the beginning of TopAlg, but it is better to use the algorithm directly
  • Gaudi::EventWatchdogAlg uses the same property names and types as StalledEventMonitor (it also uses the configuration of StalledEventMonitor from the JobOptionsSvc)

Content:

  • modernization of WatchdogThread (not really needed, but I do not want to throw it away as I did it during the development)
  • make GaudiTesting::SleepyAlg re-entrant (to properly test the watchdog in multi-threading)
  • add Gaudi::Utils::PeriodicAction as a helper to periodically invoke a callback
  • add Gaudi::EventWatchdogAlg as a replacement for StalledEventMonitor
  • add a test to validate Gaudi::EventWatchdogAlg works in a multi-threaded job
  • add an example to explain how to properly use Gaudi::EventWatchdogAlg

Closes #287 (closed)
