Move to event-wise stall detection

Depends on !689 (merged), please merge after.

Since the initial GaudiHive prototype, stall detection has so far always been global (despite what the signature of AvalancheSchedulerSvc::isStall might suggest). This means that a stall that does not span all events goes undetected until the job either runs out of events completely or accumulates enough stalled events to gradually exhaust all available slots. In either case, the stalled event slots stay out of play until the job finally fails.

This MR implements proper intra-event stall detection.
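To illustrate the difference between the two detection modes, here is a minimal C++ sketch. It is not the AvalancheSchedulerSvc code: `SlotState`, its fields, and the functions `isSlotStalled`, `isGlobalStall` and `stalledSlots` are hypothetical names, and the stall condition used (work remains, but nothing is in flight and nothing is ready to run) is an assumed simplification.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified view of an event slot; not the Gaudi data structure.
struct SlotState {
  bool active;            // slot currently holds an event
  std::size_t inFlight;   // algorithms scheduled or running
  std::size_t runnable;   // algorithms whose inputs are ready
  std::size_t unfinished; // algorithms that have not executed yet
};

// Assumed stall condition: work remains, but nothing is running or can run.
bool isSlotStalled(const SlotState& s) {
  return s.active && s.unfinished > 0 && s.inFlight == 0 && s.runnable == 0;
}

// Global check: reports a stall only when every active slot is stalled.
bool isGlobalStall(const std::vector<SlotState>& slots) {
  bool anyActive = false;
  for (const auto& s : slots) {
    if (!s.active) continue;
    anyActive = true;
    if (!isSlotStalled(s)) return false;
  }
  return anyActive;
}

// Event-wise check: returns the indices of the individual stalled slots.
std::vector<std::size_t> stalledSlots(const std::vector<SlotState>& slots) {
  std::vector<std::size_t> stalled;
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (isSlotStalled(slots[i])) stalled.push_back(i);
  }
  return stalled;
}
```

With one healthy slot and one stalled slot, the global check in this sketch stays silent while the event-wise check flags the stalled slot immediately, so that slot could be handled without waiting for the rest of the job to stall.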

This development will also help to redirect stalled events, in addition to failed ones, to the debug stream in ATLAS HLT. That functionality is currently being developed by @fwinkl.

As an aside, in AvalancheSchedulerSvc:

  • prefer pre-increment operators over post-increment ones;
  • remove unused data member.
Edited by Illya Shapoval
