Test for the scheduler hang on alg exceptions
This was discovered in the AthenaMT HLT: the scheduler can hang if an algorithm running in a sub-slot/view throws an exception.
Throwing an exception causes the event to be marked as failed. It also means that the part of the code that updates the algorithm execution state is bypassed - the state is left as "executing" without a returned StatusCode.
Since the AlgExecStateSvc does not (currently) understand sub-slots, an algorithm running in multiple sub-slots shares a single state instance across all sub-slots. If it runs successfully in sub-slot 1, but throws an exception in sub-slot 2, the state from sub-slot 1 will be used.
So, it is possible to have a failed event without any algorithms in ERROR state. The scheduler has no handling for this case, and hangs.
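The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyStateStore`, `ExecState`, `runInSubSlot` and `simulateEvent` are hypothetical names invented for illustration, not the real AlgExecStateSvc or scheduler interfaces, and the ordering (failing sub-slot first, successful sub-slot second) is chosen just to make the overwrite visible.

```cpp
// Toy model only: none of these types are the real Gaudi classes.
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

enum class ExecState { None, Executing, Done };  // hypothetical states

// Keeps one record per algorithm, or one per (algorithm, sub-slot).
class ToyStateStore {
public:
  explicit ToyStateStore(bool perSubSlot) : m_perSubSlot(perSubSlot) {}

  ExecState& state(const std::string& alg, int subSlot) {
    assert(!alg.empty());
    // In shared mode every sub-slot maps onto the same key.
    return m_states[{alg, m_perSubSlot ? subSlot : 0}];
  }

  // Number of records left in "executing", i.e. never completed.
  int countStuckExecuting() const {
    int n = 0;
    for (const auto& kv : m_states)
      if (kv.second == ExecState::Executing) ++n;
    return n;
  }

private:
  bool m_perSubSlot;
  std::map<std::pair<std::string, int>, ExecState> m_states;
};

// Returns false if the algorithm threw; the state update is then skipped.
bool runInSubSlot(ToyStateStore& store, const std::string& alg, int subSlot,
                  bool shouldThrow) {
  ExecState& s = store.state(alg, subSlot);
  s = ExecState::Executing;
  try {
    if (shouldThrow) throw std::runtime_error("failure in sub-slot");
    s = ExecState::Done;  // only reached on normal completion
    return true;
  } catch (const std::exception&) {
    return false;  // record is left as Executing
  }
}

struct EventOutcome {
  bool eventFailed;
  int stuck;
};

// Sub-slot 2 throws, then sub-slot 1 succeeds (assumed ordering).
EventOutcome simulateEvent(bool perSubSlot) {
  ToyStateStore store(perSubSlot);
  bool ok2 = runInSubSlot(store, "ToyAlg", 2, true);
  bool ok1 = runInSubSlot(store, "ToyAlg", 1, false);
  return {!(ok1 && ok2), store.countStuckExecuting()};
}
```

With `simulateEvent(false)` the event is failed but no record shows a problem, because the successful sub-slot overwrote the shared record. With `simulateEvent(true)` the failing sub-slot keeps its own record, which stays in the executing state and can be detected.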
There are a number of possible fixes for this - see issue #93 (closed). For now, this is the demonstration of the problem and a possible fix. Would appreciate feedback if there's a better idea.
Activity
added C++ framework task scheduling labels
mentioned in issue #93 (closed)
added 1 commit
- 8fd01a94 - Updated test so you can see the error output in the sub slot
Hi @leggett,
According to the comments, you wrote the AlgExecStateSvc, so I wondered whether you had any thoughts about this change.
As I see it, there are a couple of possible issues:
- The state bookkeeping could grow quite large. At present I don't delete any structures once allocated, to avoid overheads. The contents just get reset at the end of an event.
- An algorithm that runs in a view will probably not have an AlgExecState set at the whole-event level. This might be an issue for other objects that use the AlgExecStateSvc: https://acode-browser.usatlas.bnl.gov/lxr/ident?_i=algExecState
I could instead track sub-slot states within the AlgExecState object itself, with an optional argument to pass the EventContext when setting them in the first place. The AlgExecState could then return a summary decision across multiple sub-slots when queried at the whole-event level?
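A rough sketch of what that alternative could look like is given here. Everything in it is an assumption made for illustration: `ToyAlgExecState` and `ToyState` are hypothetical, a plain sub-slot index stands in for the EventContext, and the "worst state wins" summary rule is just one possible policy.

```cpp
// Hypothetical sketch, not the real AlgExecState interface.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>

// Ordered so that a larger value is the more severe outcome (assumed policy).
enum class ToyState { None = 0, Done = 1, Executing = 2, Error = 3 };

class ToyAlgExecState {
public:
  // Without a sub-slot the whole-event state is set; with one, only that
  // sub-slot's entry is touched.
  void set(ToyState s, std::optional<std::size_t> subSlot = std::nullopt) {
    if (subSlot) {
      m_subSlotStates[*subSlot] = s;
    } else {
      m_eventState = s;
    }
  }

  // State of one sub-slot; None if that sub-slot was never used.
  ToyState get(std::size_t subSlot) const {
    auto it = m_subSlotStates.find(subSlot);
    return it == m_subSlotStates.end() ? ToyState::None : it->second;
  }

  // Whole-event summary: the most severe state seen anywhere.
  ToyState summary() const {
    ToyState worst = m_eventState;
    for (const auto& kv : m_subSlotStates) worst = std::max(worst, kv.second);
    return worst;
  }

  // End of event: reset contents but keep the allocated entries.
  void reset() {
    m_eventState = ToyState::None;
    for (auto& kv : m_subSlotStates) kv.second = ToyState::None;
  }

private:
  ToyState m_eventState = ToyState::None;
  std::map<std::size_t, ToyState> m_subSlotStates;
};
```

In this sketch, a query at the whole-event level would see `Executing` or `Error` as soon as any sub-slot is in that state, even if another sub-slot completed successfully.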
Alternatively, we can just say that there's no need to track states individually in sub-slots, and only fix the exception handling. That might just be storing up trouble for later, though.
Regards, Ben
changed milestone to %v33r2
assigned to @clemenci
- Resolved by Marco Clemencic
/ci-test --merge
- [2020-05-20 09:49] Validation started with lhcb-master-mr#827
added 242 commits
- 0a496f90...c7d6fdfd - 240 commits from branch gaudi:master
- 8ff2c7ed - Merge remote-tracking branch 'upstream/master' into SubSlotExceptions
- 09ad2030 - Updated ref