
Test for the scheduler hang on alg exceptions

This problem was discovered in the AthenaMT HLT: the scheduler can hang if an algorithm running in a sub-slot/view throws an exception.

Throwing an exception causes the event to be marked as failed. It also means that the part of the code that updates the algorithm execution state is bypassed - the state is left as "executing" without a returned StatusCode.

Since the AlgExecStateSvc does not (currently) understand sub-slots, an algorithm running in multiple sub-slots shares a single state instance across all sub-slots. If it runs successfully in sub-slot 1, but throws an exception in sub-slot 2, the state from sub-slot 1 will be used.

So it is possible to have a failed event without any algorithms in the ERROR state. The scheduler has no handling for this case, and hangs.
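
To make the sequence above concrete, here is a minimal toy model in Python. It is not Gaudi code: `ToyExecStateSvc`, `run_in_subslot` and `simulate` are hypothetical names invented for illustration, and the only behaviour borrowed from the description is that state is keyed without the sub-slot and that an exception skips the state update. The execution order of the two sub-slots is also an assumption; in either order no ERROR state is recorded.

```python
from enum import Enum


class AlgState(Enum):
    EXECUTING = "executing"
    DONE = "done"
    ERROR = "error"


class ToyExecStateSvc:
    """Hypothetical stand-in for a state service with no sub-slot awareness."""

    def __init__(self):
        # Keyed by (slot, algorithm name) only: all sub-slots share one entry.
        self._states = {}

    def set_state(self, slot, alg, state):
        self._states[(slot, alg)] = state

    def states_for(self, slot):
        return {alg: s for (sl, alg), s in self._states.items() if sl == slot}


def run_in_subslot(svc, slot, alg, body):
    """Run `body` for one sub-slot; return True if the event must be failed."""
    svc.set_state(slot, alg, AlgState.EXECUTING)
    try:
        body()
    except RuntimeError:
        # Exception path: the state update below is bypassed,
        # so nothing ever records ERROR for this algorithm.
        return True
    svc.set_state(slot, alg, AlgState.DONE)
    return False


def simulate():
    svc = ToyExecStateSvc()

    def ok():
        pass

    def throws():
        raise RuntimeError("algorithm failed in this sub-slot")

    # Assumed order: the throwing sub-slot runs first, the good one second.
    failed_2 = run_in_subslot(svc, slot=0, alg="MyAlg", body=throws)
    failed_1 = run_in_subslot(svc, slot=0, alg="MyAlg", body=ok)

    event_failed = failed_1 or failed_2
    return event_failed, svc.states_for(0)


if __name__ == "__main__":
    event_failed, states = simulate()
    has_error_alg = AlgState.ERROR in states.values()
    print(event_failed, has_error_alg, states)
```

In this sketch the event is marked as failed while the shared state entry for `MyAlg` ends up as `DONE`, which is the inconsistent combination described above.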

There are a number of possible fixes for this; see issue #93 (closed). For now, this merge request demonstrates the problem and offers one possible fix. I would appreciate feedback if there is a better idea.

Edited by Benjamin Michael Wynne
