HLTScheduler: Only set non-zero application return code after allowed number of handled event failures
See the discussion in lhcb-dpa/prod-requests#76 for context.
In short(ish) ...
Offline processing now sets the StopAfterNFailures property of HLTControlFlowMgr to a value greater than the default of 1, to allow a limited number of well-handled single-event processing errors through. The idea is to handle rare, one-off processing errors by skipping the problem event so that the rest of the file can still be processed. Examples can be seen in the link above; they are mostly related to some sort of rare raw bank corruption (which needs addressing at source, but that is another story).
The problem is that Dirac only distinguishes two kinds of application return code, 'Zero' (SUCCESS) and 'Not Zero' (FATAL), and anything in the second category is considered a failed processing.
The logic in the HLTControlFlowMgr always set the application return code to Gaudi::ReturnCode::AlgorithmFailure (3) on the first error, which meant that even if StopAfterNFailures was set to a value >1 the return code was non-zero, and Dirac considered the job a failure.
This MR simply updates the logic so the return code is only set to Gaudi::ReturnCode::AlgorithmFailure if the StopAfterNFailures limit is met. Otherwise the return code is left untouched, the application still terminates cleanly (return code 0), and Dirac considers the file successfully processed.
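As a rough illustration of the intended behaviour, here is a minimal standalone sketch. It is not the actual HLTControlFlowMgr code: the function processFile, the local ReturnCode enum and the simple total-failure counter are all hypothetical stand-ins, and only the numeric value 3 for AlgorithmFailure is taken from the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the Gaudi return codes. Only the value 3 for
// AlgorithmFailure comes from the MR description.
enum ReturnCode : int { Success = 0, AlgorithmFailure = 3 };

// Simplified model of the updated logic (assumption: failures are counted
// as a running total over the file). eventFailed[i] is true if event i hit
// a handled processing error. Returns the application return code.
int processFile(const std::vector<bool>& eventFailed, int stopAfterNFailures) {
  assert(stopAfterNFailures >= 1);
  int failures = 0;
  for (bool failed : eventFailed) {
    if (!failed) continue;  // event processed normally
    ++failures;             // problem event is skipped
    if (failures >= stopAfterNFailures) {
      // Limit met: stop processing and only now flag a non-zero return code.
      return AlgorithmFailure;
    }
  }
  // Limit never met: terminate cleanly so Dirac sees a successful job.
  return Success;
}
```

With the default limit of 1 this reproduces the previous behaviour (the first failure gives a non-zero return code), while a larger limit lets a few skipped events through with return code 0.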
Targeting 2024-patches, even though data taking is close to the end, as this is primarily for offline sprucing processing.