HLTScheduler: Only set non-zero application return code after allowed number of handled event failures

See the discussion in lhcb-dpa/prod-requests#76 for context.

In short(ish) ...

Offline processing now sets the StopAfterNFailures property of HLTControlFlowMgr to a value >1 (1 being the default) to allow a limited number of well-handled single-event processing errors through. The idea is to handle rare, one-off single-event processing errors in a way that skips the problem event and allows the rest of the file to be processed. Examples of these can be seen in the above link; they are mostly related to some sort of rare raw bank corruption (which needs addressing at source, but that is another story).

The problem is that Dirac distinguishes only two kinds of application return code, 'Zero' (SUCCESS) and 'Not Zero' (FATAL), and anything in the second category is considered a failed processing.

The logic in the HLTControlFlowMgr always set the application return code to Gaudi::ReturnCode::AlgorithmFailure (3) on the first error instance. This means that even if StopAfterNFailures was set >1, the return code was non-zero and Dirac considered the job a failure.

This MR simply updates the logic so that the return code is only set to Gaudi::ReturnCode::AlgorithmFailure if the StopAfterNFailures limit is reached. If the limit is not reached, the return code is left untouched, the application still terminates cleanly (return code 0), and Dirac therefore considers the file successfully processed.
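
For illustration only, the intended behaviour can be sketched in a few lines of standalone C++. This is not the actual HLTControlFlowMgr code: `FailureTracker`, `recordFailure` and the constants are hypothetical names, and the exact comparison against the limit (`>=`) is an assumption. Only the value 3 for Gaudi::ReturnCode::AlgorithmFailure is taken from the description above.

```cpp
#include <cassert>

// Values as quoted in the description: 0 is success, and
// Gaudi::ReturnCode::AlgorithmFailure is 3. Redefined here so the
// sketch compiles without Gaudi.
constexpr int kSuccess = 0;
constexpr int kAlgorithmFailure = 3;

// Hypothetical stand-in for the failure bookkeeping; it is NOT the
// real HLTControlFlowMgr implementation.
class FailureTracker {
public:
  explicit FailureTracker(int stopAfterNFailures) : m_limit(stopAfterNFailures) {
    assert(stopAfterNFailures >= 1); // this sketch assumes a positive limit
  }

  // Record one handled event failure.
  // Returns true if the limit is reached and processing should stop.
  bool recordFailure() {
    ++m_failures;
    if (m_failures >= m_limit) {
      // Only now is the application return code made non-zero.
      m_returnCode = kAlgorithmFailure;
      return true;
    }
    // Below the limit: skip the event and leave the return code at 0.
    return false;
  }

  int returnCode() const { return m_returnCode; }
  int failures() const { return m_failures; }

private:
  int m_limit;
  int m_failures = 0;
  int m_returnCode = kSuccess;
};
```

With a limit of 3 in this sketch, the first two failures leave the return code at 0 and the third sets it to 3. With a limit of 1, the very first failure sets it to 3.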

Targeting 2024-patches, even though data taking is close to the end, as this is primarily for offline sprucing processing.

FYI @nskidmor @cburr @mslater

Edited by Christopher Rob Jones