Add HLT timeout handling and improve handling of other errors
Implemented handling of soft timeout in the online HLT framework (ATR-16897), improved a few aspects of general error handling (ATR-19248), and fixed a few small things.
Changes:
- Avoid using `ATH_CHECK` for calls which can return `Athena::Status::TIMEOUT`, because it prints out a FATAL message in this case.
- Return `Athena::Status::TIMEOUT` on timeout in the MTCalibPeb test tool/alg.
- Define `hltonl::PSCErrorCode::TIMEOUT` in TrigKernel.
- Fix a few uninitialised members in HltEventLoopMgr (mainly related to timeout handling).
- HltEventLoopMgr: on non-success EventStatus, check if timeout happened and, if yes, send event to a dedicated timeout debug stream.
- HltEventLoopMgr: use EventID from context instead of the old EventInfo in error print-outs.
- HLTResultMTByteStreamCnv: make sure failed events go only to the debug streams (remove stream tags of other type), but still save all the HLT results to this stream, if they're available.
- TrigOutputHandling tools: check validity of HLTSummary handle before using it.
- Add a test of the timeout handling in TrigP1Test (job options and ART test shell script).
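
The status check and the stream-tag filtering described in the list above can be sketched roughly as follows. This is a self-contained illustration only, not the code in this MR: `Status`, `StreamTag`, `processEvent`, `execute`, `failedEventTags` and the stream names are hypothetical stand-ins, and the real Athena and HLT types are not used.

```cpp
#include <algorithm>
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for a status code with a dedicated timeout value.
enum class Status { SUCCESS, FAILURE, TIMEOUT };

// Hypothetical stand-in for a stream tag: a name plus a type such as "physics" or "debug".
struct StreamTag {
  std::string name;
  std::string type;
};

// Stand-in for a call that may hit the soft timeout.
Status processEvent(bool timedOut) {
  return timedOut ? Status::TIMEOUT : Status::SUCCESS;
}

// Check the status explicitly instead of using a macro that logs FATAL
// for every non-success value, so that a timeout is reported more mildly.
Status execute(bool timedOut) {
  const Status sc = processEvent(timedOut);
  if (sc == Status::TIMEOUT) {
    std::cerr << "WARNING timeout reached, aborting processing of this event\n";
    return sc;
  }
  if (sc != Status::SUCCESS) {
    std::cerr << "ERROR event processing failed\n";
    return sc;
  }
  return Status::SUCCESS;
}

// For a failed event, drop all non-debug stream tags and add the debug
// stream matching the failure type (stream names are invented here).
std::vector<StreamTag> failedEventTags(std::vector<StreamTag> tags, Status sc) {
  assert(sc != Status::SUCCESS);  // only meaningful for failed events
  tags.erase(std::remove_if(tags.begin(), tags.end(),
                            [](const StreamTag& t) { return t.type != "debug"; }),
             tags.end());
  const std::string name =
      (sc == Status::TIMEOUT) ? "HypotheticalTimeoutDebug" : "HypotheticalErrorDebug";
  tags.push_back(StreamTag{name, "debug"});
  return tags;
}
```

In this sketch a caller would invoke `execute` and, on any non-success status, pass the event's existing tags through `failedEventTags` before writing the event out, so that it lands only in debug streams.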
Questions already resolved in the discussion below:
- Shall we adapt `ATH_CHECK` to not print FATAL on timeout status?
  Answer: The error policy will be reviewed with software coordinators later; for now this MR can be accepted as it is.
- I realised it doesn't make much sense to try constructing an HLT result if DecisionSummaryMaker didn't run, and this currently is the case for any timeout/failure. If we want to save as much information as possible to the debug stream, maybe we should run it explicitly in the online framework, and not as an algorithm?
  Answer: The partial summary information obtained by running the summary maker on aborted events would not be useful for debugging offline, so there is no point in running it.
Edited by Rafal Bielski