Add HLT timeout handling and improve handling of other errors
Implemented handling of soft timeout in the online HLT framework (ATR-16897), improved a few aspects of general error handling (ATR-19248), and fixed a few small things.
Changes:
- Avoid using
ATH_CHECK
for calls which can returnAthena::Status::TIMEOUT
, because it prints out a FATAL message in this case. - Return
Athena::Status::TIMEOUT
on timeout in the MTCalibPeb test tool/alg. - Define
hltonl::PSCErrorCode::TIMEOUT
in TrigKernel - Fix a few uninitialised members in HltEventLoopMgr (mainly related to timeout-handling).
- HltEventLoopMgr: on non-success EventStatus, check if timeout happened and, if yes, send event to a dedicated timeout debug stream.
- HltEventLoopMgr: use EventID from context instead of the old EventInfo in error print-outs.
- HLTResultMTByteStreamCnv: make sure failed events go only to the debug streams (remove stream tags of other type), but still save all the HLT results to this stream, if they're available.
- TrigOutputHandling tools: check validity of HLTSummary handle before using it.
- Add a test of the timeout handling in TrigP1Test (job options and ART test shell script).
Questions already resolved in the discussion below:
- Shall we adapt
ATH_CHECK
to not print FATAL on timeout status?
Answer: The error policy will be reviewed with software coordinators later, for now this MR can be accepted as it is. - I realised it doesn't make much sense to try constructing an HLT result if DecisionSummaryMaker didn't run, and this currently is the case for any timeout/failure. If we want to save as much information as possible to the debug stream, maybe we should run it explicitly in the online framework, and not as an algorithm?
Answer: The partial summary information obtained by running summary maker on aborted events would not be useful for debugging offline, so there is no point running this.
Merge request reports
Activity
It does propagate the StatusCode correctly, but it prints a FATAL message because apparently TIMEOUT is not "recoverable":
https://gitlab.cern.ch/atlas/athena/blob/master/Control/AthenaKernel/AthenaKernel/errorcheck.h#L365-374Thinking about this some more, printing a FATAL is at least consistent. It's the same message level that's printed if you return
StatusCode::FAILURE
from an algorithm/tool. And the event does get send to the debug stream just like for any other failures duringexecute
. I would suggest we keep it like that for the moment.Thinking more from the online perspective, shouldn't FATAL mean that the application cannot continue running and will exit? I guess this is what happens in offline athena when an algorithm returns StatusCode::FAILURE, but that's not the case online. We don't want to propagate a FATAL message from HLT to ERS for issues which don't stop the application.
StreamTagMakerTool and TriggerBitsMakerTool rely on the
HLTSummary
information, so without it we cannot get the PEB information (including data scouting) or HLT bits in the debug stream result. The serialiser doesn't depend onHLTSummary
, so you are right that the individual DecisionContainers will be serialised. Since debug streams do full event building, I add the "main" HLT result to them by default, but if StreamTagMakerTool fails, no other HLT result will be produced. That's because we create only HLT result ROBFragments which are explicitly requested to be written out in at least one stream tag.Different chains will accept the event after different numbers of steps.
If we run the DecisionSummaryMaker after soft timing out in (say) step 5, then chains which accepted the event already somewhere in step 1-4 would be able to properly pass through the DecisionSummaryMaker and set bits, stream tags.
Not sure we'd want this, really.
This merge request affects 7 packages:
- HLT/Trigger/TrigControl/TrigExamples/TrigExPartialEB
- HLT/Trigger/TrigControl/TrigKernel
- HLT/Trigger/TrigControl/TrigServices
- Trigger/TrigEvent/TrigSteeringEvent
- Trigger/TrigSteer/TrigHLTResultByteStream
- Trigger/TrigSteer/TrigOutputHandling
- Trigger/TrigValidation/TrigP1Test
CI Result FAILUREAthena AthSimulation externals cmake make required tests optional tests Full details available at NICOS MR-21931-2019-03-16-02-26
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST 35533]As discussed with @fwinkl, the ATH_CHECK behaviour and the general error policy will need to be re-discussed with a wider audience probably next week. This should not hold this MR and it should be fine to go in the current form. I will un-WIP now.
This merge request affects 7 packages:
- HLT/Trigger/TrigControl/TrigExamples/TrigExPartialEB
- HLT/Trigger/TrigControl/TrigKernel
- HLT/Trigger/TrigControl/TrigServices
- Trigger/TrigEvent/TrigSteeringEvent
- Trigger/TrigSteer/TrigHLTResultByteStream
- Trigger/TrigSteer/TrigOutputHandling
- Trigger/TrigValidation/TrigP1Test
added review-pending-level-1 label
CI Result FAILUREAthena AthSimulation externals cmake make required tests optional tests Full details available at NICOS MR-21931-2019-03-19-02-15
Athena: number of compilation errors 1, warnings 1
AthSimulation: number of compilation errors 0, warnings 2
For experts only: Jenkins output [CI-MERGE-REQUEST 35610]This merge request affects 7 packages:
- HLT/Trigger/TrigControl/TrigExamples/TrigExPartialEB
- HLT/Trigger/TrigControl/TrigKernel
- HLT/Trigger/TrigControl/TrigServices
- Trigger/TrigEvent/TrigSteeringEvent
- Trigger/TrigSteer/TrigHLTResultByteStream
- Trigger/TrigSteer/TrigOutputHandling
- Trigger/TrigValidation/TrigP1Test
CI Result SUCCESSAthena AthSimulation externals cmake make required tests optional tests Full details available at NICOS MR-21931-2019-03-19-16-14
Athena: number of compilation errors 0, warnings 1
AthSimulation: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST 35657]added review-pending-level-2 label and removed review-pending-level-1 label
- Resolved by Rafal Bielski
added review-user-action-required label and removed review-pending-level-2 label
This merge request affects 7 packages:
- HLT/Trigger/TrigControl/TrigExamples/TrigExPartialEB
- HLT/Trigger/TrigControl/TrigKernel
- HLT/Trigger/TrigControl/TrigServices
- Trigger/TrigEvent/TrigSteeringEvent
- Trigger/TrigSteer/TrigHLTResultByteStream
- Trigger/TrigSteer/TrigOutputHandling
- Trigger/TrigValidation/TrigP1Test
added review-pending-level-1 label and removed review-user-action-required label
added review-pending-level-2 label and removed review-pending-level-1 label
Back to @grancagn for final approval
CI Result SUCCESSAthena AthSimulation externals cmake make required tests optional tests Full details available at NICOS MR-21931-2019-03-21-18-53
Athena: number of compilation errors 0, warnings 1
AthSimulation: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST 35788]mentioned in merge request !22094 (merged)
Warning seems to be related to !21787 (comment 2458512) and not from this MR.
Approving as agreed.
Edited by Sergio Grancagnoloadded review-approved label and removed review-pending-level-2 label
mentioned in commit 1e961818
added sweep:ignore label