[TrigServices] Fix HLT timeout monitoring in the case of failed events (!50665) · Merge requests · atlas / athena

Rafal Bielski requested to merge rbielski/athena:hlt-reset-timer-failed-event into master Feb 21, 2022

Two commits:

Fix a bug where the timeout monitoring flag and timer were not reset after a failed event; they were only reset after events that finished successfully. A helper function is introduced to avoid code duplication.
Change the soft timeout error message to state only the limit that was exceeded, not by how much. The actual elapsed time is not very relevant, and having identical messages across occurrences should make them easier to filter or grep. Suggested by @stelzer.
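
For illustration only, the two changes can be sketched roughly as follows. This is not the actual TrigServices code: `SlotTimeout`, `reset_timeout`, `on_event_done`, `on_event_failed`, `soft_timeout_message` and the message wording are all hypothetical names chosen for this sketch.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SlotTimeout:
    """Hypothetical per-slot timeout bookkeeping (not the real TrigServices types)."""
    active: bool = False                                  # is this slot being monitored?
    start: float = field(default_factory=time.monotonic)  # when the current event started


def reset_timeout(slot: SlotTimeout) -> None:
    """Single helper shared by the success and failure paths."""
    slot.active = False
    slot.start = time.monotonic()


def on_event_done(slot: SlotTimeout) -> None:
    # Successful completion: stop monitoring this slot.
    reset_timeout(slot)


def on_event_failed(slot: SlotTimeout) -> None:
    # Error handling would go here; the reset must still happen afterwards.
    reset_timeout(slot)


def soft_timeout_message(limit_ms: int) -> str:
    """Build a message that depends only on the configured limit (assumed wording)."""
    return f"Soft timeout: event processing time exceeded the limit of {limit_ms} ms"
```

Because the message contains only the configured limit, every occurrence with the same configuration produces the same string, which is what makes filtering or grepping straightforward.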

This addresses issues observed today at P1 during M11 tests with the ATLAS partition. The issue occurs in a very specific scenario that is difficult to reproduce offline, so this particular case couldn't be tested; I only tested the change with test_trigP1_timeout_build.py. For future reference, the scenario is:

An event in slot N fails with an error.
The processing continues but there are no new L1 events (the trigger is on hold or L1 rate is low), so the slot is not refilled for a long time.
When the slot stays empty for longer than the timeout threshold, a soft timeout error is incorrectly reported for it.
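
The scenario above can be reproduced in a toy model. The sketch below is an assumption-laden simplification, not the real event loop: `Slot`, `timed_out_slots` and `simulate` are hypothetical names, and the times (a 10 s limit, 60 s without new events) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    active: bool = False   # timeout monitoring flag
    start: float = 0.0     # time at which the current event started


def timed_out_slots(slots, now, limit):
    """Return indices of slots whose monitored event has run longer than `limit`."""
    return [i for i, s in enumerate(slots) if s.active and now - s.start > limit]


def simulate(reset_on_failure: bool, limit: float = 10.0, idle: float = 60.0):
    """Replay the scenario with or without resetting the flag after a failure."""
    slots = [Slot()]
    # t = 0: an event is assigned to slot 0 and monitoring starts.
    slots[0].active, slots[0].start = True, 0.0
    # t = 1: the event fails with an error.
    if reset_on_failure:
        slots[0].active = False
    # No new L1 events arrive, so the slot stays empty for `idle` seconds.
    return timed_out_slots(slots, now=1.0 + idle, limit=limit)
```

In this model, `simulate(reset_on_failure=False)` flags slot 0 even though nothing is running in it, while `simulate(reset_on_failure=True)` reports no timeouts.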

FYI @astruebi, @palacino, @cmerlass

Edited Feb 21, 2022 by Rafal Bielski

