Improve handling of Frontier warnings

Mark Stockton requested to merge mark/athena:master-atr11044 into master

This resolves the problems seen in ATR-11044 (and in recent jobs, ATR-21488), and should stop walltime being wasted on jobs that succeed but are incorrectly marked as ERROR.

If a line from Frontier (via TrigConf2COOLLib.py) starts with the format warn [frontier (with any amount of whitespace after warn), it will now be printed as a single-line INFO message. The rest of the log will not be produced (unless in DEBUG), and the message will no longer be promoted to ERROR. For example:

RDOtoRDOTrigger 01:17:58 Py:TrigConf2COOLLib.py    INFO Caught warning from frontier: 04:42:40 warn  [frontier.c:1025]: Request 1 on chan 1 failed at Fri May 22 04:42:32 2020: -9 [fn-socket.c:107]: network error on connect to 172.16.122.51:3128: No route to host
RDOtoRDOTrigger 01:17:58 Py:TrigConf2COOLLib.py    INFO Successful execution of command
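The matching described above could be sketched roughly as follows. This is only an illustration, not the code in this merge request: the pattern, the optional leading timestamp (assumed from the example line above) and the names FRONTIER_WARNING, is_frontier_warning and split_frontier_warnings are all hypothetical.

```python
import re

# Hypothetical pattern (an assumption, not the actual TrigConf2COOLLib.py code):
# an optional HH:MM:SS timestamp, then "warn", one or more whitespace
# characters, then "[frontier".
FRONTIER_WARNING = re.compile(r"^(?:\d{2}:\d{2}:\d{2}\s+)?warn\s+\[frontier")


def is_frontier_warning(line):
    """Return True if the line looks like a non-fatal Frontier warning."""
    return FRONTIER_WARNING.match(line) is not None


def split_frontier_warnings(lines):
    """Split log lines into (frontier_warnings, other_lines), keeping order."""
    warnings, others = [], []
    for line in lines:
        if is_frontier_warning(line):
            warnings.append(line)
        else:
            others.append(line)
    return warnings, others
```

With a split like this, each matched line could be reported as a single INFO message while the remaining lines are only printed in DEBUG.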

An explicit test with this known error message added to the log showed that it no longer stops Reco_tf from completing during file validation.

The messages are still printed to highlight any connection issues that slow the job, and to help diagnose potential issues later in the job in case this string matching is too strict. Problems are unlikely, though, as Frontier warnings are defined as: "mainly have to do with non-fatal connection problems and retries" http://frontier.cern.ch/dist/FrontierClientUsage.html

Note there is also an extra INFO statement to mark when TrigConf2COOLLib.py has finished executing the command. I also tested outputting the log line by line, but the output is large, so the current format is neater and avoids confusion with other messages.

If there is an actual warning/error/fatal from Frontier that doesn't match the search string, the log will be written out in the following format (here reflecting an ERROR), with the start and end of the log now made clearer:

RDOtoRDOTrigger 01:26:26 Py:TrigConf2COOLLib.py ERROR Log file from execution of command:
RDOtoRDOTrigger 01:26:26 ========================================
RDOtoRDOTrigger 01:26:26 JOB SETUP:
RDOtoRDOTrigger 01:26:26 ...
RDOtoRDOTrigger 01:26:26 SessionMgr                     INFO Closing session TRIGGERDBMC
RDOtoRDOTrigger 01:26:26 ========================================
RDOtoRDOTrigger 01:26:26 End of log file from TrigConf2COOLLib.py
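A delimited dump like the one above could be produced with something along these lines. This is a minimal sketch using the standard logging module, not the code in this merge request: the function name dump_command_log, the SEPARATOR constant and its width are assumptions for illustration, and the real output may print the body lines without a level.

```python
import logging

# Assumed separator; the width is illustrative only.
SEPARATOR = "=" * 40


def dump_command_log(logger, level, log_lines):
    """Write the command log at the given level, with clear start/end markers."""
    logger.log(level, "Log file from execution of command:")
    logger.log(level, SEPARATOR)
    for line in log_lines:
        # Strip trailing newlines so each log line becomes one message.
        logger.log(level, line.rstrip("\n"))
    logger.log(level, SEPARATOR)
    logger.log(level, "End of log file from TrigConf2COOLLib.py")
```

For example, dump_command_log(logging.getLogger("TrigConf2COOLLib.py"), logging.ERROR, lines) would emit the header, the log body and the footer at ERROR level.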

Reco TRF would fail in this case, but it should hopefully be clearer where the error is when first looking at the log.
