Increase timeout in running jobs in dirac (!90) · Merge requests · lhcb-dpa / Analysis Productions / LbAPI

Emir Muhammad requested to merge emmuhamm_increase_timeout into main Oct 10, 2023

This merge request increases the timeout from 10 seconds to 60 seconds in an attempt to alleviate the timeout issue that has recently been plaguing some MC requests. It's probably caused by a change in how CVMFS is mounted on OpenShift. For the moment, the longer timeout should alleviate some of the timeouts that are already happening :)

Copied from discussions between @cburr and me.

The CI submitted plenty of jobs, but some of them were cancelled almost immediately.

Resubmitting by pressing 'retry cancelled' whenever it happens, sometimes several times, usually fixes the issue, but this is quite concerning behaviour. Example: lhcb-simulation/mc-requests!255 (merged), and a couple of others visible in the pipelines.

I had a look at Sentry and saw quite a large number of timeouts in the last week. Most of them seem to correspond to this particular error: https://lhcb-dpa-sentry.web.cern.ch/lhcb-dpa/lbapi/issues/43/events/11396/ (which corresponds to this event, I think). If I understood the traceback correctly, it is caused by a timeout in check_dirac_wms_status. I'm not sure whether this is true, or how to fix it so that it doesn't happen.
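
For illustration only, here is a minimal sketch of how a slow status query behaves under a short versus a longer timeout. This is not the actual LbAPI code: `slow_status_query` and `query_with_timeout` are hypothetical names, the sleep stands in for a slow DIRAC WMS call, and only the 10 s and 60 s values come from this merge request.

```python
import asyncio

# Values taken from the merge request description: old limit 10 s, new limit 60 s.
OLD_TIMEOUT_SECONDS = 10
NEW_TIMEOUT_SECONDS = 60


async def slow_status_query(delay: float) -> str:
    """Hypothetical stand-in for a slow call to the DIRAC WMS."""
    await asyncio.sleep(delay)
    return "Running"


async def query_with_timeout(delay: float, timeout: float) -> str:
    """Return the job status, or "TimedOut" if the query exceeds `timeout` seconds."""
    try:
        return await asyncio.wait_for(slow_status_query(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # The caller gave up before the query finished.
        return "TimedOut"


if __name__ == "__main__":
    # Scaled down so the demo runs quickly: the query takes 0.2 s.
    print(asyncio.run(query_with_timeout(0.2, timeout=0.05)))  # limit too short
    print(asyncio.run(query_with_timeout(0.2, timeout=1.0)))   # limit long enough
```

A longer limit only helps if the underlying call eventually returns. If the slowdown really comes from the CVMFS mount, the larger value hides the symptom rather than fixing the cause.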
