Review scheduler retry logic for archive and retrieve

changed the description

added workflowAccepted in CTA-old label

changed the description

At today's dev meeting we discussed that a solution for this could be to put the queue to sleep in case of repeated failures.

This can therefore follow on from the backpressure configuration, which also puts queues to sleep, see https://gitlab.cern.ch/cta/operations/-/issues/359.
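To make the idea concrete, here is a rough sketch of putting a queue to sleep after repeated failures. It is not CTA code: the class name, the threshold of consecutive failures and the sleep duration are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueSleepState:
    """Hypothetical per-queue failure tracker (illustrative, not CTA code)."""

    failure_threshold: int = 3    # assumed: consecutive failures before sleeping
    sleep_seconds: float = 900.0  # assumed: how long the queue sleeps
    consecutive_failures: int = 0
    sleep_until: float = 0.0

    def record_failure(self, now: float) -> None:
        """Count a failure and put the queue to sleep once the threshold is hit."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.sleep_until = now + self.sleep_seconds
            self.consecutive_failures = 0

    def record_success(self) -> None:
        """Any success resets the failure streak."""
        self.consecutive_failures = 0

    def is_sleeping(self, now: float) -> bool:
        """A sleeping queue should not be considered for new mounts."""
        return now < self.sleep_until
```

Timestamps are passed in explicitly so the behaviour is easy to test; a real implementation would take them from the scheduler's clock.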

assigned to @mvelosob

added workflowAssigned in CTA-old label and removed workflowAccepted in CTA-old label

I will take a look at this once !70 has been merged.

The most obvious case where we should put the queue to sleep is when the EOS disk instance is unreachable. Are there any other cases we have seen in production where we also want to put the queue to sleep?

added workflowIn Progress in CTA-old label and removed workflowAssigned in CTA-old label

I am implementing what was discussed: putting the queue to sleep when the EOS instance becomes unreachable. We should discuss what other error scenarios warrant putting the queue to sleep. To do so, I propose everyone add errors they have seen in production to https://gitlab.cern.ch/cta/CTA/-/issues/1099 so we can discuss them in a future meeting.

created branch 1023-review-scheduler-retry-logic-for-archive-and-retrieve to address this issue

As mentioned in the meeting:

The EOS unavailable problem has been solved for retrieve (and the fix is merged in master).

I am looking for a way to solve it for archive as well.

Other failure reasons should be discussed in https://gitlab.cern.ch/cta/CTA/-/issues/1099.

I have been looking into solving this for archive; it is more complicated. The tape mount creates a ReadFile from the URL in the archive request here. If I am not mistaken, this will be an XrootReadFile. When the file is opened, if the respective instance is not available, a generic exception is thrown here.

So in CTA code we have no way to distinguish an unreachable instance from other errors that may occur, for which we do not want to put the queue to sleep.
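One way to picture what is missing: if the open call raised a dedicated error type for an unreachable instance, the caller could decide whether to sleep the queue. The sketch below is purely illustrative; the exception names and the helper are hypothetical and do not exist in CTA or XRootD.

```python
class DiskReadError(Exception):
    """Hypothetical generic failure when opening a source file."""


class DiskInstanceUnreachable(DiskReadError):
    """Hypothetical specific failure: the disk instance could not be contacted."""


def should_sleep_queue(error: Exception) -> bool:
    """Sleep the queue only for unreachable-instance errors, not for other failures."""
    return isinstance(error, DiskInstanceUnreachable)
```

With only a generic exception available, every failure looks like `DiskReadError` in this sketch, so `should_sleep_queue` can never return true; that is the limitation described above.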

Furthermore, for retrieve, how long the queue is put to sleep is given by the disk system of the retrieve mount. Archive has no such disk system associated (i.e. an entry in the DISK_SYSTEM table). A possible alternative is to just have some default value configured in /etc/cta/cta-frontend-xrootd.conf or to connect the DISK_SYSTEM to archive as well (more complicated).

For the second problem, we can use the srcURL of the job and get the disk system from there. This is effectively tying the new disk system table work to archive, which we planned to do anyway.
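A minimal sketch of that lookup, combined with the default-value fallback mentioned earlier. The `DiskSystem` fields, the regular-expression matching and the example URLs are assumptions for illustration only; they are not taken from the actual DISK_SYSTEM schema.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DiskSystem:
    """Hypothetical stand-in for a disk system entry."""

    name: str
    url_regexp: str    # assumed: pattern matched against the job's srcURL
    sleep_time_s: int  # assumed: how long to sleep the queue, in seconds


def sleep_time_for(src_url: str,
                   disk_systems: Sequence[DiskSystem],
                   default_sleep_s: int) -> int:
    """Return the sleep time of the first disk system matching src_url.

    Falls back to a configured default when no disk system matches.
    """
    for disk_system in disk_systems:
        if re.search(disk_system.url_regexp, src_url):
            return disk_system.sleep_time_s
    return default_sleep_s


# Example with made-up values:
systems = [DiskSystem("example-eos", r"^root://eos\.example\.org/", 600)]
matched = sleep_time_for("root://eos.example.org//eos/data/file1", systems, 300)
fallback = sleep_time_for("root://other.example.org//data/file2", systems, 300)
```

Here `matched` takes the sleep time from the matching disk system, while `fallback` uses the default because no entry matches the URL.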

Marking this as blocked until we deploy the disk instance table.

This can proceed after 4.7.0.

added workflowBlocked in CTA-old label and removed workflowIn Progress in CTA-old label

changed the description

added workflowAssigned in CTA-old label and removed workflowBlocked in CTA-old label
