
Review scheduler retry logic for archive and retrieve

The scheduler implements retry logic for archive and retrieve requests based on a fixed number of attempts over a fixed number of mounts. These numbers are currently hardcoded.
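As a minimal sketch of what pulling those limits out of the code might look like, the snippet below models the attempt/mount counting as a configurable policy. All names (`RetryPolicy`, `mayRetry`, the default values) are illustrative assumptions, not CTA's actual API:

```cpp
#include <cstdint>

// Hypothetical sketch: the hardcoded attempt/mount limits extracted
// into a configurable policy. Names and defaults are illustrative only.
struct RetryPolicy {
  uint32_t maxRetriesWithinMount = 2;  // attempts before giving up on the current mount
  uint32_t maxTotalMounts = 2;         // mounts to try before failing the request
  uint32_t maxTotalRetries = 6;        // overall cap across all mounts
};

struct RetryState {
  uint32_t retriesWithinMount = 0;
  uint32_t totalMounts = 1;
  uint32_t totalRetries = 0;
};

// Decide whether a failed archive/retrieve job may be retried.
bool mayRetry(const RetryPolicy &p, RetryState &s) {
  ++s.totalRetries;
  if (s.totalRetries > p.maxTotalRetries) return false;
  if (++s.retriesWithinMount > p.maxRetriesWithinMount) {
    // Exhausted this mount: requeue for a fresh mount if any remain.
    if (s.totalMounts >= p.maxTotalMounts) return false;
    ++s.totalMounts;
    s.retriesWithinMount = 0;
  }
  return true;
}
```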

These numbers, in conjunction with some timeouts, mean that under some circumstances (e.g. EOS unavailable) the elapsed time over which we retry is very short, on the order of minutes depending on the queue. We then give up and register a failed request. This was observed when eosctapublic went down over a weekend.

We should try harder to archive and to retrieve. EOS instances will come back on a timescale of hours (hopefully less), so the retry window should stretch at least that far. One way would be to encapsulate the existing retry logic and run it again a certain number of hours later. We could also get smarter at detecting why a mount fails.
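A rough sketch of the "run it again some hours later" idea, building on the policy above: when the normal retry cycle is exhausted, park the request and requeue it after a configurable delay instead of failing outright. Again, `DeferredRetry`, `scheduleDeferral`, and the default values are hypothetical, not existing CTA code:

```cpp
#include <chrono>
#include <cstdint>
#include <ctime>

// Hypothetical sketch: defer the whole retry cycle instead of
// immediately registering a failed request.
struct DeferredRetry {
  std::chrono::hours delay{4};   // how long to park the request between cycles
  uint32_t maxDeferrals = 6;     // e.g. roughly a day of coverage for an EOS outage
};

// Called when mayRetry() above returns false. Returns the earliest
// time at which the request should be re-attempted, or 0 if the
// deferrals are also exhausted and the request must finally fail.
time_t scheduleDeferral(const DeferredRetry &cfg, uint32_t &deferralsDone) {
  if (deferralsDone >= cfg.maxDeferrals) return 0;  // report a failed request
  ++deferralsDone;
  const auto delaySec =
      std::chrono::duration_cast<std::chrono::seconds>(cfg.delay).count();
  return time(nullptr) + delaySec;
}
```

With defaults like these, a request would keep being re-attempted for many hours rather than minutes, which would have covered the eosctapublic weekend outage described above.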

CTA operations tickets affected

Here are all CTA operations retry logic issues.
