Review scheduler retry logic for archive and retrieve
The scheduler implements retry logic for archives and retrieves based on a fixed number of attempts per mount over a fixed number of mounts. These numbers are currently hardcoded:
- Retrieves: 2 mounts * 3 retries within mount = 6 total retries
- Archives: 1 mount * 2 retries within mount = 2 total retries
- https://gitlab.cern.ch/cta/CTA/-/blob/main/objectstore/RetrieveRequest.cpp#L535
- https://gitlab.cern.ch/cta/CTA/-/blob/main/scheduler/OStoreDB/OStoreDB.cpp#L766
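To make the arithmetic explicit, here is a minimal Python sketch of the current retry budget. It is illustrative only and is not the CTA code: `RetryPolicy`, `max_mounts`, `retries_within_mount` and `total_retries` are hypothetical names chosen for this example. Only the numeric values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical container for the two hardcoded limits."""
    max_mounts: int            # number of mounts over which we retry
    retries_within_mount: int  # attempts allowed within a single mount

    @property
    def total_retries(self) -> int:
        # The overall budget is the product of the two limits.
        return self.max_mounts * self.retries_within_mount


# Values as currently hardcoded, per the list above.
RETRIEVE_POLICY = RetryPolicy(max_mounts=2, retries_within_mount=3)
ARCHIVE_POLICY = RetryPolicy(max_mounts=1, retries_within_mount=2)

print(RETRIEVE_POLICY.total_retries)  # 6
print(ARCHIVE_POLICY.total_retries)   # 2
```

Moving the two limits into a single structure like this would also be a natural first step towards making them configurable instead of hardcoded.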
These numbers, in conjunction with some timeouts, mean that under some circumstances (e.g. EOS unavailable) the elapsed time over which we retry is very short (on the order of minutes, depending on the queue). We then give up and register a failed request. This was observed when eosctapublic went down over a weekend.
We should try harder to archive and to retrieve. EOS instances will come back on a timescale of hours (hopefully less...), so we should at least stretch this far. One way would be to encapsulate the existing logic and run it again a certain number of hours later. We could also get smarter at detecting why a mount fails.
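The "encapsulate and run again later" idea could look roughly like the Python sketch below. This is an assumption about one possible shape and not a proposal for the actual implementation: `OuterRetryPolicy`, `run_with_outer_retries`, `max_rounds` and `delay_hours` are hypothetical names, and `run_existing_logic` stands in for the current per-mount retry logic, returning `True` on success.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OuterRetryPolicy:
    """Hypothetical outer policy wrapped around the existing retry logic."""
    max_rounds: int     # how many times the existing logic is run in total
    delay_hours: float  # wait between two consecutive rounds


def run_with_outer_retries(
    run_existing_logic: Callable[[], bool],
    policy: OuterRetryPolicy,
    sleep: Callable[[float], None],
) -> bool:
    """Run the existing logic up to max_rounds times, waiting between rounds.

    Returns True as soon as one round succeeds, False if every round fails.
    """
    for round_index in range(policy.max_rounds):
        if run_existing_logic():
            return True
        # Only wait if another round will follow.
        if round_index < policy.max_rounds - 1:
            sleep(policy.delay_hours * 3600)
    return False


# Simulated usage: the first two rounds fail (e.g. EOS down), the third succeeds.
outcomes = iter([False, False, True])
waits = []  # records the requested waits instead of actually sleeping
succeeded = run_with_outer_retries(
    run_existing_logic=lambda: next(outcomes),
    policy=OuterRetryPolicy(max_rounds=4, delay_hours=2),
    sleep=waits.append,
)
```

In a real scheduler the wait would more likely be a "do not retry before" timestamp on the requeued request than a blocking sleep. The injected `sleep` callable is only there to keep the sketch testable.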
CTA operations tickets affected
Here are all CTA operations retry logic issues.