Safely handle empty shards in object store
As discussed in https://gitlab.cern.ch/cta/operations/-/issues/1190, we are experiencing severe problems (unmounted queues, files failing to archive) caused by empty shard objects.
Instead of aborting the request popping when it encounters a missing shard, the object store should log the missing shard as an ERROR and continue popping the remaining objects.
This is a very important fix, to be deployed ASAP!
Reproduce steps in CI
- Start a CI instance with a local objectstore (this makes it easy to manipulate the objects directly in the local filesystem).
- Put all drives down
- Queue 30k files for archival. This creates 2 shards:
  - the first one with 25k files (full)
  - the second one with the remaining 5k files
- Delete the first shard referenced in the ArchiveQueueToTransferForUser object.
- Put one drive back up.
The tape drive then loops indefinitely between a short Start status and the Up status.
Important note
The fix assumes that a deleted shard had already been emptied. It should not take any action if a timeout occurs while checking whether the shard exists.
Edited by Julien Leduc