
Safely handle empty shards in object store

As discussed in https://gitlab.cern.ch/cta/operations/-/issues/1190, empty shard objects are causing severe problems (queues unmounted, files failing to archive).

When it encounters a missing shard, the object store should not abort the pop request. It should simply log the missing shard as an ERROR and continue popping the remaining objects.
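The intended behaviour can be sketched as follows. This is a minimal illustration in Python, not the CTA objectstore code: `ShardMissing`, `pop_jobs` and the `fetch_shard` callback are hypothetical names introduced here, and the real queue and shard classes will look different.

```python
import logging

logger = logging.getLogger("objectstore_sketch")


class ShardMissing(Exception):
    """Hypothetical error: a shard referenced by the queue does not exist."""


def pop_jobs(shard_ids, fetch_shard, max_jobs):
    """Pop up to max_jobs jobs, skipping shards that no longer exist.

    fetch_shard is a hypothetical callback that returns the list of jobs
    held by a shard, or raises ShardMissing if the shard object is gone.
    """
    popped = []
    missing = []
    for shard_id in shard_ids:
        if len(popped) >= max_jobs:
            break
        try:
            jobs = fetch_shard(shard_id)
        except ShardMissing:
            # Log and carry on instead of aborting the whole pop.
            logger.error("Shard %s is referenced but missing, skipping it", shard_id)
            missing.append(shard_id)
            continue
        popped.extend(jobs[: max_jobs - len(popped)])
    return popped, missing
```

With a queue that references two shards, the first of which has been deleted, the call still returns the jobs of the second shard and reports the first as missing, so the caller can decide what to do with the dangling reference.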

This is a very important fix, to be deployed ASAP!

Steps to reproduce in CI

  1. Start a CI instance with a local objectstore (this makes it easy to manipulate objects in the local filesystem).
  2. Put all drives down.
  3. Queue 30k files for archival. This will create 2 shards:
    • the first one with 25k files (full)
    • the second one with the remaining 5k files
  4. Delete the first shard referenced in the ArchiveQueueToTransferForUser object.
  5. Put one drive back up.

The tape drive will loop indefinitely between a short Start status and Up status.

Important note

The fix assumes that a deleted shard had already been emptied. It must not take any action if a timeout occurs while checking whether the shard exists: a timeout does not prove that the shard is missing.
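One way to keep the two cases apart is a three-state result for the existence check. The sketch below is an illustration under assumptions, not the actual fix: `ShardStatus`, `check_shard`, `may_skip_shard` and the `exists` callback are hypothetical names, and the backend is assumed to signal a timeout by raising Python's built-in `TimeoutError`.

```python
from enum import Enum


class ShardStatus(Enum):
    PRESENT = "present"
    MISSING = "missing"
    UNKNOWN = "unknown"  # the check timed out, so nothing can be concluded


def check_shard(exists, shard_id):
    """Classify a shard using a hypothetical exists(shard_id) -> bool callback.

    A timeout is reported as UNKNOWN and never as MISSING.
    """
    try:
        return ShardStatus.PRESENT if exists(shard_id) else ShardStatus.MISSING
    except TimeoutError:
        return ShardStatus.UNKNOWN


def may_skip_shard(status):
    """Only a confirmed missing shard may be skipped."""
    return status is ShardStatus.MISSING
```

With this split, a shard whose check timed out is left untouched and can be retried later, while only a shard confirmed as missing is logged and skipped.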

Edited by Julien Leduc