Modify 'Scheduler::getNextMount' not to get stuck on broken queues

We have had several issues related to missing shards on retrieve/archive queues:

There is a fix on the way for this particular error:

However, there was another weakness shown by these operational issues:

  • Selecting the next mount with 'Scheduler::getNextMount' is deterministic, given a set of inputs.
  • A failing queue will keep being selected for as long as it is the highest-priority queue.
  • This can starve the system and turns one broken queue into a single point of failure.
    • E.g., a single broken archive queue will cause all archivals to stall.

This needs to be fixed!

Some ideas for discussion:

  • Flag any queue that has been selected by a tape server. Do not allow it to be re-selected for N seconds.
  • Pseudo-randomize queue selection.
  • Other?
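
To make the first two ideas concrete for discussion, here is a minimal sketch that combines them: a queue that was just selected is skipped for N seconds, and ties among the remaining highest-priority queues are broken at random. Everything in it is hypothetical: `QueueCandidate`, `QueueSelector` and `pick` are illustrative names, not existing CTA types, and the sketch keeps the cooldown state in the memory of a single process, whereas the real scheduler would need to share it between tape servers.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <optional>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-in for a queue summary; not the real CTA type.
struct QueueCandidate {
  std::string name;
  int priority;  // higher value = more urgent
};

// Hypothetical selector: cooldown after selection plus random tie-breaking.
class QueueSelector {
public:
  using Clock = std::chrono::steady_clock;

  QueueSelector(std::chrono::seconds cooldown, unsigned seed)
    : m_cooldown(cooldown), m_rng(seed) {}

  // Returns the name of the chosen queue, or nothing if every queue is
  // still cooling down. The chosen queue is flagged with the time 'now'.
  std::optional<std::string> pick(const std::vector<QueueCandidate>& queues,
                                  Clock::time_point now) {
    std::vector<const QueueCandidate*> best;
    for (const auto& q : queues) {
      // Skip queues selected less than m_cooldown ago.
      auto it = m_lastSelected.find(q.name);
      if (it != m_lastSelected.end() && now - it->second < m_cooldown) continue;
      // Keep only the queues sharing the highest eligible priority.
      if (!best.empty() && q.priority < best.front()->priority) continue;
      if (!best.empty() && q.priority > best.front()->priority) best.clear();
      best.push_back(&q);
    }
    if (best.empty()) return std::nullopt;

    // Random choice among equal-priority candidates.
    std::uniform_int_distribution<std::size_t> dist(0, best.size() - 1);
    const QueueCandidate* chosen = best[dist(m_rng)];
    m_lastSelected[chosen->name] = now;
    return chosen->name;
  }

private:
  std::chrono::seconds m_cooldown;
  std::mt19937 m_rng;
  std::map<std::string, Clock::time_point> m_lastSelected;
};
```

With a 60 second cooldown and two queues, "broken" (priority 10) and "healthy" (priority 5), the first call picks "broken" and the next call within the cooldown window falls through to "healthy" instead of retrying the broken queue. Note that this sketch flags a queue on every selection, not only on failure; flagging only queues whose mount attempt failed would be a possible variant.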