Modify 'Scheduler::getNextMount' so that it does not get stuck on broken queues
We have had several issues related to missing shards on retrieve/archive queues:
- https://gitlab.cern.ch/cta/operations/-/issues/1190
- https://gitlab.cern.ch/cta/operations/-/issues/1201
There is a fix on the way for this particular error:
However, there was another weakness shown by these operational issues:
- Selecting the next mount with 'Scheduler::getNextMount' is deterministic, given a set of inputs.
- A failing queue will keep being selected, as long as it is the highest priority queue.
- This can starve the system and create a single point of failure.
- E.g., a single broken archive queue will cause all archivals to stall.
This needs to be fixed!
Some ideas for discussion:
- Flag any queue that has been selected by a tape server. Do not allow it to be re-selected for N seconds.
- Pseudo-randomize queue selection.
- Other?
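
To make the first two ideas concrete for discussion, here is a minimal sketch that combines them. It is not based on the actual 'Scheduler::getNextMount' code: 'QueueCandidate', 'QueueSelector', 'pick' and the in-memory cooldown map are all hypothetical names, and a real implementation would presumably need to keep the "recently selected" flag somewhere shared between tape servers. The sketch skips any queue that was selected within the last N seconds, then picks randomly among the remaining queues that share the highest priority.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a candidate mount; not the real CTA structure.
struct QueueCandidate {
  std::string name;
  int priority;
};

using Clock = std::chrono::steady_clock;

// Hypothetical selector illustrating "cooldown + randomized tie-break".
class QueueSelector {
public:
  explicit QueueSelector(std::chrono::seconds cooldown,
                         unsigned seed = std::random_device{}())
    : m_cooldown(cooldown), m_rng(seed) {}

  // Returns the name of the chosen queue, or nullopt if every candidate
  // is still within its cooldown period.
  std::optional<std::string> pick(const std::vector<QueueCandidate>& candidates,
                                  Clock::time_point now) {
    std::vector<const QueueCandidate*> best;
    for (const auto& c : candidates) {
      // Idea 1: skip queues selected less than m_cooldown ago.
      auto it = m_lastSelected.find(c.name);
      if (it != m_lastSelected.end() && now - it->second < m_cooldown) {
        continue;
      }
      // Keep only the candidates with the highest priority seen so far.
      if (best.empty() || c.priority > best.front()->priority) {
        best.clear();
        best.push_back(&c);
      } else if (c.priority == best.front()->priority) {
        best.push_back(&c);
      }
    }
    if (best.empty()) {
      return std::nullopt;
    }
    // Idea 2: break ties between equal-priority queues at random.
    std::uniform_int_distribution<std::size_t> dist(0, best.size() - 1);
    const QueueCandidate* chosen = best[dist(m_rng)];
    m_lastSelected[chosen->name] = now;  // flag the queue as recently selected
    return chosen->name;
  }

private:
  std::chrono::seconds m_cooldown;
  std::unordered_map<std::string, Clock::time_point> m_lastSelected;
  std::mt19937 m_rng;
};
```

With a 60-second cooldown and two queues, a broken high-priority queue would be selected once, after which the lower-priority queue gets its turn until the cooldown expires. Note that this sketch only randomizes among queues of equal priority; whether randomization should also cross priority levels is an open question for the discussion.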