
Modify 'Scheduler::getNextMount' not to get stuck on broken queues

We have had several issues related to missing shards on retrieve/archive queues:

There is a fix on the way for this particular error: #500 (closed)

However, these operational issues exposed another weakness:

- Selecting the next mount with 'Scheduler::getNextMount' is deterministic for a given set of inputs.
- A failing queue will keep being selected for as long as it is the highest-priority queue.
- This can starve the system and create a single point of failure.
  - E.g. a single broken archive queue will cause all archivals to stall.
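
To make the failure mode concrete, here is a minimal Python sketch of the starvation described above. It is not the actual 'Scheduler::getNextMount' logic: the `Queue` class, its `priority` and `broken` fields, and the `pick_deterministic` and `simulate` helpers are all hypothetical names invented for illustration. The only assumption taken from the issue is that selection is a pure function of its inputs and prefers the highest-priority queue.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    # Hypothetical stand-in for an archive/retrieve queue.
    name: str
    priority: int
    broken: bool = False


def pick_deterministic(queues):
    """Always return the highest-priority queue (ties broken by name)."""
    return max(queues, key=lambda q: (q.priority, q.name))


def simulate(queues, rounds):
    """Count successful mounts per queue when the same inputs always give the same pick."""
    served = {q.name: 0 for q in queues}
    for _ in range(rounds):
        q = pick_deterministic(queues)
        if not q.broken:
            served[q.name] += 1
        # A broken queue fails the mount; nothing changes, so it is picked again.
    return served


queues = [Queue("archive-A", 10, broken=True), Queue("archive-B", 5)]
result = simulate(queues, rounds=100)
print(result)
```

In this toy model the healthy queue `archive-B` is never served, because the broken `archive-A` wins every round.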

This needs to be fixed!

Some ideas for discussion:

- Flag any queue that has been selected by a tape server, and do not allow it to be re-selected for N seconds.
- Pseudo-randomize queue selection.
- Other?
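
The first two ideas could be sketched roughly as follows. This is an illustrative Python model only, not a proposed patch: `CooldownSelector`, `pick_highest`, `pick_weighted` and the `priorities` mapping are hypothetical names, priorities are assumed to be positive numbers, and the time is passed in explicitly as `now` so the behaviour is easy to follow. With several tape servers, the "last selected" timestamps would presumably have to live in shared state rather than in one process's memory.

```python
import random


class CooldownSelector:
    """Hypothetical queue selector: N-second cooldown plus optional weighted randomness."""

    def __init__(self, cooldown_seconds, rng=None):
        self.cooldown_seconds = cooldown_seconds
        self.rng = rng or random.Random()
        self._last_selected = {}  # queue name -> time of last selection

    def eligible(self, priorities, now):
        """Return the queues whose cooldown has expired (or that were never selected)."""
        return {
            name: prio
            for name, prio in priorities.items()
            if now - self._last_selected.get(name, float("-inf")) >= self.cooldown_seconds
        }

    def pick_highest(self, priorities, now):
        """Idea 1: deterministic pick, but skip queues selected in the last N seconds."""
        candidates = self.eligible(priorities, now)
        if not candidates:
            return None
        name = max(candidates, key=lambda n: (candidates[n], n))
        self._last_selected[name] = now
        return name

    def pick_weighted(self, priorities, now):
        """Idea 2: pseudo-random pick, weighted by priority (assumed positive)."""
        candidates = self.eligible(priorities, now)
        if not candidates:
            return None
        names = sorted(candidates)
        weights = [candidates[n] for n in names]
        name = self.rng.choices(names, weights=weights, k=1)[0]
        self._last_selected[name] = now
        return name


selector = CooldownSelector(cooldown_seconds=30)
priorities = {"archive-A": 10, "archive-B": 5}
first = selector.pick_highest(priorities, now=0)    # highest priority wins
second = selector.pick_highest(priorities, now=10)  # archive-A is cooling down
third = selector.pick_highest(priorities, now=20)   # both are cooling down
fourth = selector.pick_highest(priorities, now=30)  # archive-A is eligible again
```

With the cooldown, a broken high-priority queue can block the system for at most N seconds per selection, after which lower-priority queues get a turn. The weighted variant trades strict priority ordering for a non-zero chance that any eligible queue is picked.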


Activity

Joao Afonso added Needs Discussion Object Store + 1 deleted label 1 year ago
Joao Afonso assigned to @afonso 1 year ago
Joao Afonso removed Object Store label 1 year ago
Joao Afonso removed Needs Discussion label 1 year ago

Joao Afonso @afonso · 1 year ago


We decided not to implement this for now, since the problems that caused the queues to fail have been fixed.

If we start seeing queue-related issues again, we can reopen.

Joao Afonso closed 1 year ago
Joao Afonso added cta: Scheduler label 8 months ago
Joao Afonso removed 1 deleted label 8 months ago
