Repacking a tape when retrieve is faster than archival -> starving drives
I launched a 2-tape repack and it led to a lot of mounts/dismounts.
Here are the details of the queue evolution and the archive mounts. I highlighted the consumers and the starving drives that are dismounting:
This basically happens when the retrieve drive is slightly faster than the archiving ones:
- the queue reaches 500GB before the consumers are finished
- this triggers a new mount that consumes it
- when a drive with a mounted tape has no more work, it finds an empty queue, starves and dismounts
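
To put illustrative numbers on the race (the 500GB threshold is the one from the trace above; the drive speeds are hypothetical round figures, only chosen to be a few MB/s apart):

```text
retrieve at 360 MB/s, archive at 350 MB/s, threshold 500 GB:
  queue refill time:    500 000 MB / 360 MB/s ≈ 1389 s
  consumer batch time:  500 000 MB / 350 MB/s ≈ 1429 s
```

The refill wins by roughly 40 s, so the scheduler grants a new mount before the current consumer has finished its batch, and the slower drive is guaranteed to find an empty queue shortly after.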
It can basically happen even with a single tape being repacked: our tape drives have very similar read and write performance, which makes this phenomenon very likely.
The other side effect is that it increases the entropy of the tape files during repack, as the content of a single tape will be split across 2 separate tapes.
This is a distributed starvation issue, with plenty of solutions to choose from. The current phenomenon comes from the greedy algorithm of the consumers (the archive processes), which drain the full queue every time they gather jobs, and from the fact that they let the queue fill up without reacting before their plate is completely empty.
We are granting a new tape mount just because the retrieve process is slightly faster (a few MB/s is enough).
It would be good to simulate the behavior we currently have with various producer/consumer speeds and possible implementations.
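
A minimal sketch of such a simulation, modeling only the current behavior: one retrieve drive fills the queue, each mounted archive drive greedily grabs the whole queue as its batch, and a new mount is granted whenever the queue reaches the threshold. The tape size, threshold and drive speeds are illustrative assumptions, not measured values:

```python
def simulate(retrieve_mbps, archive_mbps, tape_gb=8000.0,
             mount_threshold_gb=500.0):
    """Simulate one repack in 1-second steps.

    Returns (mounts, dismounts) for the archive drives. A drive that
    finishes its batch greedily takes the full queue; if the queue is
    empty at that point it starves and dismounts (current behavior).
    """
    remaining = tape_gb   # GB left to read from the source tape
    queue = 0.0           # GB sitting in the archive queue
    batches = []          # per mounted archive drive: GB left in its batch
    mounts = dismounts = 0
    while remaining > 0 or queue > 0 or batches:
        # producer: the retrieve drive feeds the queue
        read = min(remaining, retrieve_mbps / 1000.0)
        remaining -= read
        queue += read
        # scheduler: queue over threshold -> grant a new archive mount,
        # even if an existing consumer is about to finish its batch
        if queue >= mount_threshold_gb:
            mounts += 1
            batches.append(queue)  # new drive greedily takes everything
            queue = 0.0
        # consumers: write the current batch; re-grab or starve when done
        still_mounted = []
        for work in batches:
            work -= archive_mbps / 1000.0
            if work > 0:
                still_mounted.append(work)
            elif queue > 0:
                still_mounted.append(queue)  # greedy: drain the full queue
                queue = 0.0
            else:
                dismounts += 1  # empty plate -> starve and dismount
        batches = still_mounted
        # tail of the repack: source done, leftover below threshold,
        # no drive mounted -> a final mount drains the rest
        if remaining <= 0 and queue > 0 and not batches:
            mounts += 1
            batches.append(queue)
            queue = 0.0
    return mounts, dismounts
```

With these simplifications, `simulate(360, 350)` (retrieve a few MB/s faster) grants extra mounts and produces starvation dismounts, while `simulate(350, 360)` completes with a single archive mount, which matches the observed asymmetry. Alternative implementations (e.g. bounded batch sizes, or a grace period before dismounting) could be dropped into the consumer loop to compare.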