Retrieve disk system backpressure
Problem to solve
When retrieving, CTA can fill up its target filesystem(s). This new feature will prevent retrieve sessions from committing more data than the target disk system can hold.
Intended users
This is a core functionality affecting users of retrieve in all scenarios (user-initiated and repack).
Further details
Proposal
In CTA, retrieves are queued by tape VID. Within this queue, requests from several users (repack included) targeting different disk systems can coexist. This new feature will add:
- Retrieve request tagging by targeted disk system. This will be a configurable dispatch system, created by the operator. We currently expect one disk system per instance, plus the repack buffer, so the list should have fewer than 10 elements.
  - We can dispatch by regexp.
  - The tags will be added to the retrieve request pointer in the retrieve queue shard, so that the decision of which requests to pop can be made without accessing the individual request objects.
- Disk system free space and committed space will be tracked in a new central object. One object per disk system will be created to reduce contention.
- The backstop algorithm is:
  - When retrieve requests are being popped, the space needed for each of them will be reserved in the central disk space tracker.
  - If no space is available for some requests, they will be skipped.
  - If no request can be popped due to lack of space, the queue for this VID will be paused for some time (15 minutes?).
  - The free space in the disk system can be updated regularly in the same process (typically once every minute).
  - As the space reporting in EOS has a delay, and as the user writing to the buffer is not controlled, we have to add a margin on top of this accounting (roughly, max input bandwidth * delay).
  - This mechanism is sufficient to prevent mounting a tape if there is no space: the tape session will be started, but since not a single job is popped, the empty mount protection will trigger and the session will end without any physical tape movement.
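To make the proposal concrete, here is a minimal sketch of the three pieces (regexp tagging, a per-disk-system space tracker, and the popping logic), written in Python purely for illustration. All names (`DiskSystem`, `tag_request`, `pop_requests`), the URLs, and the sizes are hypothetical and do not reflect actual CTA code. The pause duration and the periodic free-space refresh are left to the caller.

```python
import re
from dataclasses import dataclass

GB = 10**9


@dataclass
class DiskSystem:
    """Hypothetical tracker object, one per disk system."""
    name: str
    url_regexp: str          # operator-configured pattern matched on the destination URL
    free_bytes: int          # last free space reported by the disk system
    margin_bytes: int        # safety margin, roughly max input bandwidth * reporting delay
    reserved_bytes: int = 0  # space already committed to popped requests

    def try_reserve(self, size: int) -> bool:
        """Reserve `size` bytes if they still fit once the margin is accounted for."""
        if self.reserved_bytes + size + self.margin_bytes > self.free_bytes:
            return False
        self.reserved_bytes += size
        return True


def tag_request(dst_url, disk_systems):
    """Return the name of the first disk system whose regexp matches, else None."""
    for ds in disk_systems:
        if re.search(ds.url_regexp, dst_url):
            return ds.name
    return None


def pop_requests(queue, disk_systems):
    """Pop the requests that fit and say whether the VID queue should be paused.

    `queue` holds (tag, size_bytes) pairs, standing in for the tagged request
    pointers of one VID, so no request object is opened to make the decision.
    """
    by_name = {ds.name: ds for ds in disk_systems}
    popped, skipped = [], []
    for tag, size in queue:
        tracker = by_name.get(tag)
        # Assumption of this sketch: untagged requests are never held back.
        if tracker is None or tracker.try_reserve(size):
            popped.append((tag, size))
        else:
            skipped.append((tag, size))
    pause_queue = bool(queue) and not popped
    return popped, skipped, pause_queue


systems = [
    DiskSystem("eos_instance", r"^root://eos\.example\.org/", 100 * GB, 20 * GB),
    DiskSystem("repack_buffer", r"^root://repack\.example\.org/", 10 * GB, 5 * GB),
]
incoming = [
    ("root://eos.example.org//a", 50 * GB),
    ("root://eos.example.org//b", 40 * GB),
    ("root://repack.example.org//c", 4 * GB),
]
# Tagging happens at queueing time; popping only looks at the tags.
queue = [(tag_request(url, systems), size) for url, size in incoming]
# The second EOS request is skipped: 50 GB reserved + 40 GB + 20 GB margin > 100 GB free.
popped, skipped, pause = pop_requests(queue, systems)
```

In this sketch the pause flag is raised only when the queue is non-empty and nothing could be popped, which matches the case where the session would end through the empty mount protection.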