[maintenance] Fix Queue Cleanup Runner Queue Reservation

Pablo Oliver Cortes requested to merge 1173-qcr-queue-reservation into main

See #1142 (closed) for the whole picture

Description

  • When a Tape's state is changed to REPACKING we go through an intermediate step, the REPACKING_PENDING state. When issuing the tape transition, the RetrieveQueue (if any) will be marked for cleanup, that is, the Cleanup flag is set in the queue's cleanupInfo struct. At this point we have 2 possible situations. If there are no user requests for that tape, the transition is immediate and no work is needed. If there are user requests, we must either re-queue them onto a different VID (if the files have more than one copy and those copies are available) or report the requests as failed back to the disk buffer (see the first sketch after this list). The Queue Cleanup Runner (QCR), executed as part of the maintenance process, is responsible for this task.

  • A queue will only be cleaned up by one QCR at a time. To achieve this goal we populate the cleanupInfo struct of the involved RetrieveQueueToTransfer; this struct contains a field recording the agent doing the cleanup, and populating that field means reserving the queue for that agent. Additionally, to allow cleanup after possible failures, we register the queue in the OwnedObjects list of the Maintenance Process running the cleanup of the queue. The garbageCollect method of the RetrieveQueue class has been implemented (it was not being used up until now); running the garbage collection of a retrieve queue means clearing the reservation struct (see the second sketch after this list).

    We also create a new, empty ToReportQueue to which we move the jobs that need to be reported as failed; this queue is also registered in the owned objects of the agent and its CleanupInfo struct is populated. We hold the lock on the root entry for the creation of the queue, the addition to ownership and the population of the CleanupInfo; this is necessary to avoid race conditions with the trimEmptyQueues function, which might be executed during the mount scheduling process or during the deletion of failed requests via cta-admin [1]. As we do not have a transaction concept in the object store, we have to take into account that a server can crash at any point. The order of the actions prevents any problems with the intermediate steps of the reservation:

    1. Queue creation. If we only create the empty queue it will either be cleaned up by trimEmptyQueues or reserved by another runner, depending on what happens first. For the second case to happen the sequence of events is: crash -> agent garbage collected -> ToTransfer queue's cleanup info is cleared -> another QCR reserves the queue (if the previous ToReportQueue still exists it will grab that one, otherwise it will create a new one).
    2. Adding to agent ownership: same situation as before, but the garbage collection of the agent will also clear the ToReport queue's cleanup info, ONLY IF the registered agent is the same as the one being garbage collected. The reservation only takes the ToTransfer queue's cleanup info into account, so it might happen that: GC of the ToTransfer queue -> a new reservation overwrites the reservation of the ToReport queue -> GC of the ToReport queue. This way we guarantee a consistent state of the reservation structs (although it is not critical, as the reservation only considers the ToTransfer one).

[1] Alternatively, we could modify the trimEmptyQueues function so that it only deletes the queue the caller is interested in deleting, rather than all empty queues.
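
To illustrate the per-request decision described in the first point above, here is a minimal, self-contained C++ sketch. All names in it (UserRequest, CleanupPlan, planQueueCleanup, vidIsAvailable) are hypothetical stand-ins and do not correspond to the actual CTA objectstore API.

```cpp
// Hypothetical sketch only: not the real CTA objectstore types or functions.
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct UserRequest {
  std::string requestId;
  std::vector<std::string> otherCopyVids;  // VIDs that hold other copies of the file
};

struct CleanupPlan {
  std::vector<std::pair<std::string, std::string>> requeue;  // (requestId, target VID)
  std::vector<std::string> reportAsFailed;                   // requests with no usable copy
};

// Decide what to do with each user request queued on a tape entering
// REPACKING_PENDING: requeue it on another available VID if possible,
// otherwise hand it to the ToReport queue so the failure reaches the disk buffer.
CleanupPlan planQueueCleanup(const std::vector<UserRequest>& requests,
                             const std::function<bool(const std::string&)>& vidIsAvailable) {
  CleanupPlan plan;
  for (const auto& req : requests) {
    std::optional<std::string> target;
    for (const auto& vid : req.otherCopyVids) {
      if (vidIsAvailable(vid)) {
        target = vid;
        break;
      }
    }
    if (target.has_value()) {
      plan.requeue.emplace_back(req.requestId, *target);
    } else {
      plan.reportAsFailed.push_back(req.requestId);
    }
  }
  return plan;
}
```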
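
The reservation order and the garbage-collection behaviour can be sketched as follows; QueueStub, AgentStub, reserveQueues and clearReservationIfOwnedBy are illustrative names only, and a local std::mutex stands in for the shared lock on the objectstore root entry.

```cpp
// Hypothetical sketch only: simplified stand-ins for the objectstore classes.
#include <mutex>
#include <set>
#include <string>

struct CleanupInfo {
  bool doCleanup = false;     // set when the tape transitions to REPACKING_PENDING
  std::string assignedAgent;  // empty string: the queue is not reserved by any QCR
};

struct QueueStub {
  std::string address;
  CleanupInfo cleanupInfo;
};

struct AgentStub {
  std::string address;
  std::set<std::string> ownedObjects;
};

std::mutex rootEntryLock;  // stands in for the lock on the objectstore root entry

// Reservation: performed while holding the root entry lock so that trimEmptyQueues
// (running during mount scheduling or cta-admin request deletion) cannot delete the
// freshly created, still empty ToReport queue between the steps. The ToReport queue
// is assumed to have been created under the same lock, just before this call.
void reserveQueues(QueueStub& toTransfer, QueueStub& toReport, AgentStub& maintenanceAgent) {
  std::lock_guard<std::mutex> lock(rootEntryLock);
  // Register both queues in the agent's owned objects for crash recovery.
  maintenanceAgent.ownedObjects.insert(toTransfer.address);
  maintenanceAgent.ownedObjects.insert(toReport.address);
  // Populate the reservation. Other runners only look at the ToTransfer queue's
  // cleanup info when deciding whether the queue is already taken.
  toTransfer.cleanupInfo.assignedAgent = maintenanceAgent.address;
  toReport.cleanupInfo.assignedAgent = maintenanceAgent.address;
}

// Garbage collection after a crash: clear the reservation so another QCR can pick
// the queue up. For the ToReport queue this is done only if the reservation still
// points at the agent being garbage collected.
void clearReservationIfOwnedBy(QueueStub& queue, const std::string& deadAgentAddress) {
  if (queue.cleanupInfo.assignedAgent == deadAgentAddress) {
    queue.cleanupInfo.assignedAgent.clear();
  }
}
```

Keeping the creation, the addition to ownership and the population of the CleanupInfo under a single root entry lock is what removes the window in which trimEmptyQueues could observe and delete the still empty ToReport queue.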

Other design constraints:

  • A RetrieveQueueToTransfer that has been marked for cleanup will be skipped by the drives when scheduling a mount.
  • A RetrieveQueueToReport that has been marked for cleanup will be skipped by the DiskReporter (both filters are sketched after this list).
  • The only agent allowed to operate on queues being cleaned up is the one that got registered for that task. The cleanup execution takes place right after the reservation; once finished, the ToTransfer queue is removed, as it no longer holds any objects, and we clear the CleanupInfo of the ToReportQueue (or delete it if we requeued all jobs) to allow the DiskReporter to operate on it and report the failures.
  • The job-moving operation is, from the QCR's perspective, synchronous and should be completed in one go. In case the process dies, the GarbageCollector will take care of clearing the reservation information from both queues so that another QCR can pick them up. If the process gets stuck, we have no way of detecting it, as the Maintenance Process has a dedicated thread to report the agent heartbeat and that one will go on forever. There is nothing in the code that can make the process get stuck, only locking operations, and for deadlocks we have a 4-minute timeout on the rados call.
  • If we managed to move all requests to a different user queue, the ToReport queue will be empty, so we must delete it.
  • getRetrieveQueuesCleanupInfo only returns the queues that need to be cleaned up.
  • A lock has been added in OStoreDB::getNextRetrieveJobsToReportBatch(). This is to prevent the deletion of the queue before it is marked for cleanup. The queue reservation holds the global lock -> creates the queue -> locks the queue; the problem arises when the getNext function locks the newly created queue before the reservation function does (this can happen because it fetches the root entry contents without locking). We should be safe, as both operations do not take much time and we do not have that many of these queues.
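
As a rough illustration of the skip behaviour in the first two constraints above, the sketch below filters out queues marked for cleanup before scheduling or reporting; RetrieveQueueSummary and selectUsableQueues are hypothetical names, not the real scheduler or DiskReporter code.

```cpp
// Hypothetical sketch only: a simplified queue summary with a cleanup flag.
#include <cstdint>
#include <string>
#include <vector>

struct RetrieveQueueSummary {
  std::string vid;
  bool markedForCleanup = false;  // set by the QCR reservation
  std::uint64_t jobCount = 0;
};

// Filter applied by both the mount scheduler (on ToTransfer queues) and the
// DiskReporter (on ToReport queues): queues reserved for cleanup are skipped,
// so only the agent holding the reservation keeps operating on them.
std::vector<RetrieveQueueSummary> selectUsableQueues(const std::vector<RetrieveQueueSummary>& all) {
  std::vector<RetrieveQueueSummary> usable;
  for (const auto& q : all) {
    if (q.markedForCleanup) {
      continue;  // skipped by drives scheduling a mount and by the DiskReporter
    }
    if (q.jobCount == 0) {
      continue;  // empty queues are left for trimEmptyQueues to remove
    }
    usable.push_back(q);
  }
  return usable;
}
```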

Checklist

  • Documentation reflects the changes made.
  • Merge Request title is clear, concise, and suitable as a changelog entry.

References

Closes #1173 (closed)

Edited by Pablo Oliver Cortes
