Queues with cleanup heartbeat above zero are not being picked for cleanup
This relates to the following ops ticket: https://gitlab.cern.ch/cta/operations/-/issues/1009
We have observed that queues with a cleanup heartbeat value above zero are not being picked up for cleanup, as in this example from tape I73190:
{
"vid": "I73190",
"retrievequeueshards": [
{
"address": "RetrieveQueueShard-I73190-DriveProcess-I3600922-tpsrv076.cern.ch-27894-20230304-16:12:54-0-1058",
"shardjobscount": "1213",
"shardbytescount": "4178131604954",
"minfseq": "1393",
"maxfseq": "5245"
}
],
"prioritymap": [
{
"value": "150",
"count": "1213"
}
],
"minretrieverequestagemap": [
{
"value": "86400",
"count": "1213"
}
],
"mountpolicynamemap": [
{
"value": "ctacms",
"count": "1213"
}
],
"activityMap": [],
"retrievejobstotalsize": "4178131604954",
"retrievejobscount": "1213",
"oldestjobcreationtime": "1674587441",
"mapsrebuildcount": "6",
"maxshardsize": "25000",
"sleepForFreeSpaceSince": "0",
"diskSystemSleptFor": "",
"sleepTime": "0",
"youngestjobcreationtime": "1678085093",
"cleanupInfo": {
"doCleanup": true,
"assignedAgent": "Maintenance-tpsrv038.cern.ch-38326-20230223-17:16:39-0",
"heartbeat": "2"
}
}
The tape is stuck in REPACKING_PENDING and is not being picked up for cleanup, even though the heartbeat field has not been updated for a long time.
This seems to have been caused by the following code segment:
// Check if heartbeat has been updated, which means that another agent is still tracking it
if (rq.getQueueCleanupAssignedAgent().has_value()) {
if (rq.getQueueCleanupHeartbeat() != cleanupHeartBeatValue.has_value() ? cleanupHeartBeatValue.value() : 0) {
throw RetrieveQueueNotReservedForCleanup("Another agent is alive and cleaning up the queue. Skipping it.");
}
}
The precedence between the operators != and a?b:c was not taken into account. The != operator binds more tightly than the ternary conditional, so instead of evaluating the ternary first (as intended), the code first compares the queue heartbeat with the boolean result of has_value(). Because the heartbeat value is 2 and a boolean converts to 0 or 1, this comparison is always true, which causes the RetrieveQueueNotReservedForCleanup exception to be thrown and the queue to be skipped for cleanup.
In addition, we see that m_heartbeatCheck[queue.vid].heartbeat is only being updated here:
This value should be updated, internally, every time a new heartbeat value is detected in the queue object.