Queues with cleanup heartbeat above zero are not being picked for cleanup

This relates to the following ops ticket: https://gitlab.cern.ch/cta/operations/-/issues/1009


We have observed that queues with a cleanup heartbeat value above zero are not being picked up for cleanup. One example is the retrieve queue for tape I73190:

{
 "vid": "I73190",
 "retrievequeueshards": [
  {
   "address": "RetrieveQueueShard-I73190-DriveProcess-I3600922-tpsrv076.cern.ch-27894-20230304-16:12:54-0-1058",
   "shardjobscount": "1213",
   "shardbytescount": "4178131604954",
   "minfseq": "1393",
   "maxfseq": "5245"
  }
 ],
 "prioritymap": [
  {
   "value": "150",
   "count": "1213"
  }
 ],
 "minretrieverequestagemap": [
  {
   "value": "86400",
   "count": "1213"
  }
 ],
 "mountpolicynamemap": [
  {
   "value": "ctacms",
   "count": "1213"
  }
 ],
 "activityMap": [],
 "retrievejobstotalsize": "4178131604954",
 "retrievejobscount": "1213",
 "oldestjobcreationtime": "1674587441",
 "mapsrebuildcount": "6",
 "maxshardsize": "25000",
 "sleepForFreeSpaceSince": "0",
 "diskSystemSleptFor": "",
 "sleepTime": "0",
 "youngestjobcreationtime": "1678085093",
 "cleanupInfo": {
  "doCleanup": true,
  "assignedAgent": "Maintenance-tpsrv038.cern.ch-38326-20230223-17:16:39-0",
  "heartbeat": "2"
 }
}

The tape is stuck in REPACKING_PENDING and its queue is not being picked up for cleanup, even though the heartbeat field has not been updated for a long time.


This seems to have been caused by the following code segment:

// Check if heartbeat has been updated, which means that another agent is still tracking it
if (rq.getQueueCleanupAssignedAgent().has_value()) {
  if (rq.getQueueCleanupHeartbeat() != cleanupHeartBeatValue.has_value() ? cleanupHeartBeatValue.value() : 0) {
    throw RetrieveQueueNotReservedForCleanup("Another agent is alive and cleaning up the queue. Skipping it.");
  }
}

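The precedence of the operators != and ?: was not taken into account. Because != binds more tightly than the ternary conditional, the condition is parsed as (rq.getQueueCleanupHeartbeat() != cleanupHeartBeatValue.has_value()) ? cleanupHeartBeatValue.value() : 0, instead of evaluating the ternary first as intended. With a heartbeat value of 2, the comparison against the boolean has_value() is always true, so the condition no longer tests whether the heartbeat has changed. As a result, the RetrieveQueueNotReservedForCleanup exception is always thrown and the queue is skipped for cleanup.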
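
To make the precedence problem concrete, here is a minimal standalone sketch. The function names skipQueueBuggy and skipQueueFixed, and the use of std::optional<uint64_t> for the expected heartbeat, are assumptions made for illustration and are not taken from the CTA code. The first function mirrors the expression as written, the second adds the parentheses that were presumably intended.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for the check in the issue. Each returns true when
// the queue would be skipped because another agent is assumed to be alive.

// As written: parsed as (queueHeartbeat != expected.has_value()) ? expected.value() : 0
bool skipQueueBuggy(uint64_t queueHeartbeat, std::optional<uint64_t> expected) {
  const uint64_t result = queueHeartbeat != expected.has_value() ? expected.value() : 0;
  return result != 0;
}

// With explicit parentheses: compare the heartbeat to the expected value,
// falling back to 0 when no value was recorded (same as expected.value_or(0)).
bool skipQueueFixed(uint64_t queueHeartbeat, std::optional<uint64_t> expected) {
  return queueHeartbeat != (expected.has_value() ? expected.value() : 0);
}

// Small self-check covering the case from this issue (heartbeat stuck at 2).
void checkExamples() {
  assert(skipQueueBuggy(2, 2));   // unchanged heartbeat, yet the queue is skipped
  assert(!skipQueueFixed(2, 2));  // unchanged heartbeat, queue can be picked up
  assert(skipQueueFixed(3, 2));   // heartbeat moved, another agent is alive
}
```

In this sketch, a queue whose heartbeat is 2 and whose expected heartbeat is also 2 is skipped by the first form and accepted by the second, which matches the behaviour described above.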


In addition, we see that m_heartbeatCheck[queue.vid].heartbeat is only being updated here:

This value should be updated, internally, every time a new heartbeat value is detected in the queue object.
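
One possible shape for that internal update is sketched below. Everything here is hypothetical: the HeartbeatCheck fields, the HeartbeatTracker class, the observe method, and the timeout parameter are assumptions for illustration, and the real m_heartbeatCheck entry may look different. The idea is only that the stored heartbeat is refreshed whenever the value read from the queue object differs from the stored one.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record kept per queue (keyed by vid).
struct HeartbeatCheck {
  std::string agent;
  uint64_t heartbeat = 0;
  uint64_t lastUpdateTime = 0;  // seconds, when a new heartbeat was last seen
};

class HeartbeatTracker {
public:
  // Record the agent and heartbeat read from the queue object at time `now`.
  // Returns true when the agent looks stale, i.e. the same agent and heartbeat
  // have been seen for at least `timeoutSecs` (an assumed policy).
  bool observe(const std::string& vid, const std::string& agent,
               uint64_t heartbeat, uint64_t now, uint64_t timeoutSecs) {
    auto it = m_heartbeatCheck.find(vid);
    if (it == m_heartbeatCheck.end() || it->second.agent != agent ||
        it->second.heartbeat != heartbeat) {
      // New queue, new agent, or new heartbeat value: refresh the record.
      HeartbeatCheck& entry = m_heartbeatCheck[vid];
      entry.agent = agent;
      entry.heartbeat = heartbeat;
      entry.lastUpdateTime = now;
      return false;
    }
    return now - it->second.lastUpdateTime >= timeoutSecs;
  }

  // Last heartbeat value stored for this vid (throws if the vid is unknown).
  uint64_t trackedHeartbeat(const std::string& vid) const {
    return m_heartbeatCheck.at(vid).heartbeat;
  }

private:
  std::map<std::string, HeartbeatCheck> m_heartbeatCheck;
};
```

With this layout, a queue like the one above (heartbeat stuck at 2) would be reported as stale once the assumed timeout has elapsed, while any change in the heartbeat resets the timer.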

Edited by Joao Afonso