Fix "trying to unlock an unlocked lock" error
Reference:
As shown in https://gitlab.cern.ch/cta/operations/-/issues/1132, the maintenance process occasionally tries to unlock an already unlocked lock, which ends up throwing an exception that crashes the whole process:
Example:
Jul 27 13:58:40 tpsrv440.cern.ch cta-taped: LVL="ERROR" PID="20938" TID="20938" MSG="In MaintenanceHandler::exceptionThrowingRunChild(): received an exception. Backtrace follows." SubprocessName="maintenanceHandler" Message="In ScopedLock::checkLocked: trying to unlock an unlocked lock"
Jul 27 13:58:40 tpsrv440.cern.ch cta-taped: LVL="INFO" PID="20938" TID="20938" MSG="Stack trace" SubprocessName="maintenanceHandler" traceFrameNumber="0" traceFrame="/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7f38c12e684b]"
Jul 27 13:58:40 tpsrv440.cern.ch cta-taped: LVL="ERROR" PID="20938" TID="20938" MSG="In MaintenanceHandler::exceptionThrowingRunChild(): received an exception. Backtrace follows." SubprocessName="maintenanceHandler" Message="In ScopedLock::checkLocked: trying to unlock an unlocked lock"
[1690459120.804368000] Jul 27 13:58:40.804368 tpsrv440.cern.ch cta-taped: LVL="INFO" PID="20938" TID="20938" MSG="Selected the vid of the job to be queued for retrieve request." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-I4550844-tpsrv012.cern.ch-2170-20230727-09:40:31-0" gcAgentAddress="Maintenance-tpsrv440.cern.ch-20938-20230727-13:56:37-0" objectAddress="RepackSubRequest-Maintenance-tpsrv485.cern.ch-64896-20230726-17:16:15-0-2876" fileId="1636211334" copyNb="1" tapeVid="I44469" fSeq="2850"
[1690459120.824409000] Jul 27 13:58:40.824409 tpsrv440.cern.ch cta-taped: LVL="INFO" PID="20938" TID="20938" MSG="In Helpers::getLockedAndFetchedQueue<RetrieveQueue>(): Successfully found and locked a retrieve queue." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-I4550844-tpsrv012.cern.ch-2170-20230727-09:40:31-0" gcAgentAddress="Maintenance-tpsrv440.cern.ch-20938-20230727-13:56:37-0" objectAddress="RepackSubRequest-Maintenance-tpsrv485.cern.ch-64896-20230726-17:16:15-0-2876" attemptNb="1" queueName="RepackRequest-Frontend-ctaproductionfrontend02.cern.ch-21269-20230726-15:51:59-0-59330" queueType="JobsToReportToRepackForSuccess" queueObject="RetrieveQueueToReportToRepackForSuccess-RepackRequest-Frontend-ctaproductionfrontend02.cern.ch-21269-20230726-15:51:59-0-59330-Maintenance-tpsrv440.cern.ch-20938-20230727-13:56:37-0-0" rootFetchNoLockTime="0.000336" rootRelockExclusiveTime="0.009826" rootRefetchTime="0.000330" addOrGetQueueandCommitTime="0.006732" rootUnlockExclusiveTime="0.000633" queueLockTime="0.000923" queueFetchTime="0.000301"
[1690459120.833261000] Jul 27 13:58:40.833261 tpsrv440.cern.ch cta-taped: LVL="INFO" PID="20938" TID="20938" MSG="In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-I4550844-tpsrv012.cern.ch-2170-20230727-09:40:31-0" gcAgentAddress="Maintenance-tpsrv440.cern.ch-20938-20230727-13:56:37-0" objectAddress="RepackSubRequest-Maintenance-tpsrv485.cern.ch-64896-20230726-17:16:15-0-2876" C="RetrieveQueueToReportToRepackForSuccess" tapeVid="RepackRequest-Frontend-ctaproductionfrontend02.cern.ch-21269-20230726-15:51:59-0-59330" containerAddress="RetrieveQueueToReportToRepackForSuccess-RepackRequest-Frontend-ctaproductionfrontend02.cern.ch-21269-20230726-15:51:59-0-59330-Maintenance-tpsrv440.cern.ch-20938-20230727-13:56:37-0-0" queueFilesBefore="0" queueBytesBefore="0" queueFilesAfter="1" queueBytesAfter="46720112" queueLockFetchTime="0.019207" queueProcessAndCommitTime="0.001822" asyncUpdateLaunchTime="0.000162" asyncUpdateCompletionTime="0.003632" requestsUpdatingTime="0.000044" queueUnlockTime="0.000751"
[1690459120.834224000] Jul 27 13:58:40.834224 tpsrv440.cern.ch cta-taped: LVL="INFO" PID="20938" TID="20938" MSG="In RetrieveRequest::garbageCollect(): requeued the repack retrieve request." SubprocessName="maintenanceHandler" agentAddress="DriveProcess-I4550844-tpsrv012.cern.ch-2170-20230727-09:40:31-0" gcAgentAddress="Maintenance-tpsrv440.cern.ch-20938-20230727-13:56:37-0" objectAddress="RepackSubRequest-Maintenance-tpsrv485.cern.ch-64896-20230726-17:16:15-0-2876" jobObject="RepackSubRequest-Maintenance-tpsrv485.cern.ch-64896-20230726-17:16:15-0-2876" fileId="1636211334" queueObject="RepackRequest-Frontend-ctaproductionfrontend02.cern.ch-21269-20230726-15:51:59-0-59330" copynb="1" tapeVid="I44469" sorterFlushingTime="0.028450"
[1690459120.836889000] Jul 27 13:58:40.836889 tpsrv440.cern.ch cta-taped: LVL="ERROR" PID="20938" TID="20938" MSG="In MaintenanceHandler::exceptionThrowingRunChild(): received an exception. Backtrace follows." SubprocessName="maintenanceHandler" Message="In ScopedLock::checkLocked: trying to unlock an unlocked lock"
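For reference, here is a minimal, self-contained sketch of the failure mode. It is not CTA's actual ScopedLock implementation (the class and method names are illustrative; only the error message mirrors the log above): an RAII-style lock wrapper whose unlock path throws if the lock has already been released, so any code path that releases the same lock twice turns a harmless no-op into a fatal exception.

```cpp
#include <iostream>
#include <stdexcept>

// Illustrative stand-in for an objectstore scoped lock; the names here are
// hypothetical, only the error message mirrors the real log line.
class ScopedLockSketch {
public:
  void lock() { m_locked = true; }

  void release() {
    checkLocked();      // throws if the lock is already released
    m_locked = false;
  }

private:
  void checkLocked() const {
    if (!m_locked)
      throw std::runtime_error(
          "In ScopedLock::checkLocked: trying to unlock an unlocked lock");
  }
  bool m_locked = false;
};

int main() {
  ScopedLockSketch sl;
  sl.lock();
  sl.release();     // first release: fine
  try {
    sl.release();   // second release of the same lock: throws, and if this
                    // escapes the maintenance handler the whole process dies
  } catch (const std::exception &ex) {
    std::cerr << ex.what() << '\n';
  }
  return 0;
}
```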
Since the lock is already unlocked - which is the desired end state - this exception should be caught instead of being allowed to crash the entire process. However, it is still worth checking whether the current behaviour is intentional before changing it.
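A minimal sketch of that catch-and-log behaviour follows. Everything except the NotLocked type name is illustrative scaffolding so the sketch compiles on its own; the real handler would use the CTA logging context rather than stderr, and the garbage-collection pass is reduced to a placeholder.

```cpp
#include <iostream>
#include <stdexcept>

// Stand-in for objectstore::ObjectOpsBase::NotLocked so the sketch is
// self-contained; in CTA this would be the real exception type.
namespace objectstore {
struct NotLocked : std::runtime_error {
  using std::runtime_error::runtime_error;
};
} // namespace objectstore

// Placeholder for one garbage-collection pass of the maintenance process.
void runGarbageCollectionPass() {
  throw objectstore::NotLocked(
      "In ScopedLock::checkLocked: trying to unlock an unlocked lock");
}

int main() {
  try {
    runGarbageCollectionPass();
  } catch (const objectstore::NotLocked &ex) {
    // The lock is already released, which is the state we wanted anyway:
    // log it as an error and keep the maintenance process alive.
    std::cerr << "LVL=\"ERROR\" MSG=\"Caught NotLocked while unlocking an "
                 "already unlocked lock\" Message=\"" << ex.what() << "\"\n";
  }
  // ... the maintenance loop would carry on with its next pass here ...
  return 0;
}
```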
Tasks:
- Confirm that there is no risk/problem in unlocking an already unlocked queue.
- Catch the exceptions of type objectstore::ObjectOpsBase::NotLocked, without crashing the process, and log them as an error.
- Find out why the maintenance process is trying to unlock an unlocked lock and fix the underlying issue (see the sketch after this list for one possible direction).
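For the last task, the right fix depends on where the double release actually comes from, which still needs to be established. Purely as a hedged sketch with hypothetical names (none of these come from the CTA code base), one possible direction is to give cleanup paths a tolerant release that is an observable no-op on an already released lock, while keeping the strict release for paths where a double unlock is a real bug:

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical extension of the scoped-lock sketch above; names are illustrative.
class ScopedLockSketch {
public:
  void lock() { m_locked = true; }

  // Strict release: still throws on a double unlock, preserving the invariant.
  void release() {
    if (!m_locked)
      throw std::runtime_error(
          "In ScopedLock::checkLocked: trying to unlock an unlocked lock");
    m_locked = false;
  }

  // Tolerant release for cleanup paths that may legitimately run after the
  // lock was already handed back; returns whether it actually released
  // anything, so the caller can log the unexpected case instead of crashing.
  bool releaseIfLocked() {
    if (!m_locked) return false;
    m_locked = false;
    return true;
  }

private:
  bool m_locked = false;
};

int main() {
  ScopedLockSketch sl;
  sl.lock();
  sl.release();
  if (!sl.releaseIfLocked())
    std::cout << "lock was already released; nothing to do\n";
  return 0;
}
```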