Postgres Scheduler DB - stops responding after processing few hundred thousand files and makes drive session timeout
TPSRV01-0
waits popping new task from the queue - waits until some apprear --> after more than 10 min the session times out
{"epoch_time":1734228780.841477587,"local_time":"2024-12-15T03:13:00+0100","hostname":"tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":45434,"tid":45479,"message":"Normal flush because thresho
lds was reached","drive_name":"VDSTK01","instance":"CI","sched_backend":"postgres","thread":"TapeWrite","tapeDrive":"VDSTK01","tapeVid":"V00101","mountId":"2901","vo":"vo","tapePool":"ctasystest","media
Type":"LTO8","logicalLibrary":"VDSTK01","mountType":"ArchiveForUser","vendor":"vendor","capacityInBytes":12000000000000,"files":200,"bytes":25800,"flushTime":0.000456}
{"epoch_time":1734229693.552906466,"local_time":"2024-12-15T03:28:13+0100","hostname":"tpsrv01-0","program":"cta-taped","log_level":"WARN","pid":453,"tid":453,"message":"In DriveHandler::processTimeout(
): Killed subprocess.","drive_name":"VDSTK01","instance":"CI","sched_backend":"postgres","tapeVid":"V00101","mountType":"ArchiveForUser","mountId":"2901","volReqId":"2901","vendor":"vendor","vo":"vo","m
ediaType":"LTO8","tapePool":"ctasystest","logicalLibrary":"VDSTK01","capacityInBytes":"12000000000000","mountAttempted":"1","Error_diskOpenForRead":"4200","Info_fileSkipped":"4200","wasTapeMounted":"1",
"mountTime":"0.02804","positionTime":"0.002867","waitInstructionsTime":"6.133699","waitFreeMemoryTime":"0.0","waitDataTime":"0.0230019999999968","waitReportingTime":"0.434696999999991","checksumingTime"
:"0.00646900000000019","readWriteTime":"13.616332","flushTime":"0.021448","unloadTime":"0.0","unmountTime":"0.0","encryptionControlTime":"0.002798","transferTime":"20.235647","totalTime":"916.286811","d
eliveryTime":"916.28681","drainingTime":"0.0","dataVolume":"490200","filesCount":"8000","headerVolume":"3840000","payloadTransferSpeedMBps":"0.000534985327863679","driveTransferSpeedMBps":"0.00472581286
559629","tapeDrive":"VDSTK01","SessionState":"Running","SessionType":"Archive","TimeoutType":"DataMovement","SessionTypeWhenTimeoutDecided":"Archive","SessionStateWhenTimeoutDecided":"Running","LastData
MovementTime":4752412,"LastHeartbeatTime":4753298,"LastStateChangeTime":4752382,"Now":4753312,"ThisTimeout":4753312,"BeforeDataMovementTimeout_s":-0.003075964,"BeforeHeartbeatTimeout_s":46.158075218,"Su
bprocessId":45434}
TPSRV02-0
the second drive hit end of tape - and requeued the jobs in the queue it owned.
6456755
{"epoch_time":1734228773.155811912,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"Created tasks for mi
6456755 grating a file","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MigrationTaskInjector","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":
6456755 "ctasystest","fileId":4296115814,"fSeq":80199,"path":"root://ctaeos.stress//eos/ctaeos/preprod/8a97b9f6-0cc1-460e-8156-84952b0887c7/57/test57008516_1734225078584923807?eos.lfn=fxid:118675"}
6456756
{"epoch_time":1734228773.155865605,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"Push a task into the
6456756 DiskReadThreadPool","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":"ctasy
6456756 stest","threadID":9}
6456757
{"epoch_time":1734228773.155938679,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"Created tasks for mi
6456757 grating a file","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MigrationTaskInjector","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":
6456757 "ctasystest","fileId":4296115816,"fSeq":80200,"path":"root://ctaeos.stress//eos/ctaeos/preprod/8a97b9f6-0cc1-460e-8156-84952b0887c7/57/test57008517_1734225078584923807?eos.lfn=fxid:118674"}
6456758
{"epoch_time":1734228773.155974834,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"Push a task into the
6456758 DiskReadThreadPool","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":"ctasy
6456758 stest","threadID":9}
6456759
{"epoch_time":1734228773.156029040,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"Finished creating ta
6456759 sks for migrating","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MigrationTaskInjector","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePoo
6456759 l":"ctasystest","numberOfFiles":4000}
6456760
{"epoch_time":1734228773.164779331,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"In MigrationTaskInje
6456760 ctor::WorkerThread::run(): a task failed, indicating finish of run","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MigrationTaskInjector","tapeDrive":"VDSTK02","tape
6456760 Vid":"V00158","mountId":"2881","vo":"vo","tapePool":"ctasystest"}
6456761
{"epoch_time":1734228773.164864052,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"Finishing MigrationT
6456761 askInjector thread","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MigrationTaskInjector","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePo
6456761 ol":"ctasystest"}
6456762
{"epoch_time":1734228773.164919695,"local_time":"2024-12-15T03:12:53+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44905,"message":"In MigrationTaskInje
6456762 ctor::WorkerThread::run(): popping extra request","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MigrationTaskInjector","tapeDrive":"VDSTK02","tapeVid":"V00158","mou
6456762 ntId":"2881","vo":"vo","tapePool":"ctasystest","lastCall":true}
6456781
{"epoch_time":1734228777.329862665,"local_time":"2024-12-15T03:12:57+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44893,"message":"Finishing of DiskRea
6456781 dWorkerThread","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"DiskRead","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":"ctasystest","
6456781 threadID":5,"threadReadWriteTime":0.915636,"threadWaitFreeMemoryTime":9.126846,"threadCheckingErrorTime":0.000577000000000006,"threadOpeningTime":1.501332,"threadTransferTime":11.546821,"threadT
6456781 otalTime":20.011764,"threadDataVolume":69402,"threadFileCount":538,"threadGlobalPayloadTransferSpeedMBps":0.00346806008705679,"threadAverageDiskPerformanceMBps":0.00601048548340708,"threadOpenRW
6456781 CloseToTransferTimeRatio":0.209318911239726}
6456782
{"epoch_time":1734228777.330026882,"local_time":"2024-12-15T03:12:57+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44893,"message":"Signalled to task in
6456782 jector the end of disk read threads","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"DiskRead","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","ta
6456782 pePool":"ctasystest","threadID":5}
6456783
{"epoch_time":1734228777.330139508,"local_time":"2024-12-15T03:12:57+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44893,"message":"All the DiskReadWork
6456783 erThreads have completed","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":"
6456783 ctasystest","threadID":9,"poolReadWriteTime":10.081437,"poolWaitFreeMemoryTime":90.920163,"poolCheckingErrorTime":0.00582400000000006,"poolOpeningTime":15.380176,"poolTransferTime":116.390445,"p
6456783 oolRealTime":23.135654,"poolFileCount":5370,"poolDataVolume":692730,"poolGlobalPayloadTransferSpeedMBps":0.0299420971631059,"poolAverageDiskPerformanceMBps":0.00595177722707393,"poolOpenRWCloseT
6456783 oTransferTimeRatio":0.21876033724246}
6456784
{"epoch_time":1734228777.468396202,"local_time":"2024-12-15T03:12:57+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44901,"message":"In TapeWriteSingleTh
6456784 read::run(): Requeued all remaining unprocessed task jobs of the failed queue.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"TapeWrite","tapeDrive":"VDSTK02","tape
6456784 Vid":"V00158","mountId":"2881","vo":"vo","tapePool":"ctasystest","requeuedTaskQueueJobs":6028}
6456785
{"epoch_time":1734228777.468771327,"local_time":"2024-12-15T03:12:57+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44901,"message":"Tape thread complete
6456785 for writing","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"TapeWrite","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"2881","vo":"vo","tapePool":"ctasystest","
6456785 requeuedTaskQueueJobs":6028,"ErrorMessage":"End of migration due to tape full","type":"write","mountTime":0.022841,"positionTime":0.001772,"waitInstructionsTime":5.52636999999997,"checksumingTim
6456785 e":0.00707800000000038,"readWriteTime":11.488171,"waitDataTime":0.0185509999999995,"waitReportingTime":0.321236000000002,"flushTime":0.014457,"unloadTime":0.004482,"unmountTime":1.020282,"encryp
6456785 tionControlTime":0.003236,"transferTime":17.3758629999999,"totalTime":20.149504,"dataVolume":667188,"headerVolume":2866560,"files":5972,"payloadTransferSpeedMBps":0.0331118820592308,"driveTransf
6456785 erSpeedMBps":0.175376426139323,"status":"error"}
6456786
{"epoch_time":1734228777.469024758,"local_time":"2024-12-15T03:12:57+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44903,"message":"Reported end of sess
6456786 ion with error to client","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","vo":"vo"}
6456787
{"epoch_time":1734228778.521468798,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":44849,"tid":44849,"message":"MigrationMemoryManag
6456787 er destruction : all memory blocks have been deleted","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","tapeVid":"V00158","mountId":"
6456787 2881","vo":"vo","tapePool":"ctasystest"}
then it started a new session and it looked for a mount:
6456796
{"epoch_time":1734228778.546470894,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":453,"tid":453,"message":"Subprocess handler will
6456796 fork","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","tapeVid":"V00158","mountType":"ArchiveForUser","mountId":"2881","volReqId":"2881","vendor":"vendor","vo":"vo","mediaType"
6456796 :"LTO8","tapePool":"ctasystest","logicalLibrary":"VDSTK02","capacityInBytes":"12000000000000","mountAttempted":"1","Error_diskOpenForRead":"800","Info_fileSkipped":"800","Info_tapeFilledUp":"1",
6456796 "wasTapeMounted":"1","mountTime":"0.022841","positionTime":"0.001772","waitInstructionsTime":"5.52636999999997","waitFreeMemoryTime":"0.0","waitDataTime":"0.0185509999999995","waitReportingTime"
6456796 :"0.321236000000002","checksumingTime":"0.00707800000000038","readWriteTime":"11.488171","flushTime":"0.014457","unloadTime":"0.004482","unmountTime":"1.020282","encryptionControlTime":"0.003236
6456796 ","transferTime":"17.3758629999999","totalTime":"20.698455","deliveryTime":"20.698452","drainingTime":"0.0","dataVolume":"667188","filesCount":"5972","headerVolume":"2866560","payloadTransferSpe
6456796 edMBps":"0.0322337101972104","driveTransferSpeedMBps":"0.17072520630163","status":"success","SubprocessName":"drive:VDSTK02"}
6456797
{"epoch_time":1734228778.548443254,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"In child process. Ru
6456797 nning child.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","tapeVid":"V00158","mountType":"ArchiveForUser","mountId":"2881","volReqId":"2881","vendor":"vendor","vo":"vo","me
6456797 diaType":"LTO8","tapePool":"ctasystest","logicalLibrary":"VDSTK02","capacityInBytes":"12000000000000","mountAttempted":"1","Error_diskOpenForRead":"800","Info_fileSkipped":"800","Info_tapeFilled
6456797 Up":"1","wasTapeMounted":"1","mountTime":"0.022841","positionTime":"0.001772","waitInstructionsTime":"5.52636999999997","waitFreeMemoryTime":"0.0","waitDataTime":"0.0185509999999995","waitReport
6456797 ingTime":"0.321236000000002","checksumingTime":"0.00707800000000038","readWriteTime":"11.488171","flushTime":"0.014457","unloadTime":"0.004482","unmountTime":"1.020282","encryptionControlTime":"
6456797 0.003236","transferTime":"17.3758629999999","totalTime":"20.698455","deliveryTime":"20.698452","drainingTime":"0.0","dataVolume":"667188","filesCount":"5972","headerVolume":"2866560","payloadTra
6456797 nsferSpeedMBps":"0.0322337101972104","driveTransferSpeedMBps":"0.17072520630163","status":"success","SubprocessName":"drive:VDSTK02"}
6456798
{"epoch_time":1734228778.848612903,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"In Scheduler::ping()
6456798 : success.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","tapeVid":"V00158","mountType":"ArchiveForUser","mountId":"2881","volReqId":"2881","vendor":"vendor","vo":"vo","medi
6456798 aType":"LTO8","tapePool":"ctasystest","logicalLibrary":"VDSTK02","capacityInBytes":"12000000000000","mountAttempted":"1","Error_diskOpenForRead":"800","Info_fileSkipped":"800","Info_tapeFilledUp
6456798 ":"1","wasTapeMounted":"1","mountTime":"0.022841","positionTime":"0.001772","waitInstructionsTime":"5.52636999999997","waitFreeMemoryTime":"0.0","waitDataTime":"0.0185509999999995","waitReportin
6456798 gTime":"0.321236000000002","checksumingTime":"0.00707800000000038","readWriteTime":"11.488171","flushTime":"0.014457","unloadTime":"0.004482","unmountTime":"1.020282","encryptionControlTime":"0.
6456798 003236","transferTime":"17.3758629999999","totalTime":"20.698455","deliveryTime":"20.698452","drainingTime":"0.0","dataVolume":"667188","filesCount":"5972","headerVolume":"2866560","payloadTrans
6456798 ferSpeedMBps":"0.0322337101972104","driveTransferSpeedMBps":"0.17072520630163","status":"success","SubprocessName":"drive:VDSTK02","catalogueTime":0.251307,"schedulerDbTime":0.04079,"checkEnviro
6456798 nmentVariablesTime":4.9e-05}
6456799
{"epoch_time":1734228778.849166254,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"In DriveHandler::res
6456799 etLogParams(): Watchdog log parameters cleared in child process.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","SubprocessName":"drive:VDSTK02"}
6456800
{"epoch_time":1734228778.849291892,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":453,"tid":453,"message":"In DriveHandler::process
6456800 Logs(): Watchdog log parameters cleared in parent process.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres"}
6456801
{"epoch_time":1734228778.849441768,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"DataTransferSession
6456801 made effective raw I/O capabilty to the tape","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","capabilities":"cap_sys_rawio=ep"}
6456802
{"epoch_time":1734228778.859990655,"local_time":"2024-12-15T03:12:58+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"In TapeDrivesCatalog
6456802 ueState::updateDriveStatus(): Updating drive status.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","next_status":"UP","update_dri
6456802 veName":"VDSTK02","update_host":"tpsrv02-0","update_logicalLibrary":"VDSTK02","update_logicalLibraryDisabled":null,"update_sessionId":null,"update_bytesTransferedInSession":null,"update_filesTra
6456802 nsferedInSession":null,"update_sessionStartTime":null,"update_sessionElapsedTime":null,"update_mountStartTime":null,"update_transferStartTime":null,"update_unloadStartTime":null,"update_unmountS
6456802 tartTime":null,"update_drainingStartTime":null,"update_downOrUpStartTime":1734228778,"update_probeStartTime":null,"update_cleanupStartTime":null,"update_startStartTime":null,"update_shutdownTime
6456802 ":null,"update_mountType":"NO_MOUNT","update_driveStatus":"2","update_desiredUp":false,"update_desiredForceDown":false,"update_reasonUpDown":null,"update_currentVid":"","update_ctaVersion":null,
6456802 "update_currentPriority":null,"update_currentActivity":"","update_currentTapePool":"","update_nextMountType":"NO_MOUNT","update_nextVid":null,"update_nextTapePool":null,"update_nextPriority":nul
6456802 l,"update_nextActivity":null,"update_devFileName":null,"update_rawLibrarySlot":null,"update_currentVo":"","update_nextVo":null,"update_diskSystemName":null,"update_reservedBytes":null,"update_re
6456802 servationSessionId":null,"update_physicalLibraryName":null,"update_userComment":null,"update_creationLog":null,"update_lastModificationLog":"(username=NO_USER host=tpsrv02-0 time=1734228778)","u
6456802 pdate_physicalLibraryDisabled":null}
after getting the info from the catalogue suddenly all paused until the fetchMountInfo returned:
6456803
{"epoch_time":1734229694.490058644,"local_time":"2024-12-15T03:28:14+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"In RelationalDB::fet
6456803 chMountInfo(): populated TapeMountDecisionInfo with potential mounts.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","totalTime":9
6456803 15.622591,"fetchMountPolicyCatalogueTime":0.016533,"getScheduledJobSummariesTime":915.603612,"fetchDriveStateTime":0.002378,"getDriveStatesTime":6.4e-05}
6456804
{"epoch_time":1734229694.490999140,"local_time":"2024-12-15T03:28:14+0100","hostname":"tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":45044,"tid":45044,"message":"In RelationalDB::get
6456804 MountInfoNoLock(): success.","drive_name":"VDSTK02","instance":"CI","sched_backend":"postgres","thread":"MainThread","tapeDrive":"VDSTK02","fetchNoLockTime":0.000477,"fetchMountInfoTime":915.623
6456804 604}
This shows that the count(*) does take too much time and we need to implement the counters !
Edited by Jaroslav Guenther