CTA logs a lot of messages and perhaps not all of these messages are useful. Review which messages are logged most frequently and decide if, for example, some of them could be moved from INFO to DEBUG level.
Designs
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Do we need to log a million debug messages every day in production? We could switch the default level to INFO and remove these from the logs.
5 "DEBUG" "Created threads in DiskWriteThreadPool::DiskWriteThreadPool" 5 "DEBUG" "Deleted threads in DiskWriteThreadPool::~DiskWriteThreadPool" 5 "DEBUG" "Finishing RecallReportPacker thread" 5 "DEBUG" "Finishing RecallTaskInjector thread" 5 "DEBUG" "In BackendRados::BackendRados(): created rados contexts" 5 "DEBUG" "In DriveHandler::createCatalogue(): will connect to catalogue." 5 "DEBUG" "In DriveHandler::createCatalogue(): will get catalogue login information." 5 "DEBUG" "In DriveHandler::processEvent(): Got a peer disconnect: closing socket and waiting for SIGCHILD" 5 "DEBUG" "In DriveHandler::runChild(): will connect to object store backend." 5 "DEBUG" "In DriveHandler::runChild(): will create agent entry. Enabling leaving non-empty agent behind." 5 "DEBUG" "In DriveHandler::runChild(): will create scheduler." 5 "DEBUG" "In DriveHandler::runChild(): will ping scheduler." 5 "DEBUG" "In RecallReportPacker::reportEndOfSession(), pushing a report." 5 "DEBUG" "In RecallReportPacker::WorkerThread::run(): all disk threads are finished, telling the mount that Disk threads are complete" 5 "DEBUG" "No more files to read from tape" 5 "DEBUG" "Starting read session cleanup. Signalled end of session to task injector." 5 "DEBUG" "Starting RecallReportPacker thread" 5 "DEBUG" "Starting RecallTaskInjector thread" 5 "DEBUG" "WARNING: In DriveHandler::processRunning(): total bytes moved going backwards" 8 "DEBUG" "In RecallJobInjector::WorkerThread::run(): got empty list, but not last call. NoOp." 11 "DEBUG" "Disk space reservation for next job batch" 11 "DEBUG" "In RetrieveMount::reserveDiskSpace(): reservation request." 14 "DEBUG" "WARNING: In OStoreDB::fetchMountInfo(): failed to lock/fetch an archive queue for user. Skipping it." 17 "DEBUG" "In cta::RetrieveMount::flushAsyncSuccessReports(): deleted complete retrieve jobs." 17 "DEBUG" "In RetrieveMount::releaseDiskSpace(): reservation release request." 18 "DEBUG" "RecallJobInjector:run: about to call client interface" 30 "DEBUG" "In RecallReportPacker::ReportDriveStatus::execute(): reporting drive status." 30 "DEBUG" "In RecallReportPacker::reportDriveStatus(), pushing a report." 50 "DEBUG" "DiskWriteWorkerThread exiting: no more work" 52 "DEBUG" "In ContainerTraits<ArchiveQueue,C>::getLockedAndFetchedNoCreate(): could not de-referenced missing queue from root entry: already done." 55 "DEBUG" "RdbmsCatalogue::updateTapeDriveStatistics(): It didn't update statistics" 60 "DEBUG" "BackendRados::BackendRados() about to create a new context" 60 "DEBUG" "BackendRados::BackendRados() context created. About to set namespace." 60 "DEBUG" "BackendRados::BackendRados() namespace set. About to test access." 98 "DEBUG" "In Scheduler::getNextMount(): No valid mount found." 164 "DEBUG" "In DiskReportRunner::runOnePass(): ready to process archive reports." 172 "DEBUG" "About to open disk file for writing" 172 "DEBUG" "In RecallReportPacker::reportCompletedJob(), pushing a report." 172 "DEBUG" "Successfully opened the tape file" 177 "DEBUG" "WARNING: In OStoreDB::fetchMountInfo(): failed to lock/fetch a retrieve queue. Skipping it." 207 "DEBUG" "In RecallReportPacker::WorkerThread::run(): Got a new report." 227 "DEBUG" "going to report" 2718 "DEBUG" "Async delete of archiveRequest complete" 2718 "DEBUG" "Will start async delete archiveRequest" 2890 "DEBUG" "In AgentReference::appyAction(): added object to ownership (by batch)." 2890 "DEBUG" "In AgentReference::appyAction(): removed object from ownership (by batch)." 5310 "DEBUG" "In Scheduler::getNextMountDryRun(): No valid mount found." 5517 "DEBUG" "WARNING: In OStoreDB::fetchMountInfo(): failed to lock/fetch an archive queue for repack. Skipping it." 6415 "DEBUG" "Computed new timeout" 16523 "DEBUG" "In Scheduler::sortAndGetTapesForMountInfo(): Will consider potential mount" 47942 "DEBUG" "No new mount found. (sleeping 10 seconds)" 56227 "DEBUG" "In MaintenanceHandler::exceptionThrowingRunChild(): About to do a maintenance pass." 270742 "DEBUG" "In Scheduler::sortAndGetTapesForMountInfo(): tapeAlreadyInUse found." 404914 "DEBUG" "In Scheduler::sortAndGetTapesForMountInfo(): Removing potential mount not passing criteria"
ERROR messages look mostly useful and there are not too many of them.
Stack trace is not a particularly useful reason why something failed (often appears in the monitoring, this message does not explain what the underlying cause was).
1 "ERROR" "RecallReportPacker::EndofSessionWithErrors has been reported but NO error was detected during the process" 1 "ERROR" "Tape thread complete for writing" 2 "ERROR" "End of recall session with error(s)" 2 "ERROR" "Failed to open tape file for writing" 2 "ERROR" "[FileReader::position] - Reading file mark at the end of file header: Failed ST read in DriveGeneric::readFileMark Errno=5: Input/output error" 2 "ERROR" "[FileReader::position] - Reading HDR2: Failed ST read with crc32c in DriveGeneric::readExactBlock Errno=5: Input/output error" 2 "ERROR" "[FileReader::position] - Reading UHL1: Failed ST read with crc32c in DriveGeneric::readExactBlock Errno=5: Input/output error" 2 "ERROR" "In RequestProc::ExecuteAction(): RSP_ERR_USER: In OStoreDB::getRepackInfo(): No repack request for this VID." 8 "ERROR" "SCSI error in positionToLogicalObject: status=0 host_status=0xe driver_status=0: SCSI command failed with host_status: Unknown host status code: 0xe" 13 "ERROR" "Exception while reading a file" 48 "ERROR" "In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error" 63 "ERROR" "SCSI error in positionToLogicalObject: status=0x2 host_status=0 driver_status=0x8: SCSI command failed with status CHECK CONDITION: Sense Information: Unit Attention: Power on, reset, or bus device reset occurred" 78 "ERROR" "Failed to open tape file for reading" 126 "ERROR" "Error reading a file in TapeReadFileTask" 126 "ERROR" "File writing to disk failed" 126 "ERROR" "In RecallReportPacker::ReportError::execute(): failing retrieve job after exception." 126 "ERROR" "Task failed: counting another error for this session" 2888 "ERROR" "Stack trace"
The reason Stack trace is so many errors is that there is one log line per stack trace frame. In reality there were 100-200 exceptions. This creates a lot of noise in the monitoring.
I have left the exception messages at ERROR (or CRIT) level, but changed the Stack trace messages to INFO level. This way they do not pollute the monitoring, but remain in the logs for forensic purposes.
WARN messages, the only one that stands out is the last one, Ignoring request to delete archive file because it does not exist in the catalogue. This seems to happen a lot.
Possible cause: CLOSEW is triggered, event is sent to CTA. Then file is deleted for some reason. Either file is bad (checksum or size) and FST deletes it, or user triggers a delete. This sends the DELETE event to CTA, which gets processed before the CLOSEW event as it is in a different queue. As the file has not been archived yet, this triggers the warning. (But we don't have a corresponding number of warnings when the archive event gets popped and the file to be archived no longer exists?)
The PostgreSQL SchedulerDB will allow better management of deleting in-flight requests and should eliminate this warning. In the meantime, is it useful or should we demote it to DEBUG?
1 "WARN" "Aborting recall mount startup: empty mount" 1 "WARN" "Lost database connection" 1 "WARN" "Notified client of end session with error" 2 "WARN" "Tape alert detected" 10 "WARN" "In OStoreDB::RetrieveMount::flushAsyncSuccessReports(): async deletion failed because the object does not exist anymore. " 63 "WARN" "In RAOManager::queryRAO(), failed to perform the RAO algorithm, will perform a linear RAO." 188 "WARN" "In OStoreDB::fetchMountInfo(): fetched an archive for user queue and that lasted more than 1 second." 344082 "WARN" "Ignoring request to delete archive file because it does not exist in the catalogue"
There are messages at INFO level which are Recall order of FSEQs using RAO: xxxx .... These messages should be changed so that the order of files to recall is a parameter, not the text of the MSG. Also move it to DEBUG level.
That leaves 222 unique messages at INFO level, 16 million messages in total.
1.5 million messages are In RequestMessage::process(): processing SSI event. This is a useful message, perhaps it would be better to specify which event is being processed in the message text (rather than in the parameter as at present).
We do already have one message when the SSI command completes:
72 "INFO" "In RequestMessage::processABORT_PREPARE(): canceled retrieve request." 140 "INFO" "In RequestMessage::processCLOSEW(): ignoring zero-length file." 284 "INFO" "In RequestMessage::process(): Admin command succeeded: drive ls" 284 "INFO" "In RequestMessage::process(): Admin command succeeded: version " 39801 "INFO" "In RequestMessage::process(): Admin command succeeded: " 84676 "INFO" "In RequestMessage::processPREPARE(): queued file for retrieve." 371433 "INFO" "In RequestMessage::processDELETE(): archive file deleted." 519747 "INFO" "In RequestMessage::processCLOSEW(): queued file for archive." 519927 "INFO" "In RequestMessage::processCREATE(): assigning new archive file ID."1495996 "INFO" "In RequestMessage::process(): processing SSI event"
Perhaps some GarbageCollector messages could be moved to DEBUG level:
12 "INFO" "In GarbageCollector::OwnedObjectSorter::fetchOwnedObjects(): skipping garbage collection of now gone object." 68 "INFO" "In GarbageCollector::cleanupDeadAgent(): agent entry removed." 68 "INFO" "In GarbageCollector::cleanupDeadAgent(): will cleanup dead agent." 798 "INFO" "In GarbageCollector::cleanupDeadAgent(): agent already deleted when trying to lock it. Skipping it." 8423 "INFO" "In GarbageCollector::OwnedObjectSorter::lockFetchAndUpdateArchiveJobs(): requeued archive job." 120668 "INFO" "In GarbageCollector::trimGoneTargets(): removed now gone agent." 122280 "INFO" "In GarbageCollector::acquireTargets(): started tracking an untracked agent"
Some OStoreDb messages could be moved to DEBUG level:
2 "INFO" "In OStoreDB::trimEmptyQueues(): deleted empty retrieve queue." 4 "INFO" "In OStoreDB::RepackRetrieveSuccessesReportBatch::report(): Processed a batch of reports." 8 "INFO" "In OStoreDB::trimEmptyQueues(): deleted empty archive queue." 50 "INFO" "In OStoreDB::fetchMountInfo(): fetched the drive register." 72 "INFO" "OStoreDB::cancelRetrieve(): will delete the retrieve request" 74 "INFO" "OStoreDB::cancelArchive(): will delete the archive request" 118 "INFO" "In OStoreDB::trimEmptyQueues(): will start trimming empty queues" 7053 "INFO" "In OStoreDB::getMountInfo(): success." 7219 "INFO" "In OStoreDB::RepackRequest::addSubrequests(): subrequest created." 7219 "INFO" "In OStoreDB::RepackRetrieveSuccessesReportBatch::report(), turned successful retrieve request in archive request." 8424 "INFO" "In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction." 31830 "INFO" "In OStoreDB::ArchiveMount::setJobBatchTransferred(): queued a batch of requests for reporting to user." 33898 "INFO" "In OStoreDB::ArchiveMount::setJobBatchTransferred(): set ArchiveRequests successful and queued for reporting." 60410 "INFO" "In OStoreDB::RetrieveMount::flushAsyncSuccessReports(), retrieve job successful" 60546 "INFO" "In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction." 70793 "INFO" "In OStoreDB::getMountInfoNoLock(): success." 84676 "INFO" "In OStoreDB::queueRetrieve_bottomHalf(): added job to queue (enqueueing finished)." 84676 "INFO" "In OStoreDB::queueRetrieve(): recorded request for queueing (enqueueing posted to thread pool)." 519747 "INFO" "In OStoreDB::queueArchive_bottomHalf(): Finished enqueueing request." 519747 "INFO" "In OStoreDB::queueArchive(): recorded request for queueing (enqueueing posted to thread pool)." 526162 "INFO" "In OStoreDB::setArchiveJobBatchReported(): deleted ArchiveRequest after completion and reporting." 526172 "INFO" "In OStoreDB::ArchiveMount::setJobBatchTransferred(): will queue request for reporting to user." 617878 "INFO" "In OStoreDB::queueArchive_bottomHalf(): added job to queue." 626190 "INFO" "In OStoreDB::ArchiveMount::setJobBatchTransferred(): received a job to be reported."
Scheduler messages seem pretty useful, they can stay as-is:
1 "INFO" "In Scheduler::expandRepackRequest(), repack request expanded" 1 "INFO" "In Scheduler::promoteRepackRequestsToToExpand(): Promoted repack request to "to expand"" 1 "INFO" "In Scheduler::queueRepack(): success." 2 "INFO" "In Scheduler::reportDriveConfig(): success." 7 "INFO" "In Scheduler::setDesiredDriveState(): success." 183 "INFO" "In Scheduler::getNextMount(): Selected next mount (archive)" 284 "INFO" "In Scheduler::getDriveStates(): success." 439 "INFO" "In Scheduler::getDriveState(): success." 463 "INFO" "In Scheduler::getNextMount(): Selected next mount (retrieve)" 657 "INFO" "In Scheduler::ping(): success." 665 "INFO" "In Scheduler::getQueuesAndMountSummaries(): success." 1960 "INFO" "In Scheduler::getNextMountDryRun(): Found a potential mount (archive)" 5709 "INFO" "In Scheduler::getNextMountDryRun(): Found a potential mount (retrieve)" 31829 "INFO" "In Scheduler::reportArchiveJobsBatch(): reported a batch of archive jobs." 40369 "INFO" "In Scheduler::authorizeAdmin(): success." 55116 "INFO" "In Scheduler::getNextMountDryRun(): logicalLibrary is disabled" 371433 "INFO" "In Scheduler::deleteArchive(): success." 526162 "INFO" "In Scheduler::reportArchiveJobsBatch(): report successful."
List of remaining messages that occurred more than 10,000 times:
13798 "INFO" "Finished creating tasks for migrating" 25556 "INFO" "Started Rados worker thread" 27351 "INFO" "In OracleCatalogue::copyArchiveFileToRecycleBinAndDelete: ArchiveFile moved to the recycle-bin." 27351 "INFO" "In RdbmsCatalogue::moveArchiveFileToRecycleLog(): ArchiveFile moved to the file-recycle-log." 31829 "INFO" "In DiskReportRunner::runOnePass(): did one round of archive reports." 31911 "INFO" "In ContainerTraits<ArchiveQueue_t,ArchiveQueue>::trimContainerIfNeeded(): deleted empty queue" 31976 "INFO" "In RootEntry::removeArchiveQueueAndCommit(): removed archive queue." 31983 "INFO" "In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements." 31994 "INFO" "In RootEntry::removeArchiveQueueAndCommit(): removed archive queue reference." 33762 "INFO" "Normal flush because thresholds was reached" 33898 "INFO" "Catalog updated for batch of jobs" 33898 "INFO" "In ArchiveMount::reportJobsBatchWritten(): recorded a batch of archive jobs in metadata." 46435 "INFO" "In Algorithms::popNextBatch(): elements retrieval complete." 46499 "INFO" "In Algorithms::popNextBatch(): did one round of elements retrieval." 53191 "INFO" "Retrieve job successfully deleted" 60519 "INFO" "File successfully read from tape" 60519 "INFO" "File successfully transfered to disk" 60560 "INFO" "Opened disk file for writing" 60573 "INFO" "Successfully positioned for reading" 61128 "INFO" "Created tasks for recalling a file" 83379 "INFO" "In Helpers::getLockedAndFetchedQueue<RetrieveQueue>(): Successfully found and locked a retrieve queue." 84676 "INFO" "Catalogue::prepareToRetrieve internal timings" 84676 "INFO" "Queued retrieve request" 141465 "INFO" "In ContainerTraits<ArchiveQueue_t,ArchiveQueue>::trimContainerIfNeeded(): could not delete a presumably empty queue" 519747 "INFO" "Queued archive request" 519927 "INFO" "Checked request and got next archive file ID" 626190 "INFO" "In cta::ArchiveMount::reportJobsBatchTransferred(), archive job successful." 626615 "INFO" "File successfully transmitted to drive" 627586 "INFO" "File successfully read from disk" 627942 "INFO" "Opened disk file for read" 632425 "INFO" "In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." 633783 "INFO" "Created tasks for migrating a file" 633783 "INFO" "Push a task into the DiskReadThreadPool" 683749 "INFO" "In MemQueue::sharedAddToNewQueue(): added batch of jobs to the queue." 683749 "INFO" "In SharedQueueLock::~SharedQueueLock(): unlocked the archive queue pointer."