The repack systemtest is failing relatively frequently by timing out on the last step:
```
Waiting for repack request on tape V00101 to be complete: Seconds passed = 599
Timed out after 600 seconds waiting for tape V00101 to be repacked
Result of Repack ls
c.time           repackTime c.user    vid    tapepool   providedFiles totalFiles totalBytes selectedFiles filesToRetrieve filesToArchive failed status
2024-12-03 14:14 10m22s     ctaadmin2 V00101 ctasystest 0             2143       32.9M      6429          0               200            0      Running

DestinationVID NbFiles totalSize
V00102         2143    32916480
V00106         2143    32916480
V00108         1943    29844480
```
Note that which tapes/files are problematic appears to be somewhat random. For example, in the run below the third destination tape is not even present:
```
Waiting for repack request on tape V00101 to be complete: Seconds passed = 599
Timed out after 600 seconds waiting for tape V00101 to be repacked
Result of Repack ls
c.time           repackTime c.user    vid    tapepool   providedFiles totalFiles totalBytes selectedFiles filesToRetrieve filesToArchive failed status
2024-12-03 00:53 10m21s     ctaadmin2 V00101 ctasystest 0             2143       32.9M      6429          0               3025           0      Running

DestinationVID NbFiles totalSize
V00102         2143    32915339
V00106         1261    19368960
```
This might (or might not) be related to the "message too long" error, which seems to be caused by a really long buffer. Below is an excerpt from a test job:
While it may or may not be related, this looks like a bug regardless. This is a huge log entry that should probably not be inserted as a single message. The `stillOpenFileForThread` entry is repeated many times (194 times in this single send).
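If the oversized entries really are just the same `stillOpenFileForThread` param repeated hundreds of times, one option would be to collapse such runs before the message body is built. The following is only a sketch of the idea, using a hypothetical param type and a hypothetical `collapseRepeatedParams` helper rather than the actual `cta::log` API:

```cpp
#include <iostream>
#include <list>
#include <string>
#include <utility>

// Hypothetical param type: name/value pairs as they would be attached to a log line.
using Param = std::pair<std::string, std::string>;

// Collapse runs of identically named params (e.g. 194 copies of
// "stillOpenFileForThread") into a few entries plus a suppression marker,
// so a single send cannot grow without bound.
std::list<Param> collapseRepeatedParams(const std::list<Param>& params,
                                        std::size_t maxPerName = 3) {
  std::list<Param> out;
  std::string lastName;
  std::size_t runLength = 0;
  for (const auto& p : params) {
    runLength = (p.first == lastName) ? runLength + 1 : 1;
    lastName = p.first;
    if (runLength <= maxPerName) {
      out.push_back(p);                                   // keep the first few
    } else if (runLength == maxPerName + 1) {
      out.emplace_back(p.first + "Suppressed", "true");   // mark that more were dropped
    }
  }
  return out;
}

int main() {
  std::list<Param> params;
  for (int i = 0; i < 194; ++i) {
    params.emplace_back("stillOpenFileForThread", "thread-" + std::to_string(i));
  }
  const auto collapsed = collapseRepeatedParams(params);
  std::cout << "kept " << collapsed.size() << " of " << params.size()
            << " params\n";  // prints: kept 4 of 194 params
}
```

Whether something like this belongs in the logger itself or in the caller that accumulates the params is an open question.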
```
(gdb) info args
oss = @0x7f8e4adfc770: <incomplete type>
fp = @0x7f8e4adfc700: { m_value = <error: Cannot access memory at address 0x7f8e5704224400>}
(gdb) frame
#0  0x00007f8e516dc215 in cta::log::operator<< (oss=..., fp=...) at /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp:202
202     in /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp
(gdb) up
#1  0x00007f8e516dca66 in cta::log::Logger::createMsgBody (this=0x1efe320, logLevel="INFO", msg="In DriveHandlerProxy::setRefreshLoggerHandler(): Waiting for refresh logger signal.", params=std::__cxx11::list = {...}, pid=1382) at /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp:276
276     in /usr/src/debug/cta-5-dev.el9.x86_64/common/log/Logger.cpp
(gdb) info args
this = 0x1efe320
logLevel = "INFO"
msg = "In DriveHandlerProxy::setRefreshLoggerHandler(): Waiting for refresh logger signal."
params = std::__cxx11::list = {
  [0] = {
    m_name = "SubprocessName",
    m_value = std::optional<std::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long, unsigned long, float, double, bool>> containing std::variant<std::string, long, unsigned long, float, double, bool> [index 0] = { "drive:VDSTK01" }
  }
}
pid = 1382
```
The params are fine in createMsgBody at the point where operator<< is called, but they are no longer valid inside operator<< itself. Most likely a lifetime issue somewhere. Inspecting it further...
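For context, the symptom above (a param that is perfectly readable in createMsgBody but garbage one frame down in operator<<) is what a dangling reference or pointer into a destroyed params container would produce. A minimal illustration of that class of bug, with simplified stand-in types rather than the actual CTA code:

```cpp
#include <iostream>
#include <list>
#include <sstream>
#include <string>

// Simplified stand-in for a log param: just a name and a value.
struct Param {
  std::string m_name;
  std::string m_value;
};

std::ostream& operator<<(std::ostream& os, const Param& p) {
  return os << p.m_name << "=\"" << p.m_value << "\"";
}

int main() {
  const Param* dangling = nullptr;
  {
    std::list<Param> params{{"SubprocessName", "drive:VDSTK01"}};
    dangling = &params.front();  // pointer into the container
  }                              // params destroyed here
  std::ostringstream oss;
  oss << *dangling;              // undefined behaviour: reads freed memory,
                                 // like frame #0 in the gdb session above
  std::cout << oss.str() << '\n';
}
```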
It seems that somewhere in the repack tests, we do the following:
```sh
kubectl -n ${NAMESPACE} exec ${EOS_MGM_POD} -c eos-mgm -- eos cp ${pathOfFilesToInject[$i]} $bufferDirectory/`printf "%9d\n" $fseqFile | tr ' ' 0`
```
This copy is performed on the EOS MGM pod itself. Since any eos command there is executed as root, the resulting files will be owned by root as well. This would explain some of the odd "Operation not permitted" log messages showing up. I'll try running this cp from the client pod instead and see if that helps.
```
#7  0x00000000004d181f in std::__future_base::_State_baseV2::_S_check<std::__future_base::_State_baseV2> (__p=<error reading variable: Cannot access memory at address 0x8>) at /usr/include/c++/11/future:562
#8  0x000000000051ce25 in std::promise<void>::_M_state (this=0x0) at /usr/include/c++/11/future:1373
#9  0x00007fbde5226ab4 in std::promise<void>::set_exception (this=0x0, __p=...) at /usr/include/c++/11/future:1356
#10 0x00007fbde4a4b84d in cta::objectstore::Sorter::executeArchiveAlgorithm<cta::objectstore::ArchiveQueueToTransferForRepack> (this=0x7ffe39560bd0, tapePool="systest3_repack", queueAddress="", jobs=std::__cxx11::list = {...}, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:61
#11 0x00007fbde4a43c62 in cta::objectstore::Sorter::dispatchArchiveAlgorithm (this=0x7ffe39560bd0, tapePool="systest3_repack", jobQueueType=@0x1e89340: cta::common::dataStructures::JobQueueType::JobsToTransferForRepack, queueAddress="", jobs=std::__cxx11::list = {...}, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:78
#12 0x00007fbde4a448b4 in cta::objectstore::Sorter::queueArchiveRequests (this=0x7ffe39560bd0, tapePool="systest3_repack", jobQueueType=@0x1e89340: cta::common::dataStructures::JobQueueType::JobsToTransferForRepack, archiveJobsInfos=std::__cxx11::list = {...}, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:154
#13 0x00007fbde4a4472e in cta::objectstore::Sorter::flushOneArchive (this=0x7ffe39560bd0, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:138
#14 0x00007fbde4a46bc1 in cta::objectstore::Sorter::flushAll (this=0x7ffe39560bd0, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/objectstore/Sorter.cpp:379
#15 0x00007fbde51f4ee9 in cta::OStoreDB::RepackRetrieveSuccessesReportBatch::report (this=0x229b670, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/OStoreDB/OStoreDB.cpp:3021
#16 0x00007fbde513f460 in cta::Scheduler::RepackReportBatch::report (this=0x7ffe39560ea8, lc=...) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/Scheduler.cpp:893
#17 0x00007fbde512802c in cta::RepackReportThread::run (this=0x7ffe39561120) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/RepackReportThread.cpp:33
#18 0x00007fbde512bc2b in cta::RepackRequestManager::runOnePass (this=0x7ffe39561488, lc=..., repackMaxRequestsToExpand=2) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/scheduler/RepackRequestManager.cpp:66
#19 0x00000000004e0279 in cta::tape::daemon::MaintenanceHandler::exceptionThrowingRunChild (this=0x1934580) at /usr/src/debug/cta-5-10122273git2b2ec94b.el9.x86_64/tapeserver/daemon/MaintenanceHandler.cpp:349
```
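Frames #7-#9 show std::promise<void>::set_exception() being entered with this == nullptr, and _S_check() then faulting on address 0x8, i.e. a member read through a null pointer. A null promise pointer is enough to reproduce that shape of crash; the sketch below is an illustration only and not the actual Sorter code path:

```cpp
#include <future>
#include <stdexcept>

int main() {
  // A promise slot that was never set up -- stands in for whatever ends up
  // null inside Sorter::executeArchiveAlgorithm().
  std::promise<void>* promise = nullptr;
  try {
    throw std::runtime_error("queueing failed");
  } catch (...) {
    // Undefined behaviour: set_exception() calls _M_state(), which reads a
    // member at a small offset from 'this' -- hence the "Cannot access
    // memory at address 0x8" seen in frame #7.
    promise->set_exception(std::current_exception());
  }
  return 0;
}
```

So the interesting question is probably how the code ends up with a null promise in the first place, rather than what set_exception does with it.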
The crash is accompanied by the following warning messages:
{"epoch_time":1738601109.056713118,"local_time":"2025-02-03T17:45:09+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"WARN","pid":317,"tid":317,"message":"In ContainerAlgorithms::referenceAndSwitchOwnership(): Encountered problems while requeuing a batch of elements","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","reportingType":"RetrieveSuccesses","C":"ArchiveQueueToTransferForRepack","tapepool":"systest3_repack","containerAddress":"ArchiveQueueToTransferForRepack-systest3_repack-Maintenance-cta-tpsrv01-0-317-20250203-17:34:51-0-11","queueJobsBefore":0,"queueBytesBefore":0,"queueJobsAfter":1207,"queueBytesAfter":18538810,"queueLockFetchTime":0.023534,"queueProcessAndCommitTime":0.012548,"asyncUpdateLaunchTime":0.142798,"asyncUpdateCompletionTime":2.449932,"requestsUpdatingTime":0.0225,"queueRecommitTime":0.048129,"queueUnlockTime":0.001352,"errorCount":293,"failedElementsAddresses":"RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7529 RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7535 RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7539 ........ "}
This is related to other messages we get: `In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction.` To my understanding, there are two issues here:
1. The job should not be owned when the maintenance process collects it (this probably shouldn't happen).
2. Because this goes wrong, we hit a race condition in the reporting. While this is a bug, fixing it most likely won't fix the underlying issue.
{"epoch_time":1738601071.519504996,"local_time":"2025-02-03T17:44:31+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"ERROR","pid":317,"tid":317,"message":"In OStoreDB::RepackArchiveReportBatch::report(): async file not deleted.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","reportingType":"ArchiveSuccesses","fileId":4294967334,"subrequestAddress":"RepackSubRequest-Maintenance-cta-tpsrv02-0-307-20250203-17:34:50-0-7504","fileBufferURL":"root://ctaeos//eos/ctaeos/repack/V00101/000000007","exceptionMsg":"In XRootdDiskFileRemover::remove(), fail to remove file. [ERROR] Server responded with an error: [3010] Unable to remove file /eos/ctaeos/repack/V00101/000000007 by tident=cta.317:490@[::ffff:10.244.0.192]; Operation not permitted\n code:400 errNo:3010 status:1"}
This might be related to leftovers of a previous test that were created with incorrect ownership settings. I am running a test where these files do not clash.
So it seems that fixing one of the previous tests to prevent clashes resolves this particular error, but repack is still getting stuck at some point; see e.g. https://gitlab.cern.ch/cta/CTA/-/jobs/50528231
However, we need more information to get to the root of the problem. The above was quickly extracted from a running test. I'll increase the timeout on the test and run it again to allow a bit more time to debug the problem.
We do have two crashes (and resulting core dumps) of the maintenance process in this particular job.
{"epoch_time":1738836109.903070039,"local_time":"2025-02-06T11:01:49+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"CRIT","pid":313,"tid":313,"message":"In BackendPopulator::~BackendPopulator(): error deleting agent (cta::exception::Exception). Backtrace follows.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv01-0-313-20250206-10:54:46-0) still owns objects. Here's the first few: RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1616 RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1617 RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1618 RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-1619 [... trimmed at 3 of 1553]"}
Which in turn might be related to:
{"epoch_time":1738836082.814181864,"local_time":"2025-02-06T11:01:22+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":25166,"tid":26392,"message":"In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","agentObject":"DriveProcess-VDSTK01-cta-tpsrv01-0-25166-20250206-11:01:14-0","jobObject":"RepackSubRequest-Maintenance-cta-tpsrv02-0-319-20250206-10:54:46-0-3144"}