test-client is periodically failing. Specifically, taped is coredumping every once in a while. There seem to be two different core dumps going on (hopefully related to the same issue), 1 for the maintenance process and 1 for the taped process. Below is a stacktrace for the taped failure; it seems to be related to XRootD:
(gdb) bt full#0 0x00007f3519c7094c in __pthread_kill_implementation () from /lib64/libc.so.6No symbol table info available.#1 0x00007f3519c23646 in raise () from /lib64/libc.so.6No symbol table info available.#2 0x00007f3519c0d7f3 in abort () from /lib64/libc.so.6No symbol table info available.#3 0x00007f3519c0e130 in __libc_message.cold () from /lib64/libc.so.6No symbol table info available.#4 0x00007f3519c7a9f7 in malloc_printerr () from /lib64/libc.so.6No symbol table info available.#5 0x00007f3519c7ddd2 in _int_malloc () from /lib64/libc.so.6No symbol table info available.#6 0x00007f3519c7e809 in malloc () from /lib64/libc.so.6No symbol table info available.#7 0x00007f3519c5ab44 in _IO_file_doallocate () from /lib64/libc.so.6No symbol table info available.#8 0x00007f3519c687e0 in _IO_doallocbuf () from /lib64/libc.so.6No symbol table info available.#9 0x00007f3519c66155 in __GI__IO_file_seekoff () from /lib64/libc.so.6No symbol table info available.#10 0x00007f3519c641b3 in fseeko64 () from /lib64/libc.so.6No symbol table info available.#11 0x00007f3519d1e5ec in __nss_files_fopen () from /lib64/libc.so.6No symbol table info available.#12 0x00007f3519d2267d in _nss_files_gethostbyaddr_r () from /lib64/libc.so.6No symbol table info available.#13 0x00007f3519d02ddf in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6No symbol table info available.#14 0x00007f3519d0bdb0 in getnameinfo () from /lib64/libc.so.6No symbol table info available.#15 0x00007f3517cd4ff9 in XrdNetAddrInfo::Resolve() () from /lib64/libXrdUtils.so.3No symbol table info available.#16 0x00007f3517cd5583 in XrdNetAddrInfo::Name(char const*, char const**) () from /lib64/libXrdUtils.so.3No symbol table info available.#17 0x00007f3517359315 in XrdSecProtocolsssObject () from /lib64/libXrdSecsss-5.soNo symbol table info available.#18 0x00007f35173652b0 in XrdSecPManager::Get(char const*, XrdNetAddrInfo&, XrdSecBuffer&, XrdOucErrInfo*) () from /lib64/libXrdSec-5.soNo symbol table info available.#19 0x00007f3517363e18 in XrdSecGetProtocol () from /lib64/libXrdSec-5.soNo symbol table info available.#20 0x00007f351a319f4e in XrdCl::XRootDTransport::GetCredentials (this=this@entry=0x7f33fc02caf0, credentials=@0x7f3501fec538: 0x0, hsData=hsData@entry=0x7f34c00048b0, info=info@entry=0x7f33fc0026d0) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClXRootDTransport.cc:2643 protocolName = <error: Cannot access memory at address 0x1a1a1a1a1a1a1a1a> log = 0x1d07ca0 ei = {_vptr.XrdOucErrInfo = 0x7f351a477220 <vtable for XrdOucErrInfo+16>, ErrInfo = {static Max_Error_Len = 2048, static Path_Offset = 1024, user = 0x7f351a41fead "", ucap = 0, code = 0, message = "Plugin loaded secsss v5.6.1 from sec.protocol libXrdSecsss-5.so\000\000\000\000\000\000\000\000\000p\275\376\0015\177\000\000\220R>\0275\177\000\000\000\305\327\0275\177\000\000p\310\327\0275\177\000\000\242\000\000\000\000\000\000\000\346\000\000\000\000\000\000\200", '\000' <repeats 48 times>, "\242\000\000\000\000\000\000\000\346\000\000\000\000\000\000\200", '\000' <repeats 49 times>..., static uVMask = 65535, static uAsync = -2147483648, static uUrlOK = 1073741824, static uMProt = 536870912, static uReadR = 268435456, static uIPv4 = 134217728, static uIPv64 = 67108864, static uPrip = 33554432, static uLclF = 16777216, static uRedirFlgs = 8388608, static uEcRedir = 4194304}, ErrCB = 0x0, {ErrCBarg = 139864536256112, ErrEnv = 0x7f34c0000e70}, mID = 0, dOff = -1, reserved = 0, dataBuff = 0x0} authHandler = 0x7f3517363d60 <XrdSecGetProtocol> secuidc = <optimized out> secgidc = <optimized out> secuid = <optimized out> secgid = <optimized out> uidSetter = {pFsUid = <optimized out>, pFsGid = <optimized out>, pStreamName = <optimized out>, pPrevFsUid = -1, pPrevFsGid = -1, pOk = true} srvAddrInfo = @0x7f33fc028ef0: {<XrdNetAddrInfo> = {static noPort = 1, static noPortRaw = 2, static old6Map4 = 4, static prefipv4 = 8, IP = {v6 = {sin6_family = 2, sin6_port = 17924, sin6_flowinfo = 1191244810, sin6_addr = {__in6_u = {__u6_addr8 = "\000\000\000\000\000\000\000\000k for: r", __u6_addr16 = {0, 0, 0, 0, 8299, 28518, 14962, 29216}, __u6_addr32 = {0, 0, 1868963947, 1914714738}}}, sin6_scope_id = 980709231}, v4 = {sin_family = 2, sin_port = 17924, sin_addr = {s_addr = 1191244810}, sin_zero = "\000\000\000\000\000\000\000"}, Addr = { sa_family = 2, sa_data = "\004F\n\364\000G\000\000\000\000\000\000\000"}}, {sockAddr = 0x7f33fc028ef0, unixPipe = 0x7f33fc028ef0}, hostName = 0x0, addrLoc = {Country = "\000e", Region = 0 '\000', Locale = 0 '\000', TimeZone = -128 '\200', Flags = 0 '\000', Speed = 0, Latitude = 0, Longtitude = 0}, addrSize = 16, protType = 6 '\006', protFlgs = 0 '\000', sockNum = 0, protName = 0x0, static isTLS = 1 '\001'}, static PortInSpec = -2147483648}#21 0x00007f351a31a82a in XrdCl::XRootDTransport::DoAuthentication (this=this@entry=0x7f33fc02caf0, hsData=hsData@entry=0x7f34c00048b0, info=info@entry=0x7f33fc0026d0) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClXRootDTransport.cc:2414 urlParams = std::map with 8 elements = {["eos.injection"] = "1", ["eos.lfn"] = "fxid:12", ["eos.rgid"] = "0", ["eos.ruid"] = "0", ["eos.space"] = "default", ["eos.workflow"] = "retrieve_written", ["oss.asize"] = "15360", ["xrdcl.requuid"] = "5c6a17d3-f8ab-45b2-a8ce-6f8ceba7409a"} pars = 0x7f34c0002ea0 "&P=krb5,host/eos-mgm-0.eos-mgm.dev.svc.cluster.local@TEST.CTA&P=sss,0.+13:/etc/eos.keytab&P=unix\a" it = <optimized out> authBuffLen = <optimized out> st = {<XrdCl::Status> = {status = 0, code = 0, errNo = 0}, pMessage = ""} log = <optimized out> sInfo = @0x7f33fc028230: {status = XrdCl::XRootDStreamInfo::AuthSent, pathId = 0 '\000'} credentials = 0x0 protocolName = "" msg = <optimized out> req = <optimized out> reqBuffer = <optimized out>#22 0x00007f351a31d1f3 in XrdCl::XRootDTransport::HandShakeMain (this=this@entry=0x7f33fc02caf0, handShakeData=handShakeData@entry=0x7f34c00048b0, channelData=...) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClXRootDTransport.cc:586 st = {<XrdCl::Status> = {status = 0, code = 1, errNo = 0}, pMessage = ""} info = 0x7f33fc0026d0 sInfo = @0x7f33fc028230: {status = XrdCl::XRootDStreamInfo::AuthSent, pathId = 0 '\000'}#23 0x00007f351a31d541 in XrdCl::XRootDTransport::HandShake (this=0x7f33fc02caf0, handShakeData=0x7f34c00048b0, channelData=...) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClXRootDTransport.cc:486 info = 0x7f33fc0026d0 scopedLock = {mtx = 0x7f33fc002848}#24 0x00007f351a39e528 in XrdCl::AsyncSocketHandler::HandleHandShake (this=this@entry=0x7f33fc024ca0, msg=std::unique_ptr<XrdCl::Message> = {...}) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClAsyncSocketHandler.cc:583 st = {<XrdCl::Status> = {status = 31904, code = 464, errNo = 0}, pMessage = "\000\000\000\000\000\000\000\000\330\317\376\0015\177", '\000' <repeats 11 times>, "\317\376\0015\177\000\000\240L\002\3743\177", '\000' <repeats 26 times>, ' ' <repeats 32 times>, "0\320\376\0015\177\000\000\001\000\000\000\000\000\000\000\240L\002\3743\177\000\000\240|\320\001", '\000' <repeats 12 times>, "\020\300\000\3743\177\000\000\001\000\000\000\000\000\000\000e\3559\0325\177", '\000' <repeats 34 times>, "% ++! \"*"...} waitSeconds = <optimized out>#25 0x00007f351a39e933 in XrdCl::AsyncSocketHandler::OnReadWhileHandshaking (this=0x7f33fc024ca0) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClAsyncSocketHandler.cc:571 st = {<XrdCl::Status> = {status = 0, code = 0, errNo = 0}, pMessage = ""}#26 0x00007f351a39ed65 in XrdCl::AsyncSocketHandler::EventRead (type=1 '\001', this=0x7f33fc024ca0) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClAsyncSocketHandler.cc:248No locals.#27 XrdCl::AsyncSocketHandler::EventRead (type=1 '\001', this=0x7f33fc024ca0) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClAsyncSocketHandler.cc:234No locals.#28 XrdCl::AsyncSocketHandler::Event (this=0x7f33fc024ca0, type=1 '\001') at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClAsyncSocketHandler.cc:224No locals.#29 0x00007f351a306c56 in (anonymous namespace)::SocketCallBack::Event (this=0x7f33fc028f50, chP=<optimized out>, cbArg=<optimized out>, evFlags=<optimized out>) at /usr/src/debug/xrootd-5.6.1-1.el9.x86_64/xrootd/src/XrdCl/XrdClPollerBuiltIn.cc:83 ev = 1 '\001' log = 0x1d07ca0#30 0x00007f3517c911e7 in XrdSys::IOEvents::Poller::CbkXeq(XrdSys::IOEvents::Channel*, int, int, char const*) () from /lib64/libXrdUtils.so.3No symbol table info available.#31 0x00007f3517c924dc in XrdSys::IOEvents::PollE::Dispatch(XrdSys::IOEvents::Channel*, unsigned int) () from /lib64/libXrdUtils.so.3No symbol table info available.#32 0x00007f3517c926d8 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /lib64/libXrdUtils.so.3No symbol table info available.#33 0x00007f3517c8f0bd in XrdSys::IOEvents::BootStrap::Start(void*) () from /lib64/libXrdUtils.so.3No symbol table info available.#34 0x00007f3517c97698 in XrdSysThread_Xeq () from /lib64/libXrdUtils.so.3No symbol table info available.#35 0x00007f3519c6ec02 in start_thread () from /lib64/libc.so.6No symbol table info available.#36 0x00007f3519cf3c40 in clone3 () from /lib64/libc.so.6No symbol table info available.
We also get the following in the logs (this has been going on for a while now; what is causing this?)
{"epoch_time":1736241852.422857289,"local_time":"2025-01-07T10:24:12+0100","hostname":"tpsrv01-0","program":"cta-taped","log_level":"ERROR","pid":553,"tid":553,"message":"In Scheduler::reportRetrieveJobsBatch(): failed to report.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","fileId":4294977310,"reportType":"FailureReport","exceptionMSG":"In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Server responded with an error: [3000] Unable to trigger workflow - no workflow defined for <workflow>.<event> [EINVAL] &mgm.pcmd=event&mgm.fid=2734&mgm.logid=cta&mgm.event=sync::retrieve_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&mgm.errmsg=; Invalid argument code:400 errNo:3000 status:1"}
It seems like it doesn't recognize the sync::retrieve_failed event
The stack trace above is incomplete, so it's not very useful to know where the problem is. Please try compiling everything with frame pointers and debugging info enabled to fix it.
{"epoch_time":1736241852.422857289,"local_time":"2025-01-07T10:24:12+0100","hostname":"tpsrv01-0","program":"cta-taped","log_level":"ERROR","pid":553,"tid":553,"message":"In Scheduler::reportRetrieveJobsBatch(): failed to report.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","fileId":4294977310,"reportType":"FailureReport","exceptionMSG":"In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Server responded with an error: [3000] Unable to trigger workflow - no workflow defined for <workflow>.<event> [EINVAL] &mgm.pcmd=event&mgm.fid=2734&mgm.logid=cta&mgm.event=sync::retrieve_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&mgm.errmsg=; Invalid argument code:400 errNo:3000 status:1"}
It may simply be because we are trying to retrieve (or do any tape-related operation) on a file that has no workflow defined.
It may even be intentionally caused, as part of a system test that looks for some workflow related behaviour.
As such, I think this is probably unrelated to the segfaults.
The other segfault we are periodically getting is in the taped maintenance process. It fails with this log messages:
{"epoch_time":1737972756.175063849,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"ERROR","pid":373,"tid":373,"message":"In MaintenanceHandler::exceptionThrowingRunChild(): received an exception. Backtrace follows.","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected"}{"epoch_time":1737972756.175145222,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":0,"traceFrame":"/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7ff94ae383d1]"}{"epoch_time":1737972756.175192213,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":1,"traceFrame":"/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::basic_string_view<char, std::char_traits<char> >, bool)+0x91) [0x7ff94ae393e9]"}{"epoch_time":1737972756.175226788,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":2,"traceFrame":"/lib64/libctacommon.so.0(cta::exception::UserError::UserError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0x53) [0x7ff94ae3a8dd]"}{"epoch_time":1737972756.175260295,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":3,"traceFrame":"/lib64/libctacatalogue.so.0(cta::catalogue::UserSpecifiedANonExistentTape::UserError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0x3e) [0x7ff95163a346]"}{"epoch_time":1737972756.175296382,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":4,"traceFrame":"/lib64/libctacatalogue.so.0(cta::catalogue::RdbmsTapeCatalogue::modifyTapeState(cta::common::dataStructures::SecurityIdentity const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cta::common::dataStructures::Tape::State const&, std::optional<cta::common::dataStructures::Tape::State> const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x764) [0x7ff95162cefa]"}{"epoch_time":1737972756.175332600,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":5,"traceFrame":"/lib64/libctacatalogue.so.0(+0x3e72b2) [0x7ff9516a42b2]"}{"epoch_time":1737972756.175365718,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":6,"traceFrame":"/lib64/libctacatalogue.so.0(+0x3ec8e1) [0x7ff9516a98e1]"}{"epoch_time":1737972756.175437793,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":7,"traceFrame":"/lib64/libctacatalogue.so.0(cta::catalogue::TapeCatalogueRetryWrapper::modifyTapeState(cta::common::dataStructures::SecurityIdentity const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cta::common::dataStructures::Tape::State const&, std::optional<cta::common::dataStructures::Tape::State> const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)+0x7d) [0x7ff9516a4347]"}{"epoch_time":1737972756.175475387,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":8,"traceFrame":"/lib64/libctaobjectstore.so.0(cta::objectstore::QueueCleanupRunner::runOnePass(cta::log::LogContext&)+0x1154) [0x7ff95267c8e6]"}{"epoch_time":1737972756.175506106,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":9,"traceFrame":"/usr/bin/cta-taped() [0x4df177]"}{"epoch_time":1737972756.175559578,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":10,"traceFrame":"/usr/bin/cta-taped() [0x4deb1c]"}{"epoch_time":1737972756.175588778,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":11,"traceFrame":"/usr/bin/cta-taped() [0x4e36b9]"}{"epoch_time":1737972756.175617462,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":12,"traceFrame":"/usr/bin/cta-taped() [0x4e2663]"}{"epoch_time":1737972756.175648731,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":13,"traceFrame":"/usr/bin/cta-taped() [0x49c9a0]"}{"epoch_time":1737972756.175677623,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":14,"traceFrame":"/usr/bin/cta-taped() [0x49c6e0]"}{"epoch_time":1737972756.175706132,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":15,"traceFrame":"/usr/bin/cta-taped() [0x49c1a1]"}{"epoch_time":1737972756.175754054,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":16,"traceFrame":"/usr/bin/cta-taped() [0x486f05]"}{"epoch_time":1737972756.175784682,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":17,"traceFrame":"/usr/bin/cta-taped() [0x487ab8]"}{"epoch_time":1737972756.175813470,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":18,"traceFrame":"/lib64/libc.so.6(+0x295d0) [0x7ff94a0b25d0]"}{"epoch_time":1737972756.175842450,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":19,"traceFrame":"/lib64/libc.so.6(__libc_start_main+0x80) [0x7ff94a0b2680]"}{"epoch_time":1737972756.175872275,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","SubprocessName":"maintenanceHandler","exceptionMessage":"Cannot modify the state of the tape V00101 because it does not exist or because a recent state change has been detected","traceFrameNumber":20,"traceFrame":"/usr/bin/cta-taped() [0x486975]"}{"epoch_time":1737972756.181889607,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"ERROR","pid":373,"tid":373,"message":"In Agent::deleteAndUnregisterSelf: agent still owns objects. Here is a part of the list.","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","agentObject":"Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0","objects":"RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","startIndex":0,"endIndex":0,"totalObjects":1}{"epoch_time":1737972756.184497556,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"CRIT","pid":373,"tid":373,"message":"In BackendPopulator::~BackendPopulator(): error deleting agent (cta::exception::Exception). Backtrace follows.","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195"}{"epoch_time":1737972756.184563945,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":0,"traceFrame":"/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7ff94ae383d1]"}{"epoch_time":1737972756.184601864,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":1,"traceFrame":"/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::basic_string_view<char, std::char_traits<char> >, bool)+0x91) [0x7ff94ae393e9]"}{"epoch_time":1737972756.184634152,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":2,"traceFrame":"/lib64/libctaobjectstore.so.0(cta::objectstore::Agent::AgentStillOwnsObjects::Exception(std::basic_string_view<char, std::char_traits<char> >, bool)+0x4c) [0x7ff9524f8440]"}{"epoch_time":1737972756.184665362,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":3,"traceFrame":"/lib64/libctaobjectstore.so.0(cta::objectstore::Agent::removeAndUnregisterSelf(cta::log::LogContext&)+0x67c) [0x7ff9524f6994]"}{"epoch_time":1737972756.184695661,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":4,"traceFrame":"/lib64/libctaobjectstore.so.0(cta::objectstore::BackendPopulator::~BackendPopulator()+0x2d3) [0x7ff9526470cf]"}{"epoch_time":1737972756.184730833,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":5,"traceFrame":"/lib64/libctaobjectstore.so.0(cta::objectstore::BackendPopulator::~BackendPopulator()+0x27) [0x7ff952647537]"}{"epoch_time":1737972756.184760156,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":6,"traceFrame":"/usr/bin/cta-taped() [0x4b767b]"}{"epoch_time":1737972756.184788188,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":7,"traceFrame":"/usr/bin/cta-taped() [0x4b397b]"}{"epoch_time":1737972756.184816069,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":8,"traceFrame":"/usr/bin/cta-taped() [0x4b8265]"}{"epoch_time":1737972756.184843828,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":9,"traceFrame":"/usr/bin/cta-taped() [0x4dfe00]"}{"epoch_time":1737972756.184871232,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":10,"traceFrame":"/usr/bin/cta-taped() [0x4deb1c]"}{"epoch_time":1737972756.184898745,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":11,"traceFrame":"/usr/bin/cta-taped() [0x4e36b9]"}{"epoch_time":1737972756.184929221,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":12,"traceFrame":"/usr/bin/cta-taped() [0x4e2663]"}{"epoch_time":1737972756.184956998,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":13,"traceFrame":"/usr/bin/cta-taped() [0x49c9a0]"}{"epoch_time":1737972756.184984529,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":14,"traceFrame":"/usr/bin/cta-taped() [0x49c6e0]"}{"epoch_time":1737972756.185011999,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":15,"traceFrame":"/usr/bin/cta-taped() [0x49c1a1]"}{"epoch_time":1737972756.185039753,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":16,"traceFrame":"/usr/bin/cta-taped() [0x486f05]"}{"epoch_time":1737972756.185067197,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":17,"traceFrame":"/usr/bin/cta-taped() [0x487ab8]"}{"epoch_time":1737972756.185094931,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":18,"traceFrame":"/lib64/libc.so.6(+0x295d0) [0x7ff94a0b25d0]"}{"epoch_time":1737972756.185122743,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":19,"traceFrame":"/lib64/libc.so.6(__libc_start_main+0x80) [0x7ff94a0b2680]"}{"epoch_time":1737972756.185176360,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":373,"tid":373,"message":"Stack trace","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","errorMessage":"In Agent::removeAndUnregisterSelf: agent (agentObject=Maintenance-cta-tpsrv02-0-373-20250127-10:58:22-0) still owns objects. Here's the first few: RetrieveRequest-Frontend-cta-frontend-0-239-20250127-10:58:05-0-30195","traceFrameNumber":20,"traceFrame":"/usr/bin/cta-taped() [0x486975]"}{"epoch_time":1737972756.320643766,"local_time":"2025-01-27T11:12:36+0100","hostname":"cta-tpsrv02-0","program":"cta-taped","log_level":"INFO","pid":371,"tid":371,"message":"In signal handler, received SIGCHLD and propagations to other handlers","drive_name":"VDSTK02","instance":"CI","sched_backend":"ceph","signal":"Child exited","senderPID":373,"senderUID":0}
In particular the CRIT message: "In BackendPopulator::~BackendPopulator(): error deleting agent (cta::exception::Exception)
Looking at the timestamp, it seems to happen somewhere at the start of Step 14. here:
######################################################################################################## 13. Testing 'Tape state change from ACTIVE to DISABLED - queue preserved (1 copy only)########################################################################################################Setting up V00101 queue as ACTIVE...1737972748 Waiting for tape V00101 state to change to ACTIVE: Seconds passed = 0Requesting file prepare -s...Checking that the request was queued...Changing V00101 queue to DISABLED...1737972749 Waiting for tape V00101 state to change to DISABLED: Seconds passed = 0Checking that the request was not modified on the queue...Queue preserved, as expected.Cleaning up request and queues...OK######################################################################################################## 14. Testing 'Tape state change from DISABLED to ACTIVE - queue preserved (1 copy only)########################################################################################################Setting up V00101 queue as DISABLED...1737972759 Waiting for tape V00101 state to change to DISABLED: Seconds passed = 0Requesting file prepare -s...Checking that the request was queued...Changing V00101 queue to ACTIVE...1737972760 Waiting for tape V00101 state to change to ACTIVE: Seconds passed = 0Checking that the request was not modified on the queue...Queue preserved, as expected.Cleaning up request and queues...OK
While some of the previous issues were resolved. We have some new sporadic failures. They are most likely unrelated to what was mentioned above as this time around we are reaching a timeout.
See e.g.
{"epoch_time":1738444859.653360818,"local_time":"2025-02-01T22:20:59+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":69927,"tid":69982,"message":"Tape mounted for read/write access","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","thread":"TapeWrite","tapeDrive":"VDSTK01","tapeVid":"V00101","mountId":"25","vo":"vo","tapePool":"ctasystest","mediaType":"LTO8","logicalLibrary":"VDSTK01","mountType":"ArchiveForUser","vendor":"vendor","capacityInBytes":12000000000000,"drive_Slot":"smc0","MCMountTime":0.010462,"mode":"RW"}{"epoch_time":1738444859.682573525,"local_time":"2025-02-01T22:20:59+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":343,"tid":343,"message":"In signal handler, received SIGCHLD and propagations to other handlers","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","signal":"Child exited","senderPID":69927,"senderUID":0}{"epoch_time":1738444859.682657454,"local_time":"2025-02-01T22:20:59+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":343,"tid":343,"message":"Handler received SIGCHILD. Propagating to all handlers.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"signalHandler"}{"epoch_time":1738444859.682678040,"local_time":"2025-02-01T22:20:59+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":343,"tid":343,"message":"Propagated SIGCHILD.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","SubprocessName":"signalHandler"}{"epoch_time":1738444859.682790551,"local_time":"2025-02-01T22:20:59+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":343,"tid":343,"message":"Drive subprocess crashed. Will spawn a new one.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","tapeDrive":"VDSTK01","subprocessPid":69927,"IfSignaled":true,"TermSignal":6,"CoreDump":128}{"epoch_time":1738444859.682824932,"local_time":"2025-02-01T22:20:59+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"INFO","pid":343,"tid":343,"message":"Tape session finished","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","tapeDrive":"VDSTK01","subprocessPid":69927,"IfSignaled":true,"TermSignal":6,"CoreDump":128,"Error_sessionKilled":1,"killSignal":6,"status":"failure"}
This is the coredump. It seems to have been caused by xrootd:
(gdb) bt#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44#1 0x00007ff17548dad3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78#2 0x00007ff175440686 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26#3 0x00007ff17542a833 in __GI_abort () at abort.c:79#4 0x00007ff17542b170 in __libc_message (fmt=fmt@entry=0x7ff1755bd47f "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:150#5 0x00007ff17551f87b in __GI___fortify_fail (msg=msg@entry=0x7ff1755bd425 "buffer overflow detected") at fortify_fail.c:24#6 0x00007ff17551e1e6 in __GI___chk_fail () at chk_fail.c:28#7 0x00007ff1737e4278 in strcpy ( __src=0x7ff126ff8e78 "\273\376\374\332\377\334ݻ\357\252\356\372\333\334ݻ\377\376\336\276\373\254뿾\316\335\373\254\354\fݻ\377\376\336\276\373\254뿾\316\335\373\254\353\277\336\336\336\277\b\237\254\v\377ڜ\fݻ\377\376\336\276\373\254\353\277ͭ\357\272\316\362\246l\fݻ\376\374\332\377\334ݻ\357\252\356\372\333\334ݻ\377\376\336\276\373\254뿾\316\335\373\254\354\fݻ\377\376\336\276\373\254뿾\316\335\373\254\353\277\336\336\336\277\b\237\246l\v\377\332f\233\274\332\353ډ1\357\217\372Ќݻ\377\376\336\276\373\254\353\277ͭ\357\272\316\362\273\246\230\315ۿ\357ͯ\375\315۾\372\256ﭽ\315ۿ\377\355\353\357\272λ\373\354\355", <incomplete sequence \337>..., __dest=0x7ff126ff8d10 "") at /usr/include/bits/string_fortified.h:79#8 XrdSecProtocolsss::getCredentials (this=0x7ff13c007bf0, parms=<optimized out>, einfo=0x7ff126ff90e0) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdSecsss/XrdSecProtocolsss.cc:688#9 0x00007ff175b048fc in XrdCl::XRootDTransport::GetCredentials (this=this@entry=0x7ff14002b690, credentials=@0x7ff126ff9998: 0x0, hsData=hsData@entry=0x7ff13c000df0, info=info@entry=0x7ff14002baf0) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClXRootDTransport.cc:2662#10 0x00007ff175b07365 in XrdCl::XRootDTransport::DoAuthentication (this=0x7ff14002b690, hsData=0x7ff13c000df0, info=0x7ff14002baf0) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClXRootDTransport.cc:2417#11 0x00007ff175afab93 in XrdCl::XRootDTransport::HandShakeMain (this=this@entry=0x7ff14002b690, handShakeData=handShakeData@entry=0x7ff13c000df0, channelData=...) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClXRootDTransport.cc:586#12 0x00007ff175afb2be in XrdCl::XRootDTransport::HandShake (this=0x7ff14002b690, handShakeData=0x7ff13c000df0, channelData=...) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClXRootDTransport.cc:486#13 0x00007ff175b77a59 in XrdCl::AsyncSocketHandler::HandleHandShake (this=this@entry=0x7ff14001f210, msg=std::unique_ptr<XrdCl::Message> = {...}) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClAsyncSocketHandler.cc:583#14 0x00007ff175b7801b in XrdCl::AsyncSocketHandler::OnReadWhileHandshaking (this=0x7ff14001f210) at /usr/include/c++/11/bits/unique_ptr.h:172#15 0x00007ff175b7867d in XrdCl::AsyncSocketHandler::EventRead (type=1 '\001', this=0x7ff14001f210) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClAsyncSocketHandler.cc:248#16 XrdCl::AsyncSocketHandler::EventRead (type=1 '\001', this=0x7ff14001f210) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClAsyncSocketHandler.cc:234#17 XrdCl::AsyncSocketHandler::Event (this=0x7ff14001f210, type=1 '\001') at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClAsyncSocketHandler.cc:224#18 0x00007ff175af243a in (anonymous namespace)::SocketCallBack::Event (this=0x7ff14001fb70, chP=<optimized out>, cbArg=<optimized out>, evFlags=<optimized out>) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdCl/XrdClPollerBuiltIn.cc:83#19 0x00007ff17419bedf in XrdSys::IOEvents::Poller::CbkXeq (this=this@entry=0x7ff14000b980, cP=cP@entry=0x7ff14001fb90, events=1, eNum=<optimized out>, eTxt=<optimized out>) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdSys/XrdSysIOEvents.cc:721#20 0x00007ff17419c590 in XrdSys::IOEvents::PollE::Dispatch (this=this@entry=0x7ff14000b980, cP=0x7ff14001fb90, pollEv=<optimized out>) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdSys/XrdSysIOEventsPollE.icc:274#21 0x00007ff17419c798 in XrdSys::IOEvents::PollE::Begin (this=0x7ff14000b980, syncsem=<optimized out>, retcode=<optimized out>, eTxt=<optimized out>) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdSys/XrdSysIOEventsPollE.icc:229#22 0x00007ff1741992c1 in XrdSys::IOEvents::BootStrap::Start (parg=0x7ff14f7f26b0) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdSys/XrdSysIOEvents.cc:149#23 0x00007ff1741a10bc in XrdSysThread_Xeq (myargs=0x7ff14000a020) at /usr/src/debug/xrootd-5.7.1-1.el9.x86_64/src/XrdSys/XrdSysPthread.cc:86#24 0x00007ff17548bd22 in start_thread (arg=<optimized out>) at pthread_create.c:443#25 0x00007ff175510d40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Specifically, the User and Grup appear to contain part of the log messages for some reason. There seems to be something fishy going on with the string length. The null terminator of the string is present, but for some reason the string doesn't stop there.
{"epoch_time":1739222061.204070637,"local_time":"2025-02-10T22:14:21+0100","hostname":"cta-tpsrv01-0","program":"cta-taped","log_level":"ERROR","pid":8294,"tid":8294,"message":"In MigrationReportPacker::~MigrationReportPacker(), still pending jobs report.","drive_name":"VDSTK01","instance":"CI","sched_backend":"ceph","thread":"MainThread","tapeDrive":"VDSTK01","mountId":"12","vo":"vo_repack","tapePool":"ctasystest"}
But this message may exist only because now we are logging when we still have remaining "pending jobs" in MigrationReportPacker , while before we were ignoring and not logging it.