Skip to content

Allow cta-taped to retry a SQL query after a server failure in the middle of a fetch.

Summary

We have been observing the following error on several tape server daemons, which result in those drives being put down:

{"epoch_time":1757324855.329813757,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"ERROR","pid":1275785,"tid":1275785,"message":"Notified client of end session with error","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","devFilename":"/dev/tape/by-name/IBMLIB1-LTO9-F12C4R2","errorMessage":"next failed: next failed for SQL statement \n    SELECT \n      TAPE.VID AS VID,\n      MEDIA_TYPE.MEDIA_TYPE_NAME AS MEDIA_TYPE,\n      TAPE.VENDOR AS VENDOR,\n      LOGICAL_LIBRARY.LOGICAL_LIBRARY_NAME AS LOGICAL_LIBRARY_NAME,\n      TAPE_POOL.TAPE_POOL_NAME AS TAPE_POOL_NAME,\n      VIRTUAL_ORGANIZATION.VIRTUAL_ORGANIZATION_NAME AS VO,\n      TAPE.ENCRYPTION_KEY_NAME AS ENCRYPTION_KEY_NAME,\n      MEDIA_TYPE.CAPACITY_IN_BYTES AS CAPACITY_IN_BYTES,\n      TAPE.DATA_IN_BYTES AS DATA_IN_BYTES,\n      TAPE.NB_MASTER_FILES AS NB_MASTER_FILES,\n      TAPE.MASTER_DATA_IN_BYTES AS MASTER_DATA_IN_BYTES,\n      TAPE.LAST_FSEQ AS LAST_FSEQ,\n      TAPE.IS_FULL AS IS_FULL,\n      TAPE.DIRTY AS DIRTY,\n      TAPE.PURCHASE_ORDER AS PURCHASE_ORDER,\n      PHYSICAL_LIBRARY.PHYSICAL_LIBRARY_NAME AS PHYSICAL_LIBRARY_NAME,\n\n      TAPE.IS_FROM_CASTOR AS IS_FROM_CASTOR,\n\n      TAPE.LABEL_FORMAT AS LABEL_FORMAT,\n\n      TAPE.LABEL_DRIVE AS LABEL_DRIVE,\n      TAPE.LABEL_TIME AS LABEL_TIME,\n\n      TAPE.LAST_READ_DRIVE AS LAST_READ_DRIVE,\n      TAPE.LAST_READ_TIME AS LAST_READ_TIME,\n\n      TAPE.LAST_WRITE_DRIVE AS LAST_WRITE_DRIVE,\n      TAPE.LAST_WRITE_TIME AS LAST_WRITE_TIME,\n\n      TAPE.READ_MOUNT_COUNT AS READ_MOUNT_COUNT,\n      TAPE.WRITE_MOUNT_COUNT AS WRITE_MOUNT_COUNT,\n\n      TAPE.VERIFICATION_STATUS AS VERIFICATION_STATUS,\n\n      TAPE.USER_COMMENT AS USER_COMMENT,\n\n      TAPE.TAPE_STATE AS TAPE_STATE,\n      TAPE.STATE_REASON AS STATE_REASON,\n      TAPE.STATE_UPDATE_TIME AS STATE_UPDATE_TIME,\n      TAPE.STATE_MODIFIED_BY AS STATE_MODIFIED_BY,\n\n      TAPE.CREATION_LOG_USER_NAME AS CREATION_LOG_USER_NAME,\n      TAPE.CREATION_LOG_HOST_NAME AS CREATION_LOG_HOST_NAME,\n      TAPE.CREATION_LOG_TIME AS CREATION_LOG_TIME,\n\n      TAPE.LAST_UPDATE_USER_NAME AS LAST_UPDATE_USER_NAME,\n      TAPE.LAST_UPDATE_HOST_NAME AS LAST_UPDATE_HOST_NAME,\n      TAPE.LAST_UPDATE_TIME AS LAST_UPDATE_TIME \n    FROM \n      TAPE \n    INNER JOIN TAPE_POOL ON \n      TAPE.TAPE_POOL_ID = TAPE_POOL.TAPE_POOL_ID \n    INNER JOIN LOGICAL_LIBRARY ON \n      TAPE.LOGICAL_LIBRARY_ID = LOGICAL_LIBRARY.LOGICAL_LIBRARY_ID \n    LEFT JOIN PHYSICAL_LIBRARY ON \n      LOGICAL_LIBRARY.PHYSICAL_LIBRARY_ID = PHYSICAL_LIBRARY.PHYSICAL_LIBRARY_ID \n    INNER JOIN MEDIA_TYPE ON \n      TAPE.MEDIA_TYPE_ID = MEDIA_TYPE.MEDIA_TYPE_ID \n    INNER JOIN VIRTUAL_ORGANIZATION ON \n      TAPE_POOL.VIRTUAL_ORGANIZATION_ID = VIRTUAL_ORGANIZATION.VIRTUAL_ORGANIZATION_ID\n   WHERE \n      LOGICAL_LIBRARY.LOGICAL_LIBRARY_NAME = :LOGICAL_LIBRARY_NAME\n     AND \n      TAPE.TAPE_STATE = :TAPE_STATE\n     ORDER BY TAPE.VID : ORA-25401: can not continue fetches\nHelp: https://docs.oracle.com/error-help/db/ora-25401/\n"}

According to the Oracle DB documentation the solution is to restart the query from the beginning, but we are probably failing to do it:

ORA-25401: can not continue fetches Cause: A failure occured since the last fetch on this statement. Failover was unable to bring the statement to its original state to allow continued fetches. Action: Reexecute the statement and start fetching from the beginning

Steps to reproduce

Unknown.

What is the current bug behaviour?

The full stack trace can be found here:

{"epoch_time":1757324855.319861318,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"ERROR","pid":1275785,"tid":1275785,"message":"Error while scheduling new mount. Putting the drive down. Stack trace follows.","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainTh
read","tapeDrive":"IBMLIB1-LTO9-F12C4R2"}
{"epoch_time":1757324855.320131803,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":0,"trace
Frame":"/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6b) [0x7f221e1a9fcd]"}
{"epoch_time":1757324855.320239276,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":1,"trace
Frame":"/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::basic_string_view<char, std::char_traits<char> >, bool)+0x91) [0x7f221e1ab095]"}
{"epoch_time":1757324855.320330598,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":2,"trace
Frame":"/lib64/libctardbmsocci.so.0(cta::rdbms::wrapper::OcciRset::next()+0x19b) [0x7f22240dd81d]"}
{"epoch_time":1757324855.320447279,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":3,"trace
Frame":"/lib64/libctardbms.so.0(cta::rdbms::Rset::next()+0xe4) [0x7f2225de76dc]"}
{"epoch_time":1757324855.320534643,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":4,"trace
Frame":"/lib64/libctacatalogue.so.0(cta::catalogue::RdbmsTapeCatalogue::getTapes[abi:cxx11](cta::rdbms::Conn&, cta::catalogue::TapeSearchCriteria const&) const+0x2ed8) [0x7f222497e296]"}
{"epoch_time":1757324855.320688466,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":5,"trace
Frame":"/lib64/libctacatalogue.so.0(cta::catalogue::RdbmsTapeCatalogue::getTapes[abi:cxx11](cta::catalogue::TapeSearchCriteria const&) const+0x5e) [0x7f222496e0a4]"}
{"epoch_time":1757324855.320797777,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":6,"trace
Frame":"/lib64/libctacatalogue.so.0(+0x3ed43e) [0x7f22249ed43e]"}
{"epoch_time":1757324855.320911953,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":7,"trace
Frame":"/lib64/libctacatalogue.so.0(+0x3ef760) [0x7f22249ef760]"}
{"epoch_time":1757324855.321005735,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":8,"trace
Frame":"/lib64/libctacatalogue.so.0(cta::catalogue::TapeCatalogueRetryWrapper::getTapes[abi:cxx11](cta::catalogue::TapeSearchCriteria const&) const+0x52) [0x7f22249ed4aa]"}
{"epoch_time":1757324855.321186697,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":9,"trace
Frame":"/lib64/libctascheduler.so.0(cta::Scheduler::sortAndGetTapesForMountInfo(std::unique_ptr<cta::SchedulerDatabase::TapeMountDecisionInfo, std::default_delete<cta::SchedulerDatabase::TapeMountDecisionInfo> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::
allocator<char> > const&, cta::utils::Timer&, std::map<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cta::common::dataStructures::MountType>, cta::Scheduler::MountCounts, std::less<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cta::common::dataStructures::MountType> >, std:
:allocator<std::pair<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cta::common::dataStructures::MountType> const, cta::Scheduler::MountCounts> > >&, std::map<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cta::common::dataStructures::MountType>, cta::Scheduler::MountCounts,
std::less<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cta::common::dataStructures::MountType> >, std::allocator<std::pair<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, cta::common::dataStructures::MountType> const, cta::Scheduler::MountCounts> > >&, std::set<std::__cxx11:
:basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<void>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::__cxx11::list<cta::catalogue::TapeForWriting, std::allocator<cta::catalogue::TapeForWriting> >&, double&, double&, double&, cta::log::LogContext&)+0x1b9) [0x7f22262a0aef]"}
{"epoch_time":1757324855.321400245,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":10,"trac
eFrame":"/lib64/libctascheduler.so.0(cta::Scheduler::getNextMountDryRun(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cta::log::LogContext&)+0x223) [0x7f22262a6779]"}
{"epoch_time":1757324855.321532841,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":11,"trac
eFrame":"/usr/bin/cta-taped() [0x50bc65]"}
{"epoch_time":1757324855.321633696,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":12,"trac
eFrame":"/usr/bin/cta-taped() [0x4b9bba]"}
{"epoch_time":1757324855.321718703,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":13,"trac
eFrame":"/usr/bin/cta-taped() [0x4b5e97]"}
{"epoch_time":1757324855.321802259,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":14,"trac
eFrame":"/usr/bin/cta-taped() [0x4efd09]"}
{"epoch_time":1757324855.321885104,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":15,"trac
eFrame":"/usr/bin/cta-taped() [0x4eecb3]"}
{"epoch_time":1757324855.321968306,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":16,"trac
eFrame":"/usr/bin/cta-taped() [0x4a79b0]"}
{"epoch_time":1757324855.322075615,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":17,"trac
eFrame":"/usr/bin/cta-taped() [0x4a76f0]"}
{"epoch_time":1757324855.322168086,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":18,"trac
eFrame":"/usr/bin/cta-taped() [0x4a71b1]"}
{"epoch_time":1757324855.322256195,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":19,"trac
eFrame":"/usr/bin/cta-taped() [0x491f05]"}
{"epoch_time":1757324855.322365674,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":20,"trac
eFrame":"/usr/bin/cta-taped() [0x492ab8]"}
{"epoch_time":1757324855.322452303,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":21,"trac
eFrame":"/lib64/libc.so.6(+0x295d0) [0x7f221d0295d0]"}
{"epoch_time":1757324855.322560454,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":22,"trac
eFrame":"/lib64/libc.so.6(__libc_start_main+0x80) [0x7f221d029680]"}
{"epoch_time":1757324855.322647996,"local_time":"2025-09-08T11:47:35+0200","hostname":"tpsrv064","program":"cta-taped","log_level":"INFO","pid":1275785,"tid":1275785,"message":"Stack trace","drive_name":"IBMLIB1-LTO9-F12C4R2","instance":"ctaproduction","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"IBMLIB1-LTO9-F12C4R2","traceFrameNumber":23,"trac
eFrame":"/usr/bin/cta-taped() [0x491975]"}

This failure seems to be triggered by the RdbmsTapeCatalogue::getTapes(...) function (RdbmsTapeCatalogue.cpp), which should automatically be retried, but this seems to be failing... (TapeCatalogueRetryWrapper.cpp).

What is the expected correct behaviour?

The CTA Catalogue retry wrapper should retry the query.

Relevant logs and/or screenshots

Possible causes

It seems that the retry wrapper requires an exception of type exception::LostDatabaseConnection to be caught. This error may be triggering a different exception, which leads to the retrying mechanism being bypassed.

Edited by Joao Afonso