Safely ignore deleted requests when requeueing a job queue
For more details, see ops issue https://gitlab.cern.ch/cta/operations/-/issues/994#note_6444520.
Problem
- When the RecallTaskInjector thread from cta-taped is unable to reserve the required disk space for the retrieve jobs, it applies backpressure by requeueing all the popped objects and by disabling the queue.
- This is done with m_retrieveMount.requeueJobBatch(m_jobs, m_lc); in RecallTaskInjector.cpp.
- If any of these objects no longer exists, a cta::exception::NoSuchObject exception is thrown, which causes the whole process to crash. This should not happen.
- Request objects may be deleted at any time, for example after canceling a request. Therefore, situations like this are expected and should be handled gracefully.
Here is an example of a crash, triggered by requeueing the non-existing object RetrieveRequest-Frontend-ctaproductionfrontend01.cern.ch-23457-20230126-14:50:53-0-3217436:
[1676514497.649313000] Feb 16 03:28:17.649313 tpsrv045.cern.ch cta-taped: LVL="ERROR" PID="22410" TID="22410" MSG="Aborting cta-taped on uncaught exception. Stack trace follows." Message="Uncaught exception of type 'cta::exception::NoSuchObject' in Thread.run(): >>>>In BackendRados::lockBackoff(): trying to lock a non-existing object: RetrieveRequest-Frontend-ctaproductionfrontend01.cern.ch-23457-20230126-14:50:53-0-3217436
/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x69) [0x7f36476a87f9]
/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x89) [0x7f36476aa05d]
/usr/bin/cta-taped(cta::exception::NoSuchObject::NoSuchObject(std::string const&)+0x37) [0x46e8b7]
/lib64/libctaobjectstore.so.0(cta::objectstore::BackendRados::lockBackoff(std::string const&, unsigned long, cta::objectstore::BackendRados::LockType, std::string const&, librados::v14_2_0::IoCtx&)+0xd50) [0x7f36499beebe]
/lib64/libctaobjectstore.so.0(cta::objectstore::BackendRados::lock(std::string const&, unsigned long, cta::objectstore::BackendRados::LockType, std::string const&)+0x5a) [0x7f36499a7ca6]
/lib64/libctaobjectstore.so.0(cta::objectstore::BackendRados::lockExclusive(std::string const&, unsigned long)+0x52) [0x7f36499a8bd0]
/lib64/libctascheduler.so.0(cta::objectstore::ScopedExclusiveLock::lock(cta::objectstore::ObjectOpsBase&, unsigned long)+0x9d) [0x7f364a21f529]
/lib64/libctascheduler.so.0(cta::objectstore::ScopedExclusiveLock::ScopedExclusiveLock(cta::objectstore::ObjectOpsBase&, unsigned long)+0x59) [0x7f364a21f3a5]
/lib64/libctascheduler.so.0(void __gnu_cxx::new_allocator<std::_List_node<cta::objectstore::ScopedExclusiveLock> >::construct<cta::objectstore::ScopedExclusiveLock, cta::objectstore::RetrieveRequest&>(cta::objectstore::ScopedExclusiveLock*, cta::objectstore::RetrieveRequest&)+0x5b) [0x7f364a2f7847]
/lib64/libctascheduler.so.0(void std::allocator_traits<std::allocator<std::_List_node<cta::objectstore::ScopedExclusiveLock> > >::construct<cta::objectstore::ScopedExclusiveLock, cta::objectstore::RetrieveRequest&>(std::allocator<std::_List_node<cta::objectstore::ScopedExclusiveLock> >&, cta::objectstore::ScopedExclusiveLock*, cta::objectstore::RetrieveRequest&)+0x45) [0x7f364a2da9f3]
/lib64/libctascheduler.so.0(std::_List_node<cta::objectstore::ScopedExclusiveLock>* std::list<cta::objectstore::ScopedExclusiveLock, std::allocator<cta::objectstore::ScopedExclusiveLock> >::_M_create_node<cta::objectstore::RetrieveRequest&>(cta::objectstore::RetrieveRequest&)+0x87) [0x7f364a2b4179]
/lib64/libctascheduler.so.0(void std::list<cta::objectstore::ScopedExclusiveLock, std::allocator<cta::objectstore::ScopedExclusiveLock> >::_M_insert<cta::objectstore::RetrieveRequest&>(std::_List_iterator<cta::objectstore::ScopedExclusiveLock>, cta::objectstore::RetrieveRequest&)+0x41) [0x7f364a272e69]
/lib64/libctascheduler.so.0(cta::objectstore::ScopedExclusiveLock& std::list<cta::objectstore::ScopedExclusiveLock, std::allocator<cta::objectstore::ScopedExclusiveLock> >::emplace_back<cta::objectstore::RetrieveRequest&>(cta::objectstore::RetrieveRequest&)+0x50) [0x7f364a238faa]
/lib64/libctascheduler.so.0(cta::OStoreDB::RetrieveMount::requeueJobBatch(std::list<std::unique_ptr<cta::SchedulerDatabase::RetrieveJob, std::default_delete<cta::SchedulerDatabase::RetrieveJob> >, std::allocator<std::unique_ptr<cta::SchedulerDatabase::RetrieveJob, std::default_delete<cta::SchedulerDatabase::RetrieveJob> > > >&, cta::log::LogContext&)+0x1bd) [0x7f364a1f7943]
/lib64/libctascheduler.so.0(cta::RetrieveMount::requeueJobBatch(std::vector<std::unique_ptr<cta::RetrieveJob, std::default_delete<cta::RetrieveJob> >, std::allocator<std::unique_ptr<cta::RetrieveJob, std::default_delete<cta::RetrieveJob> > > >&, cta::log::LogContext&)+0x11c) [0x7f364a13790c]
/usr/bin/cta-taped() [0x4d4870]
/usr/bin/cta-taped() [0x4d5132]
/usr/bin/cta-taped() [0x4d6416]
/lib64/libctacommon.so.0(cta::threading::Thread::pthread_runner(void*)+0xef) [0x7f36476d04d5]
/lib64/libpthread.so.0(+0x7ea5) [0x7f3647382ea5]
/lib64/libc.so.6(clone+0x6d) [0x7f36412c0b0d]
<<<< End of uncaught exception"
Solution
- Any non-existing request object should be ignored when requeueing: when a cta::exception::NoSuchObject exception is detected, just log a WARNING or ERROR and continue with the remaining jobs.
- This should probably be done inside m_retrieveMount.requeueJobBatch(m_jobs, m_lc).
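
The proposed behaviour can be sketched in isolation as follows. This is not the CTA implementation: NoSuchObject, FakeBackend, RequeueResult and requeueJobBatchTolerant are hypothetical stand-ins invented for illustration, and the real requeueJobBatch works on RetrieveJob objects and a cta::log::LogContext rather than on plain strings and std::cerr. The sketch only shows the per-object try/catch that lets the batch continue when one request has been deleted.

```cpp
#include <iostream>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for cta::exception::NoSuchObject.
struct NoSuchObject : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the object store backend. Locking an address
// that is not in the store throws, like the lock attempt in the trace above.
struct FakeBackend {
  std::set<std::string> objects;

  void lockExclusive(const std::string& address) const {
    if (objects.count(address) == 0) {
      throw NoSuchObject("trying to lock a non-existing object: " + address);
    }
  }
};

// Outcome of one requeue pass (illustrative only).
struct RequeueResult {
  std::vector<std::string> requeued;  // objects that were locked and requeued
  std::vector<std::string> skipped;   // objects that no longer exist
};

// Requeue every job in the batch, skipping those whose request object has
// been deleted in the meantime instead of letting the exception escape.
RequeueResult requeueJobBatchTolerant(const FakeBackend& backend,
                                      const std::vector<std::string>& jobAddresses) {
  RequeueResult result;
  for (const auto& address : jobAddresses) {
    try {
      backend.lockExclusive(address);
      // The real code would fetch the request and put it back in its queue here.
      result.requeued.push_back(address);
    } catch (const NoSuchObject& ex) {
      // Deleted requests are expected: log and carry on with the next job.
      std::cerr << "WARNING: skipping deleted request while requeueing: "
                << ex.what() << '\n';
      result.skipped.push_back(address);
    }
  }
  return result;
}
```

Catching the exception per object, rather than around the whole batch, means a single deleted request does not prevent the remaining jobs from being requeued.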