Contention when queuing on archive queues; in-memory staging needed. To be extended later to retrieve queues and agent objects. General ticket for contention issues.
After several investigations (see below and #58 (closed)), we should:
- Improve the locking strategy of the root entry to find/create:
  - Archive queue
  - Retrieve queue
- Create an in-memory caching system for queuing/adding to:
  - Archive queue
  - Retrieve queue
  - Agent
When running a test with 1000 files, German had many frontend threads competing for the exclusive lock on the same archive queue, each blocked with this kind of stack trace:
Thread 6920 (Thread 0x7f777c678700 (LWP 404)):
#0 0x00007f77b6a636d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f77a3e74027 in Throttle::_wait(long) () from /lib64/librados.so.2
#2 0x00007f77a3e74d4f in Throttle::get(long, long) () from /lib64/librados.so.2
#3 0x00007f77a3e0a2af in Objecter::_throttle_op(Objecter::Op*, ceph::shunique_lock<boost::shared_mutex>&, int) () from /lib64/librados.so.2
#4 0x00007f77a3e171f8 in Objecter::_op_submit_with_budget(Objecter::Op*, ceph::shunique_lock<boost::shared_mutex>&, unsigned long*, int*) () from /lib64/librados.so.2
#5 0x00007f77a3e1742d in Objecter::op_submit(Objecter::Op*, unsigned long*, int*) () from /lib64/librados.so.2
#6 0x00007f77a3dd70c0 in librados::IoCtxImpl::operate(object_t const&, ObjectOperation*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, int) () from /lib64/librados.so.2
#7 0x00007f77a3da03a3 in librados::IoCtx::operate(std::string const&, librados::ObjectWriteOperation*) () from /lib64/librados.so.2
#8 0x00007f77a409db61 in rados::cls::lock::lock(librados::IoCtx*, std::string const&, std::string const&, ClsLockType, std::string const&, std::string const&, std::string const&, utime_t const&, unsigned char) () from /lib64/librados.so.2
#9 0x00007f77a3da0caf in librados::IoCtx::lock_exclusive(std::string const&, std::string const&, std::string const&, std::string const&, timeval*, unsigned char) () from /lib64/librados.so.2
#10 0x00007f77b0da579a in cta::objectstore::BackendRados::lockExclusive (this=0x14b74e0, name="archiveQueue-OStoreDBFactory-ctafrontend-292-20170206-17:37:41-109") at /usr/src/debug/cta-0-76860git40d8e08c/objectstore/BackendRados.cpp:165
#11 0x00007f77b165e786 in cta::objectstore::ScopedExclusiveLock::lock (this=0x7f777c6766b0, oo=...) at /usr/src/debug/cta-0-76860git40d8e08c/objectstore/ObjectOps.hpp:231
#12 0x00007f77b165090d in cta::OStoreDB::getLockedAndFetchedArchiveQueue (this=0x14b84f0, archiveQueue=..., archiveQueueLock=..., tapePool="ctasystest") at /usr/src/debug/cta-0-76860git40d8e08c/scheduler/OStoreDB/OStoreDB.cpp:302
#13 0x00007f77b1650f20 in cta::OStoreDB::queueArchive (this=0x14b84f0, instanceName="ctaeos", request=..., criteria=...) at /usr/src/debug/cta-0-76860git40d8e08c/scheduler/OStoreDB/OStoreDB.cpp:350
#14 0x00007f77b163dedf in cta::Scheduler::queueArchive (this=0x1651f00, instanceName="ctaeos", request=...) at /usr/src/debug/cta-0-76860git40d8e08c/scheduler/Scheduler.cpp:78
#15 0x00007f77b1c09a66 in cta::xrootPlugins::XrdCtaFile::xCom_archive (this=0x7f7768302710) at /usr/src/debug/cta-0-76860git40d8e08c/xroot_plugins/XrdCtaFile.cpp:2029
#16 0x00007f77b1be290a in cta::xrootPlugins::XrdCtaFile::dispatchCommand (this=0x7f7768302710) at /usr/src/debug/cta-0-76860git40d8e08c/xroot_plugins/XrdCtaFile.cpp:165
#17 0x00007f77b1be31a7 in cta::xrootPlugins::XrdCtaFile::open (this=0x7f7768302710, fileName=0x7f777028b000 "/L3Vzci9iaW4vY3Rh&YXJjaGl2ZQ==&LS11c2Vy&YWRt&LS1ncm91cA==&YWRt&LS1kaXNraWQ=&Mjg1Mg==&LS1pbnN0YW5jZQ==&ZW9zY3Rh&LS1zcmN1cmw=&cm9vdDovL2N0YWVvcy5nZXJtYW4yMS5zdmMuY2x1c3Rlci5sb2NhbC8vZW9zL2N0YWVvcy9jdGEv"...,
openMode=0, createMode=384, client=0x7f77303874c8, opaque=0x0) at /usr/src/debug/cta-0-76860git40d8e08c/xroot_plugins/XrdCtaFile.cpp:215
#18 0x00007f77b7157340 in XrdXrootdProtocol::do_Open() () from /lib64/libXrdServer.so.2
#19 0x00007f77b6ed8ebd in XrdLink::DoIt() () from /lib64/libXrdUtils.so.2
#20 0x00007f77b6edc29f in XrdScheduler::Run() () from /lib64/libXrdUtils.so.2
#21 0x00007f77b6edc3e9 in XrdStartWorking(void*) () from /lib64/libXrdUtils.so.2
#22 0x00007f77b6e9fb57 in XrdSysThread_Xeq () from /lib64/libXrdUtils.so.2
#23 0x00007f77b6a5fdc5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f77b5d6573d in clone () from /lib64/libc.so.6
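We need to create an in-memory staging singleton so that many archive requests can be queued at once in an archive queue: requests arriving while the queue lock is held are accumulated in memory and written in a single locked update, instead of each thread taking the lock separately.
The same solution will be needed for retrieve queues and agent objects.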
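
A minimal sketch of what such a staging singleton could look like, to make the idea concrete. This is not the CTA implementation: `QueueStaging`, `StagedRequest`, `Committer` and `stageFromThreads` are hypothetical names, and the per-queue `std::mutex` only stands in for the exclusive lock taken on the object store. The first thread to stage a request for a queue becomes the batch leader; threads arriving while the leader waits for the lock join its batch and block until the leader has committed everything in one go.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an archive request (not the CTA type).
struct StagedRequest { std::string fileId; };

class QueueStaging {
public:
  // Performs ONE write of a whole batch; called with the queue lock held.
  using Committer =
      std::function<void(const std::string&, const std::vector<StagedRequest>&)>;

  static QueueStaging& instance() {
    static QueueStaging s;  // function-local static: thread-safe since C++11
    return s;
  }

  // Stages a request and returns once the batch containing it is committed.
  void enqueue(const std::string& queueName, StagedRequest req,
               const Committer& commit) {
    std::shared_ptr<Batch> batch;
    std::mutex* queueLock = nullptr;
    bool leader = false;
    {
      std::lock_guard<std::mutex> g(m_mutex);
      auto& slot = m_pending[queueName];
      if (!slot) { slot = std::make_shared<Batch>(); leader = true; }
      slot->requests.push_back(std::move(req));
      batch = slot;
      queueLock = &m_queueLocks[queueName];  // map references stay valid
    }
    if (!leader) {
      // Follower: wait on a private copy of the future; rethrows on failure.
      std::shared_future<void> done = batch->doneFuture;
      done.get();
      return;
    }
    // Leader: take the (simulated) queue lock. Followers pile up meanwhile.
    std::lock_guard<std::mutex> locked(*queueLock);
    {
      std::lock_guard<std::mutex> g(m_mutex);
      m_pending.erase(queueName);  // later arrivals start a new batch
    }
    try {
      commit(queueName, batch->requests);
      batch->done.set_value();
    } catch (...) {
      batch->done.set_exception(std::current_exception());
      throw;
    }
  }

private:
  struct Batch {
    std::vector<StagedRequest> requests;
    std::promise<void> done;
    std::shared_future<void> doneFuture = done.get_future().share();
  };
  std::mutex m_mutex;  // protects both maps and the pending batches
  std::map<std::string, std::shared_ptr<Batch>> m_pending;
  std::map<std::string, std::mutex> m_queueLocks;  // stand-in for remote locks
};

// Demo: n threads stage one request each on the same hypothetical queue.
// Returns {number of commit calls, total number of requests committed}.
std::pair<std::size_t, std::size_t> stageFromThreads(std::size_t n) {
  std::size_t commits = 0, total = 0;  // only modified under the queue lock
  auto commit = [&](const std::string&, const std::vector<StagedRequest>& b) {
    ++commits;
    total += b.size();
  };
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < n; ++i)
    threads.emplace_back([&, i] {
      QueueStaging::instance().enqueue(
          "archiveQueue-demo", {"file" + std::to_string(i)}, commit);
    });
  for (auto& t : threads) t.join();
  return {commits, total};
}
```

Calling `stageFromThreads(1000)` commits all 1000 requests, but the number of commit calls (and therefore of lock acquisitions) depends on timing: it can be anywhere between 1 and 1000, and drops as the lock gets slower to obtain, which is exactly the contended case shown in the stack trace. The same pattern would apply to retrieve queues and agent objects by keying the staging area on their object names.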