TStore hanging and/or crashing after DB intervention
Dear XDAQ team,
Today, during the sysadmin DB interventions, several TStore processes hung and/or crashed.
Attaching GDB to one of the hanging TStore processes showed this back trace:
{{{
Thread 3 (Thread 155556768 (LWP 30215)):
#0  0x00a2c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00bbb5cb in __read_nocancel () from /lib/tls/libpthread.so.0
#2  0x0174080f in snttread () from /opt/xdaq/lib/libclntsh.so.10.1
#3  0x0173dce1 in nttrd () from /opt/xdaq/lib/libclntsh.so.10.1
#4  0x01690bc0 in nsprecv () from /opt/xdaq/lib/libclntsh.so.10.1
#5  0x01694c8b in nsrdr () from /opt/xdaq/lib/libclntsh.so.10.1
#6  0x016743ee in nsdo () from /opt/xdaq/lib/libclntsh.so.10.1
#7  0x01673e97 in nsbrecv () from /opt/xdaq/lib/libclntsh.so.10.1
#8  0x016a8567 in nioqrc () from /opt/xdaq/lib/libclntsh.so.10.1
#9  0x017cacbf in ttcdrv () from /opt/xdaq/lib/libclntsh.so.10.1
#10 0x016afbc8 in nioqwa () from /opt/xdaq/lib/libclntsh.so.10.1
#11 0x0151b618 in upirtrc () from /opt/xdaq/lib/libclntsh.so.10.1
#12 0x0151b18e in upirtr () from /opt/xdaq/lib/libclntsh.so.10.1
#13 0x01490530 in kpurcs () from /opt/xdaq/lib/libclntsh.so.10.1
#14 0x01497b09 in kpu8lgn () from /opt/xdaq/lib/libclntsh.so.10.1
#15 0x0149168a in kpuauthxa () from /opt/xdaq/lib/libclntsh.so.10.1
#16 0x0149109f in kpuauth () from /opt/xdaq/lib/libclntsh.so.10.1
#17 0x014cdb2b in kpuspextend () from /opt/xdaq/lib/libclntsh.so.10.1
#18 0x014cedfe in kpuspgetfreesession () from /opt/xdaq/lib/libclntsh.so.10.1
#19 0x014cd1af in kpuspgetpooledsession () from /opt/xdaq/lib/libclntsh.so.10.1
#20 0x014cb589 in kpuspsessionget () from /opt/xdaq/lib/libclntsh.so.10.1
#21 0x0152298e in OCISessionGet () from /opt/xdaq/lib/libclntsh.so.10.1
#22 0x026b3d4f in oracle::occi::ConnectionImpl::openConnection () from /opt/xdaq/lib/libocci.so.10.1
#23 0x026b6bd4 in oracle::occi::ConnectionImpl::ConnectionImpl () from /opt/xdaq/lib/libocci.so.10.1
#24 0x026dd837 in oracle::occi::StatelessConnectionPoolImpl::getConnection () from /opt/xdaq/lib/libocci.so.10.1
#25 0x02867f75 in tstore::GlobalStatelessConnectionPool::getConnection () from /opt/xdaq/lib/libtstoreutils.so
#26 0x0286b9b9 in tstore::OraclePoolConnection::prepareConnection () from /opt/xdaq/lib/libtstoreutils.so
#27 0x0286cc2c in tstore::OraclePoolConnection::openConnection () from /opt/xdaq/lib/libtstoreutils.so
#28 0x035a178e in tstore::TStore::connectWithBasicAuthentication () from /opt/xdaq/lib/libtstore.so
#29 0x035a5176 in tstore::TStore::connect () from /opt/xdaq/lib/libtstore.so
#30 0x035e11c5 in xoap::Method<tstore::TStore>::invoke () from /opt/xdaq/lib/libtstore.so
#31 0x00ca545e in executive::SOAPDispatcher::processIncomingMessage () from /opt/xdaq/lib/libexecutive.so
#32 0x067a08a4 in pt::http::ReceiverLoop::onRequest () from /opt/xdaq/lib/libpthttp.so
#33 0x0679dcff in pt::http::ReceiverLoop::process () from /opt/xdaq/lib/libpthttp.so
#34 0x067ad2a0 in toolbox::task::Action<pt::http::ReceiverLoop>::invoke () from /opt/xdaq/lib/libpthttp.so
#35 0x00e877fd in toolbox::task::WaitingWorkLoop::process () from /opt/xdaq/lib/libtoolbox.so
#36 0x00e828d7 in toolbox::task::WorkLoop::run () from /opt/xdaq/lib/libtoolbox.so
#37 0x00e8091e in toolbox::task::thread_func () from /opt/xdaq/lib/libtoolbox.so
#38 0x00bb63cc in start_thread () from /lib/tls/libpthread.so.0
#39 0x00f9cc3e in clone () from /lib/tls/libc.so.6
}}}
The hanging TStore processes did not recover on their own. They only worked again once they were restarted, and this had to be done after each intervention.
We also observed several crashes like:
{{{
Thread 1 (process 13010):
#0  0x00000020 in ?? ()
#1  0x03c40dff in tstore::OraclePoolConnection::commit () from /opt/xdaq/lib/libtstoreutils.so
#2  0x07af45a5 in LoggingConnection::commit () from /opt/xdaq/lib/libtstore.so
#3  0x07b01cee in tstore::SQLViewInternal::insert () from /opt/xdaq/lib/libtstore.so
#4  0x07b0fd90 in tstore::SQLView::insert () from /opt/xdaq/lib/libtstore.so
#5  0x07a9c509 in tstore::TStore::insert () from /opt/xdaq/lib/libtstore.so
#6  0x07ad61c5 in xoap::Method<tstore::TStore>::invoke () from /opt/xdaq/lib/libtstore.so
#7  0x0099a45e in executive::SOAPDispatcher::processIncomingMessage () from /opt/xdaq/lib/libexecutive.so
#8  0x00edf8a4 in pt::http::ReceiverLoop::onRequest () from /opt/xdaq/lib/libpthttp.so
#9  0x00edccff in pt::http::ReceiverLoop::process () from /opt/xdaq/lib/libpthttp.so
#10 0x00eec2a0 in toolbox::task::Action<pt::http::ReceiverLoop>::invoke () from /opt/xdaq/lib/libpthttp.so
#11 0x002da7fd in toolbox::task::WaitingWorkLoop::process () from /opt/xdaq/lib/libtoolbox.so
#12 0x002d58d7 in toolbox::task::WorkLoop::run () from /opt/xdaq/lib/libtoolbox.so
#13 0x002d391e in toolbox::task::thread_func () from /opt/xdaq/lib/libtoolbox.so
#14 0x007593cc in start_thread () from /lib/tls/libpthread.so.0
#15 0x01039c3e in clone () from /lib/tls/libc.so.6
}}}
I am CCing Jacek so he can explain the interventions in more detail (I think they just restarted some Oracle server instances). Perhaps he could also advise us on the recommended way to handle these situations in the OCCI library.
Cheers, marc
PS: The logs and stderr files have been lost.