Get rid of cta-rados-slow-calls/locking logs
Problem to solve
Currently the BackendRados objectstore logs the the slow calls and locking
Slow Calls sample output:
Jan 18 13:54:32.207211 PID="24668" TID="24702" op="In BackendRados::AsyncUpdater::commitCallback(): aio_write_full() callback" obj="RepackSubRequest-Maintenance-tpsrv675.cern.ch-26327-20240108-16:29:40-0-1383141" duration="9.24252"
Jan 18 13:54:32.208034 PID="24668" TID="24702" op="In BackendRados::AsyncUpdater::commitCallback(): aio_write_full() callback" obj="RepackSubRequest-Maintenance-tpsrv675.cern.ch-26327-20240108-16:29:40-0-1383022" duration="9.08921"
Jan 18 13:54:32.208235 PID="24668" TID="24702" op="In BackendRados::AsyncUpdater::fetchCallback(): aio_read callback" obj="RepackSubRequest-Maintenance-tpsrv675.cern.ch-26327-20240108-16:29:40-0-1383177" duration="8.29941"
Jan 18 13:54:32.208276 PID="24668" TID="24702" op="In BackendRados::AsyncUpdater::unlockCallback(): aio_unlock() callback" obj="RepackSubRequest-Maintenance-tpsrv675.cern.ch-26327-20240108-16:29:40-0-1383164" duration="8.29951"
Jan 18 13:54:37.517924 PID="24668" TID="24668" op="In BackendRados::lockBackoff(): radosCtx.lock_exclusive()" obj="RepackRequest-Frontend-ctaproductionfrontend02.cern.ch-20844-20240110-00:56:24-0-25725" duration="1.95364"
Locking sample output:
Thu Jan 18 13:51:19 2024 pid=24667 progname=/usr/bin/cta-taped totalCalls=2 averageAttempts=1 waitCount=0 minAttempts=1 maxAttempts=1 averageTotalTime=0.096724 minTotalTime=0.077932 maxTotalTime=0.115516 averageLatency=0.096724 minLatency=0.077932 maxLatency=0.115516 averageWaitTime=0 minWaitTime=0 maxWaitTime=0 averageLatencyMultiplier=0 minLatencyMultiplier=0 maxLatencyMultiplier=0 samplingTime=60.2281
Thu Jan 18 13:51:29 2024 pid=24668 progname=/usr/bin/cta-taped totalCalls=668 averageAttempts=156.754 waitCount=52022 minAttempts=1 maxAttempts=539 averageTotalTime=1.0415 minTotalTime=0.000741 maxTotalTime=5.36261 averageLatency=0.00107705 minLatency=0.000635 maxLatency=0.3761 averageWaitTime=0.0243185 minWaitTime=0 maxWaitTime=1.85 averageLatencyMultiplier=1.47518 minLatencyMultiplier=0 maxLatencyMultiplier=2 samplingTime=60.1799
Do we use this metrics for anything? Should we have something similar for Postgre backend?
Stakeholders
Proposal
- Do nothing as this will go away when we move to PostgreSQl.
- Log to taped log file (if this is not already happening).