Review severe operational issues of Sep/Oct 2023
Introduction
During the last couple of weeks we had several operational issues that severely degraded the CTA service. The purpose of this issue is to go through them all at once and evaluate the dev approach being taken to fix them.
Some of the issues:
1. Missing archive/retrieve queue shards
A. Continuous stream of empty mounts
A missing shard in the archive queue causes an unhandled exception during mounting. Archival is impossible unless the Public VO is entirely disabled!
Ops issue reference:
Operational fix:
- Hand-craft an archive queue shard to insert into the object store (same address, with valid values!); a minimal illustration follows this list. Multiple shards created by Michael, Joao and Julien were tried.
- Try to insert the shard and re-enable Public VO mounts. The last shard, created by Julien, worked.
- Let the shard be drained...
- (!) At the end, the queue was not fully removed.
- @jleduc removed the last bits manually, and reinjected the EOS archive jobs (https://gitlab.cern.ch/cta/operations/-/work_items/1193).
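For illustration only, here is a minimal sketch of the idea behind the workaround: the queue still references a shard address that no longer exists, so an empty-but-valid object is recreated at that exact address so the reference resolves again. The types, addresses and serialization below are hypothetical, not the actual CTA objectstore classes or format.

```cpp
// Hypothetical model of a key-value object store holding serialized queue
// shards; not real CTA code. It only shows why recreating an object at the
// dangling address lets the queue be traversed and drained again.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

struct ObjectStore {
  std::map<std::string, std::string> objects;  // address -> serialized blob
  std::string read(const std::string& address) const {
    auto it = objects.find(address);
    if (it == objects.end()) throw std::runtime_error("no such object: " + address);
    return it->second;
  }
  void write(const std::string& address, const std::string& blob) { objects[address] = blob; }
};

int main() {
  ObjectStore store;
  // The queue object still points at this shard address, but the shard is gone.
  const std::string danglingShardAddress = "ArchiveQueueShard-frontend-123";
  try {
    store.read(danglingShardAddress);  // what the mount logic effectively did
  } catch (const std::exception& ex) {
    std::cerr << "unhandled before the workaround: " << ex.what() << "\n";
  }
  // Operational workaround: hand-craft an empty-but-valid shard at the same
  // address so the reference resolves and the queue can be drained/removed.
  store.write(danglingShardAddress, "{\"jobs\":[]}");
  std::cout << "shard recreated: " << store.read(danglingShardAddress) << "\n";
  return 0;
}
```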
B. Continuous stream of empty staging mounts
A missing shard in the retrieve queue causes an unhandled exception during retrieve request queueing, which crashes the tape server (maintenance process) during requeuing.
Ops issue reference:
Operational fix:
- Move tape state to BROKEN, to allow it to be drained.
- Delete the queue and dereference it manually.
Permanent dev fix
The operational approach (e.g. crafting and inserting new shards) allowed us to fix the issue without any development work. However, we have identified several issues in the code, which were tackled:
#500 (closed) Safely handle empty shards in object store
Two main problems:
- Object store code could not handle missing shards. An unhandled exception was raised.
- A shard was deleted before dereferencing from the queue objects. Therefore, any exception would cause a dangling reference.
Fixes done (brief summary):
- Ignore missing shards during job dumping.
- Remove/clean missing shards during job popping.
- Log missing shard ERRORs.
- Remove the reference before deleting the shard object (see the sketch after this list).
- Add unit test.
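As a rough illustration of that last ordering fix, here is a minimal hypothetical sketch; it does not use the real CTA objectstore API, it only shows why dereferencing before deleting turns a dangling reference into, at worst, an orphaned object:

```cpp
// Hypothetical sketch of the ordering fix (not the actual CTA objectstore
// API): remove the queue's reference to a shard *before* deleting the shard
// object, so that a failure between the two steps leaves an unreferenced
// (harmless) object instead of a dangling reference to a missing shard.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Backend {
  void deleteObject(const std::string& address) {
    std::cout << "deleted object " << address << "\n";
  }
};

struct Queue {
  std::vector<std::string> shardAddresses;
  void removeShardReference(const std::string& address) {
    shardAddresses.erase(
        std::remove(shardAddresses.begin(), shardAddresses.end(), address),
        shardAddresses.end());
  }
  void commit() {
    std::cout << "queue committed with " << shardAddresses.size()
              << " shard reference(s)\n";
  }
};

// Old ordering (simplified): delete first, then dereference. If anything fails
// after the delete, the queue keeps pointing at a shard that no longer exists.
void removeShardUnsafe(Backend& backend, Queue& queue, const std::string& shard) {
  backend.deleteObject(shard);
  queue.removeShardReference(shard);
  queue.commit();
}

// Fixed ordering: dereference and commit first, then delete. A failure now
// leaves, at worst, an orphaned shard that garbage collection can clean up.
void removeShardSafe(Backend& backend, Queue& queue, const std::string& shard) {
  queue.removeShardReference(shard);
  queue.commit();
  backend.deleteObject(shard);
}

int main() {
  Backend backend;
  Queue queue{{"ArchiveQueueShard-1", "ArchiveQueueShard-2"}};
  removeShardSafe(backend, queue, "ArchiveQueueShard-1");
  removeShardUnsafe(backend, queue, "ArchiveQueueShard-2");
  return 0;
}
```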
Fixes not done (to discuss):
- Handle missing shard during job queueing.
  - Easy code in archive queue logic, but very complex in retrieve queue logic.
  - Redundant after the fixes presented above (dereferencing before deleting).
  - High risk, low reward.
#503 Improve handling of try-catch blocks of 'cta::exception::Exception'
Problem description:
- All these shard-related exceptions contain a backtrace (`cta::exception::Exception`), but the backtrace was not being properly logged.
- Logging the backtrace would have made debugging much easier.
- The way we handle exceptions follows several bad practices, as shown by SonarCloud (check link).
Proposed fix:
- Fix SonarCloud errors.
- Replace the throwing/catching of `cta::exception::Exception` with more meaningful exceptions.
- When catching `cta::exception::Exception` is necessary (e.g. at the top of a thread/process stack, before letting the thread/process fail), we should log the backtrace. No unhandled backtrace should be lost (see the sketch below).
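A minimal, self-contained sketch of the proposed pattern follows. The exception class and its backtrace accessor are made up for illustration; they are not the real `cta::exception::Exception` API.

```cpp
// Sketch: throw/catch specific exceptions where possible, and when the
// generic base has to be caught at the top of a thread's stack, log its
// backtrace so it is never silently lost. All types here are illustrative.
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

// Stand-in for an exception class that carries a backtrace captured at throw time.
class TracedException : public std::runtime_error {
public:
  TracedException(std::string what, std::string backtrace)
      : std::runtime_error(std::move(what)), m_backtrace(std::move(backtrace)) {}
  const std::string& backtrace() const { return m_backtrace; }
private:
  std::string m_backtrace;
};

// A more meaningful, specific exception that callers can handle deliberately.
class MissingShardException : public TracedException {
public:
  using TracedException::TracedException;
};

void doWork() {
  throw MissingShardException("shard not found in object store",
                              "frame#0 ...\nframe#1 ...");
}

void threadMain() {
  try {
    doWork();
  } catch (const MissingShardException& ex) {
    // Specific handling where it makes sense (e.g. skip/clean the shard).
    std::cerr << "handled missing shard: " << ex.what() << "\n";
  } catch (const TracedException& ex) {
    // Top-of-thread catch of the generic base: log message *and* backtrace
    // before letting the thread fail, so no backtrace is lost.
    std::cerr << "unexpected exception: " << ex.what() << "\nbacktrace:\n"
              << ex.backtrace() << "\n";
  } catch (const std::exception& ex) {
    std::cerr << "unexpected std::exception: " << ex.what() << "\n";
  }
}

int main() {
  std::thread t(threadMain);
  t.join();
  return 0;
}
```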
2. Several lock-related errors in production taped servers
Explicitly unlocking an already unlocked lock throws an exception. This happened during the garbage collection of repack retrieve requests, causing a tape server crash on every requeuing. The queue took hours to be GC'ed.
Ops issue reference:
Fix done:
- #460 (comment 7109417) Fix "trying to unlock an unlocked lock" error
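To make the failure mode concrete, here is a small illustrative sketch (not the actual fix in #460) of why explicitly unlocking an already unlocked lock throws, and how an idempotent release avoids crashing the caller:

```cpp
// Illustrative only: a scoped lock that remembers whether it still owns the
// lock, so releasing an already released lock can be a no-op instead of an
// exception that takes down the requeue/GC path.
#include <iostream>
#include <stdexcept>

class ScopedLock {
public:
  explicit ScopedLock(bool acquire = true) : m_locked(acquire) {}
  ~ScopedLock() {
    // Destructor must never throw: only release if we still hold the lock.
    if (m_locked) releaseUnchecked();
  }
  void release() {
    if (!m_locked)
      throw std::logic_error("trying to unlock an unlocked lock");  // old behaviour
    releaseUnchecked();
  }
  void releaseIfHeld() {
    // Idempotent variant: safe to call even if already released.
    if (m_locked) releaseUnchecked();
  }
private:
  void releaseUnchecked() {
    m_locked = false;
    std::cout << "lock released\n";
  }
  bool m_locked;
};

int main() {
  ScopedLock lock;
  lock.release();        // first release is fine
  lock.releaseIfHeld();  // no-op instead of throwing on the second attempt
  return 0;
}
```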
3. (Service degradation) Slow object store requests
Ops issue:
We got a very high number of these warnings:
`In OStoreDB::fetchMountInfo(): fetched a retrieve queue and that lasted more than 1 second`
- Logs show that taking a lock or object could take >20 seconds, sometimes more.
- Service (e.g. `cta-admin`) was very slow.
- Might be the original cause of the missing shards.
- Unclear what caused the slowdown, but it seems related to repacking...
Dev approach (to discuss):
- It's unclear why repack caused this (if it was repack at all!).
- Inspecting the logs is not helpful enough.
- To find the cause, we need a better metric logging system. It's important to move forward with this:
  - #266
  - Measuring the object store operations is a great POC use case (see the sketch below).
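As a sketch of what such instrumentation could look like (a hypothetical helper, not CTA's logging API), a small RAII timer can record the duration of each object store operation and flag the slow ones, so >1 s lock/fetch times can be correlated with repack load:

```cpp
// Hypothetical instrumentation sketch: time an object store operation and
// emit a structured metric line on scope exit, warning when the duration
// exceeds a threshold. Names and log format are assumptions for illustration.
#include <chrono>
#include <iostream>
#include <string>

class ScopedOperationTimer {
public:
  ScopedOperationTimer(std::string operation, std::chrono::milliseconds warnAbove)
      : m_operation(std::move(operation)),
        m_warnAbove(warnAbove),
        m_start(std::chrono::steady_clock::now()) {}
  ~ScopedOperationTimer() {
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - m_start);
    // Always record the measurement; warn only when it is suspiciously slow.
    std::cout << "metric operation=\"" << m_operation
              << "\" elapsed_ms=" << elapsed.count()
              << (elapsed > m_warnAbove ? " level=WARN" : " level=INFO") << "\n";
  }
private:
  std::string m_operation;
  std::chrono::milliseconds m_warnAbove;
  std::chrono::steady_clock::time_point m_start;
};

int main() {
  {
    ScopedOperationTimer timer("fetchMountInfo.lockRetrieveQueue",
                               std::chrono::milliseconds(1000));
    // ... lock and fetch the retrieve queue here ...
  }  // timer logs the elapsed time on scope exit
  return 0;
}
```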