Review severe operational issues of Sep/Oct 2023
Introduction
During the last couple of weeks we had several operational issues that severely degraded the CTA service. The purpose of this issue is to go through them all at once and evaluate the dev approach being taken to fix them.
Some of the issues:
1. Missing archive/retrieve queue shards
A. Continuous stream of empty mounts
A missing shard in the archive queue causes an unhandled exception during mounting. Archival is impossible unless the Public VO is entirely disabled!
Ops issue reference:
Operational fix:
- Hand-craft an archive queue shard to insert into the object store (same address, with valid values!); a minimal illustration follows this list. Multiple shards created by Michael, Joao and Julien were tried.
- Try to insert the shard and re-enable Public VO mounts. The last shard, created by Julien, worked.
- Let the shard be drained...
- (!) At the end, the queue was not fully removed.
- @jleduc removed the last bits manually, and reinjected the EOS archive jobs (https://gitlab.cern.ch/cta/operations/-/work_items/1193).
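For illustration only, here is a minimal sketch of the idea behind the workaround: the queue still references a shard address that no longer exists, so an empty-but-valid object is recreated at that exact address so the reference resolves again. The types, addresses and serialization below are hypothetical, not the actual CTA objectstore classes or format.

```cpp
// Hypothetical model of a key-value object store holding serialized queue
// shards; not real CTA code. It only shows why recreating an object at the
// dangling address lets the queue be traversed and drained again.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

struct ObjectStore {
  std::map<std::string, std::string> objects;  // address -> serialized blob
  std::string read(const std::string& address) const {
    auto it = objects.find(address);
    if (it == objects.end()) throw std::runtime_error("no such object: " + address);
    return it->second;
  }
  void write(const std::string& address, const std::string& blob) { objects[address] = blob; }
};

int main() {
  ObjectStore store;
  // The queue object still points at this shard address, but the shard is gone.
  const std::string danglingShardAddress = "ArchiveQueueShard-frontend-123";
  try {
    store.read(danglingShardAddress);  // what the mount logic effectively did
  } catch (const std::exception& ex) {
    std::cerr << "unhandled before the workaround: " << ex.what() << "\n";
  }
  // Operational workaround: hand-craft an empty-but-valid shard at the same
  // address so the reference resolves and the queue can be drained/removed.
  store.write(danglingShardAddress, "{\"jobs\":[]}");
  std::cout << "shard recreated: " << store.read(danglingShardAddress) << "\n";
  return 0;
}
```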
B. Continuous stream of empty staging mounts
A missing shard in the retrieve queue causes an unhandled exception during retrieve request queueing, which crashes the tape server (maintenance process) during requeuing.
Ops issue reference:
Operational fix:
- Move tape state to BROKEN, to allow it to be drained.
- Delete the queue and dereference it manually.
Permanent dev fix
The operational approach (e.g. crafting and inserting new shards) allowed us to fix the issue without any development work. However, we have identified several issues in the code, which were tackled:
#500 (closed) Safely handle empty shards in object store
Two main problems:
- Object store code could not handle missing shards. An unhandled exception was raised.
- A shard was deleted before dereferencing from the queue objects. Therefore, any exception would cause a dangling reference.
Fixes done (brief summary):
- Ignore missing shards during job dumping.
- Remove/clean missing shards during job popping.
- Log missing shard ERRORs.
- Remove the reference before deleting the shard object (see the sketch after this list).
- Add unit test.
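As a rough illustration of that last ordering fix, here is a minimal hypothetical sketch; it does not use the real CTA objectstore API, it only shows why dereferencing before deleting turns a dangling reference into, at worst, an orphaned object:

```cpp
// Hypothetical sketch of the ordering fix (not the actual CTA objectstore
// API): remove the queue's reference to a shard *before* deleting the shard
// object, so that a failure between the two steps leaves an unreferenced
// (harmless) object instead of a dangling reference to a missing shard.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Backend {
  void deleteObject(const std::string& address) {
    std::cout << "deleted object " << address << "\n";
  }
};

struct Queue {
  std::vector<std::string> shardAddresses;
  void removeShardReference(const std::string& address) {
    shardAddresses.erase(
        std::remove(shardAddresses.begin(), shardAddresses.end(), address),
        shardAddresses.end());
  }
  void commit() {
    std::cout << "queue committed with " << shardAddresses.size()
              << " shard reference(s)\n";
  }
};

// Old ordering (simplified): delete first, then dereference. If anything fails
// after the delete, the queue keeps pointing at a shard that no longer exists.
void removeShardUnsafe(Backend& backend, Queue& queue, const std::string& shard) {
  backend.deleteObject(shard);
  queue.removeShardReference(shard);
  queue.commit();
}

// Fixed ordering: dereference and commit first, then delete. A failure now
// leaves, at worst, an orphaned shard that garbage collection can clean up.
void removeShardSafe(Backend& backend, Queue& queue, const std::string& shard) {
  queue.removeShardReference(shard);
  queue.commit();
  backend.deleteObject(shard);
}

int main() {
  Backend backend;
  Queue queue{{"ArchiveQueueShard-1", "ArchiveQueueShard-2"}};
  removeShardSafe(backend, queue, "ArchiveQueueShard-1");
  removeShardUnsafe(backend, queue, "ArchiveQueueShard-2");
  return 0;
}
```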
Fixes not done (to discuss):
- Handle missing shard during job queueing.
  - Easy code in archive queue logic, but very complex in retrieve queue logic.
  - Redundant after the fixes presented above (dereferencing before deleting).
  - High risk, low reward.
#503 Improve handling of try-catch blocks of 'cta::exception::Exception'
Problem description:
- All these shard-related exceptions contain a backtrace (`cta::exception::Exception`), but the backtrace was not being properly logged.
- Logging the backtrace would have made debugging much easier.
- The way we handle exceptions follows several bad practices, as shown by SonarCloud (check link).
Proposed fix:
- Fix SonarCloud errors.
- Replace the throwing/catching of `cta::exception::Exception` with more meaningful exceptions.
- When catching `cta::exception::Exception` is necessary (e.g. at the top of a thread/process stack, before letting the thread/process fail), we should log the backtrace. No unhandled backtrace should be lost (see the sketch below).
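A minimal, self-contained sketch of the proposed pattern follows. The exception class and its backtrace accessor are made up for illustration; they are not the real `cta::exception::Exception` API.

```cpp
// Sketch: throw/catch specific exceptions where possible, and when the
// generic base has to be caught at the top of a thread's stack, log its
// backtrace so it is never silently lost. All types here are illustrative.
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

// Stand-in for an exception class that carries a backtrace captured at throw time.
class TracedException : public std::runtime_error {
public:
  TracedException(std::string what, std::string backtrace)
      : std::runtime_error(std::move(what)), m_backtrace(std::move(backtrace)) {}
  const std::string& backtrace() const { return m_backtrace; }
private:
  std::string m_backtrace;
};

// A more meaningful, specific exception that callers can handle deliberately.
class MissingShardException : public TracedException {
public:
  using TracedException::TracedException;
};

void doWork() {
  throw MissingShardException("shard not found in object store",
                              "frame#0 ...\nframe#1 ...");
}

void threadMain() {
  try {
    doWork();
  } catch (const MissingShardException& ex) {
    // Specific handling where it makes sense (e.g. skip/clean the shard).
    std::cerr << "handled missing shard: " << ex.what() << "\n";
  } catch (const TracedException& ex) {
    // Top-of-thread catch of the generic base: log message *and* backtrace
    // before letting the thread fail, so no backtrace is lost.
    std::cerr << "unexpected exception: " << ex.what() << "\nbacktrace:\n"
              << ex.backtrace() << "\n";
  } catch (const std::exception& ex) {
    std::cerr << "unexpected std::exception: " << ex.what() << "\n";
  }
}

int main() {
  std::thread t(threadMain);
  t.join();
  return 0;
}
```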
2. Several lock-related errors in production taped servers
Explicitly unlocking an already unlocked lock throws an exception. This happened during the garbage collection of repack retrieve requests, causing a tape server crash on every requeuing. The queue took hours to be GC'ed.
Ops issue reference:
Fix done:
- #460 (comment 7109417) Fix "trying to unlock an unlocked lock" error
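To make the failure mode concrete, here is a small illustrative sketch (not the actual fix in #460) of why explicitly unlocking an already unlocked lock throws, and how an idempotent release avoids crashing the caller:

```cpp
// Illustrative only: a scoped lock that remembers whether it still owns the
// lock, so releasing an already released lock can be a no-op instead of an
// exception that takes down the requeue/GC path.
#include <iostream>
#include <stdexcept>

class ScopedLock {
public:
  explicit ScopedLock(bool acquire = true) : m_locked(acquire) {}
  ~ScopedLock() {
    // Destructor must never throw: only release if we still hold the lock.
    if (m_locked) releaseUnchecked();
  }
  void release() {
    if (!m_locked)
      throw std::logic_error("trying to unlock an unlocked lock");  // old behaviour
    releaseUnchecked();
  }
  void releaseIfHeld() {
    // Idempotent variant: safe to call even if already released.
    if (m_locked) releaseUnchecked();
  }
private:
  void releaseUnchecked() {
    m_locked = false;
    std::cout << "lock released\n";
  }
  bool m_locked;
};

int main() {
  ScopedLock lock;
  lock.release();        // first release is fine
  lock.releaseIfHeld();  // no-op instead of throwing on the second attempt
  return 0;
}
```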
3. (Service degradation) Slow object store requests
Ops issue:
We got a very high number of these warnings:
`In OStoreDB::fetchMountInfo(): fetched a retrieve queue and that lasted more than 1 second`
- Logs show that taking a lock or object could take >20 seconds, sometimes more.
- Service (e.g. `cta-admin`) was very slow.
- Might be the original cause of the missing shards.
- Unclear what caused the slowdown, but it seems related to repacking...
Dev approach (to discuss):
- It's unclear why repack caused this (if it was repack at all!).
- Inspecting the logs is not helpful enough.
- To find the cause, we need a better metric logging system. It's important to move forward with this:
  - #266
  - Measuring the object store operations is a great POC use case (see the sketch below).
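As a sketch of what such instrumentation could look like (a hypothetical helper, not CTA's logging API), a small RAII timer can record the duration of each object store operation and flag the slow ones, so >1 s lock/fetch times can be correlated with repack load:

```cpp
// Hypothetical instrumentation sketch: time an object store operation and
// emit a structured metric line on scope exit, warning when the duration
// exceeds a threshold. Names and log format are assumptions for illustration.
#include <chrono>
#include <iostream>
#include <string>

class ScopedOperationTimer {
public:
  ScopedOperationTimer(std::string operation, std::chrono::milliseconds warnAbove)
      : m_operation(std::move(operation)),
        m_warnAbove(warnAbove),
        m_start(std::chrono::steady_clock::now()) {}
  ~ScopedOperationTimer() {
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - m_start);
    // Always record the measurement; warn only when it is suspiciously slow.
    std::cout << "metric operation=\"" << m_operation
              << "\" elapsed_ms=" << elapsed.count()
              << (elapsed > m_warnAbove ? " level=WARN" : " level=INFO") << "\n";
  }
private:
  std::string m_operation;
  std::chrono::milliseconds m_warnAbove;
  std::chrono::steady_clock::time_point m_start;
};

int main() {
  {
    ScopedOperationTimer timer("fetchMountInfo.lockRetrieveQueue",
                               std::chrono::milliseconds(1000));
    // ... lock and fetch the retrieve queue here ...
  }  // timer logs the elapsed time on scope exit
  return 0;
}
```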