Aggregate `stagerrm` issues

Over time, we have identified several issues related to stagerrm on EOS that need to be fixed.
Some of them most probably overlap. It is therefore important to have a page where we can compare all of them and use that data to plan a unified approach to solving them.

Please fill in the following table with all stagerrm-related issues. In addition, link all related issues here.

| Problem (one-line description) | Corresponding tickets | Existing problem (current wrong behaviour) | Desired behaviour (what it should do to be correct) | Impact (low <-> high) |
|---|---|---|---|---|

1. List of `stagerrm` operator issues

1.1. Do not remove file on unresponsive FSTs/broken disks

This affects stagerrm when operators use it to remove files from an EOS filesystem that is going away (broken disk or server removal).

Corresponding tickets

stagerrm issues continued operations#943
- Issue: the file is missing on the FST while the MGM thinks it is there. The FST does not delete anything and the MGM keeps referencing the replica as being on this FST
- Workaround: manually drop replica from MGM
eos stagerrm refuses to remove replicas on drain+failed FSes operations#865
- Issue: the file is on a broken disk and the MGM is unable to check whether the file exists. It keeps referencing the replica
- Workaround: manually drop replica from MGM

1.2. Eviction counter getting in the way of operators

When an operator runs stagerrm to clean up a disk (defective or going away), they should not have to run it multiple times to deal with the eviction counter. As it stands, the stagerrm command/Rundeck job has to be run once for each increment of the file's evict_counter, which can have values such as 18, 25, ...

Corresponding tickets

eos stagerrm refuses to remove replicas on drain+failed FSes operations#865
- see @rbachman comment: https://gitlab.cern.ch/cta/operations/-/issues/865#note_6155378

Desired behavior

The workflow triggered by the operator (Rundeck job launched through Grafana) should be idempotent, such that one execution forces the eviction of all files on the FS, irrespective of their eviction counters.

Impact

Low - Causes operator annoyance and delays in FS draining.
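
The difference between the current and the desired behaviour of 1.2 can be sketched with a toy model. This is only an illustration under stated assumptions: `DiskReplica`, `evict` and `drain_filesystem` are hypothetical names, not part of EOS or CTA, and the counter semantics (one decrement per run, removal at zero) are simplified from the description above.

```python
from dataclasses import dataclass


@dataclass
class DiskReplica:
    """Hypothetical stand-in for a disk replica and its eviction counter."""
    path: str
    evict_counter: int
    on_disk: bool = True


def evict(replica: DiskReplica, force: bool = False) -> None:
    """Evict a replica; with force=True the eviction counter is ignored."""
    if not replica.on_disk:
        return  # already evicted: running again changes nothing
    if force:
        replica.evict_counter = 0
    else:
        # Simplified current behaviour: one run removes one increment.
        replica.evict_counter = max(0, replica.evict_counter - 1)
    if replica.evict_counter == 0:
        replica.on_disk = False


def drain_filesystem(replicas: list[DiskReplica]) -> int:
    """Force-evict every replica on a filesystem; return how many remain."""
    for replica in replicas:
        evict(replica, force=True)
    return sum(r.on_disk for r in replicas)
```

In this model a second call to `drain_filesystem` leaves the state unchanged, which is the idempotency the operator workflow asks for.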

2. List of `stagerrm` gcd issues

cta-fst-gcd calls stagerrm to evict older files.

2.1. Evicting files on another filesystem

One FST has a very old file that is not referenced on the MGM. At every run, its fst-gcd tries to stagerrm the local file, with no effect, as the MGM thinks no FS besides tape has a replica.

When a user recalls the file from tape, it lands fresh on another FST, and the fst-gcd on the old FST then triggers deletion of the new replica 5 minutes after it has been recalled.

Corresponding tickets

cta-fst-gcd garbage collecting immediately because of orphaned file replicas on the FST operations#864
cta-fst-gcd garbage collecting wrong files again operations#958
Garbage Collector Issues #432 (closed)
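
The failure mode of 2.1 can be reproduced with a small toy model. It assumes, as the description above suggests, that the eviction request is path-based and affects whatever disk replica the namespace knows about. `Mgm`, `gcd_evict` and the `check_location` guard are hypothetical and only show one possible way to avoid the problem; they are not an agreed fix or existing code.

```python
class Mgm:
    """Toy namespace: path -> set of filesystem IDs holding a disk replica."""

    def __init__(self) -> None:
        self.disk_replicas: dict[str, set[int]] = {}

    def stagerrm(self, path: str) -> None:
        # Assumed path-based eviction: drops every known disk replica.
        self.disk_replicas.pop(path, None)

    def has_replica_on(self, path: str, fsid: int) -> bool:
        return fsid in self.disk_replicas.get(path, set())


def gcd_evict(mgm: Mgm, path: str, local_fsid: int, check_location: bool) -> bool:
    """Return True if an eviction request was sent to the MGM."""
    if check_location and not mgm.has_replica_on(path, local_fsid):
        # The local file is an orphan: the namespace does not list it here,
        # so evicting by path could only hit a replica on another filesystem.
        return False
    mgm.stagerrm(path)
    return True


mgm = Mgm()
mgm.disk_replicas["/eos/file"] = {2}  # fresh recall landed on filesystem 2
# The gcd on filesystem 1 still holds an old orphaned copy of the same file.
gcd_evict(mgm, "/eos/file", local_fsid=1, check_location=False)
fresh_replica_survived = mgm.has_replica_on("/eos/file", 2)
```

Without the guard, `fresh_replica_survived` ends up `False`, matching the behaviour reported in the tickets above.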

3. List of `stagerrm` needed improvements

3.1. `stagerrm` dealing with 2 replica layout

As discussed in the operations meeting, we need to move eosctapublicdisk to 2 replicas and improve stagerrm to deal with 2 replicas. See for example: https://gitlab.cern.ch/cta/operations/-/issues/735#moving-spinners-to-2-replica-layout

In this case, when an operator or fst-gcd cleans up, the eviction counter should not be decreased if another disk replica exists: only the replica on the issuing FS should be cleaned up.
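
The rule above can be sketched as follows. `FileEntry` and `evict_from` are hypothetical names, and the handling of the last remaining disk replica (counter-based, as today) is an assumption added to make the sketch complete.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Hypothetical namespace entry for a file with disk replicas."""
    evict_counter: int
    disk_fsids: set[int] = field(default_factory=set)


def evict_from(entry: FileEntry, fsid: int) -> None:
    """Evict the replica on the issuing filesystem `fsid` only."""
    if fsid not in entry.disk_fsids:
        return  # nothing on the issuing filesystem, nothing to do
    if entry.disk_fsids - {fsid}:
        # Another disk replica exists: clean up the issuing FS only
        # and leave the eviction counter untouched.
        entry.disk_fsids.discard(fsid)
        return
    # Assumption: the last disk replica follows the counter-based logic.
    entry.evict_counter = max(0, entry.evict_counter - 1)
    if entry.evict_counter == 0:
        entry.disk_fsids.discard(fsid)
```

For a file with replicas on filesystems 1 and 2 and a counter of 3, `evict_from(entry, 1)` leaves the replica on filesystem 2 and the counter at 3.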

Edited Jan 17, 2023 by Joao Afonso
