Aggregate `stagerrm` issues
Over time, we have identified several issues related to `stagerrm` on EOS that need to be fixed.
Most probably, some of them overlap. It is therefore important to have a page where we can compare them all and use that data to plan a unified approach to solving them.
Please add all `stagerrm`-related issues to the following table. In addition, link all related issues here.
Problem (one-line description) | Corresponding tickets | Describe existing problem (current wrong behaviour) | Describe desired behaviour (what it should do to be correct) | Impact (low <-> high) |
---|---|---|---|---|
1. List of `stagerrm` operator issues
1.1. Do not remove file on unresponsive FSTs/broken disks
This affects `stagerrm` as used by operators to remove files from an EOS filesystem that is going away (broken disk or server removal).
Corresponding tickets
- stagerrm issues continued operations#943
  - Issue: file is missing on FST while MGM thinks it is there: FST does not delete anything and MGM keeps referencing the replica as on this FST
  - Workaround: manually drop replica from MGM
- `eos stagerrm` refuses to remove replicas on `drain+failed` FSes operations#865
  - Issue: file is on a broken disk and MGM is somehow unable to check existence of the file. It keeps referencing the replica
  - Workaround: manually drop replica from MGM
1.2. Eviction counter getting in the way of operators
When an operator runs `stagerrm` to clean up a disk (defective or going away), they should not have to run it multiple times to deal with the eviction counter. As it stands, the `stagerrm` command/Rundeck job has to be run once for each increment of the file's `evict_counter`, which can have values such as 18, 25, ...
Corresponding tickets
- `eos stagerrm` refuses to remove replicas on `drain+failed` FSes operations#865
Desired behavior
- The workflow triggered by the operator (Rundeck job launched through Grafana) should be idempotent, such that one execution forces the eviction of all files on the FS, irrespective of their eviction counters.
Impact
Low - Causes operator annoyance and delays in FS draining.
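The difference between the current and the desired operator behaviour in 1.2 can be illustrated with a small in-memory model. This is only a sketch: `DiskReplica`, `stagerrm_once`, `force_evict` and the example path are hypothetical names, not EOS or CTA APIs, and the rule "the replica goes away when the counter reaches zero" is an assumption derived from the description above.

```python
from dataclasses import dataclass


@dataclass
class DiskReplica:
    """Toy stand-in for one disk replica (hypothetical model, not an EOS type)."""
    path: str
    evict_counter: int
    on_disk: bool = True


def stagerrm_once(replica: DiskReplica) -> None:
    """Current behaviour as described in 1.2: one call consumes one counter increment."""
    if not replica.on_disk:
        return
    replica.evict_counter = max(replica.evict_counter - 1, 0)
    if replica.evict_counter == 0:
        # Assumption: the replica is only evicted once the counter hits zero.
        replica.on_disk = False


def force_evict(replica: DiskReplica) -> None:
    """Desired operator behaviour: evict irrespective of the counter, safe to repeat."""
    replica.evict_counter = 0
    replica.on_disk = False


# Current behaviour: the operator has to repeat the call 18 times.
replica = DiskReplica("/eos/example/file", evict_counter=18)  # hypothetical path
calls = 0
while replica.on_disk:
    stagerrm_once(replica)
    calls += 1

# Desired behaviour: one run is enough, and a second run changes nothing.
forced = DiskReplica("/eos/example/file", evict_counter=25)
force_evict(forced)
state_after_first_run = (forced.evict_counter, forced.on_disk)
force_evict(forced)
state_after_second_run = (forced.evict_counter, forced.on_disk)
```

In this model the forced path is idempotent because it sets the final state directly instead of decrementing towards it.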
2. List of `stagerrm` gcd issues
`cta-fst-gcd` calls `stagerrm` to evict older files.
2.1. Evicting files on another filesystem
One FST has a really old file that is not referenced on the MGM. At every run, its fst-gcd tries to `stagerrm` this local file, with no effect, because the MGM thinks no FS besides tape holds a replica.
Once a user recalls the file from tape, it lands brand new on another FST, and the fst-gcd on the old FST triggers deletion of the fresh replica 5 minutes after it has been recalled.
Corresponding tickets
- `cta-fst-gcd` garbage collecting immediately because of orphaned file replicas on the FST operations#864
- cta-fst-gcd garbage collecting wrong files again operations#958
- Garbage Collector Issues #432 (closed)
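The scenario in 2.1 can be reproduced with a toy model of the MGM's replica locations. The sketch below is illustrative only: `MgmView`, `naive_gcd_evict` and `guarded_gcd_evict` are hypothetical names, and the guard (only act when the MGM lists a replica on the issuing FS) is one possible fix, not something the tickets above prescribe.

```python
from typing import Dict, Set

# Hypothetical MGM view: file id -> ids of the disk filesystems holding a replica.
MgmView = Dict[int, Set[int]]


def naive_gcd_evict(mgm: MgmView, fid: int, local_fsid: int) -> bool:
    """Problem case: the eviction is not scoped to the issuing FS (local_fsid is ignored)."""
    locations = mgm.get(fid, set())
    if not locations:
        return False  # MGM knows no disk replica: the request has no effect
    locations.clear()  # whatever disk replica exists is evicted
    return True


def guarded_gcd_evict(mgm: MgmView, fid: int, local_fsid: int) -> bool:
    """Possible guard: only evict if the MGM lists a replica on the issuing FS."""
    locations = mgm.get(fid, set())
    if local_fsid not in locations:
        return False  # local file is orphaned; any MGM replica lives elsewhere
    locations.discard(local_fsid)
    return True


OLD_FST_FS, NEW_FST_FS, FID = 7, 9, 42  # hypothetical ids

# Before the recall: the file is only on tape, so the old FST's request does nothing.
mgm_naive: MgmView = {FID: set()}
before_recall = naive_gcd_evict(mgm_naive, FID, OLD_FST_FS)

# The recall lands the file on another FST; the old FST's next run evicts it.
mgm_naive[FID].add(NEW_FST_FS)
after_recall = naive_gcd_evict(mgm_naive, FID, OLD_FST_FS)

# With the guard, the fresh replica on the other FST survives.
mgm_guarded: MgmView = {FID: {NEW_FST_FS}}
guarded_result = guarded_gcd_evict(mgm_guarded, FID, OLD_FST_FS)
```

The guard does not clean up the orphaned local file on the old FST; that would still need separate handling.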
3. List of `stagerrm` needed improvements
`stagerrm` dealing with 2 replica layout
3.1. As discussed in the operations meeting and reported, for example, at https://gitlab.cern.ch/cta/operations/-/issues/735#moving-spinners-to-2-replica-layout we need to move eosctapublicdisk to 2 replicas and improve `stagerrm` to deal with 2 replicas.
In this case, when an operator or fst-gcd cleans up, the eviction counter should not be decreased if another disk replica exists: only the issuing FS should be cleaned up.
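The 2-replica rule in 3.1 can be sketched as follows. `FileEntry` and `stagerrm_from_fs` are hypothetical names used for illustration only, and the handling of the last remaining disk replica (falling back to the counter-based behaviour) is an assumption that the text above does not specify.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FileEntry:
    """Toy model of a file with an eviction counter and its disk replica locations."""
    evict_counter: int
    disk_fsids: Set[int] = field(default_factory=set)


def stagerrm_from_fs(entry: FileEntry, issuing_fsid: int) -> None:
    """Clean up the issuing FS; touch the counter only for the last disk replica."""
    if issuing_fsid not in entry.disk_fsids:
        return  # nothing to clean up on this FS
    if len(entry.disk_fsids) > 1:
        # Another disk replica exists: drop only the issuing FS, keep the counter.
        entry.disk_fsids.discard(issuing_fsid)
        return
    # Assumption: for the last disk replica the counter-based rule still applies.
    entry.evict_counter = max(entry.evict_counter - 1, 0)
    if entry.evict_counter == 0:
        entry.disk_fsids.discard(issuing_fsid)


entry = FileEntry(evict_counter=2, disk_fsids={3, 5})  # hypothetical fsids

stagerrm_from_fs(entry, 3)  # second replica exists on fsid 5
after_first = (entry.evict_counter, set(entry.disk_fsids))

stagerrm_from_fs(entry, 5)  # last disk replica: counter goes from 2 to 1
after_second = (entry.evict_counter, set(entry.disk_fsids))

stagerrm_from_fs(entry, 5)  # counter reaches 0: last disk replica is evicted
after_third = (entry.evict_counter, set(entry.disk_fsids))
```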