Issue created Jan 16, 2023 by Joao Afonso (@afonso, Owner)

Aggregate `stagerrm` issues

Over time, we have identified several issues related to stagerrm on EOS that need to be fixed.
Most probably, some of them overlap. It is therefore important to have a single page where we can compare all of them and use that data to plan a unified approach to solving them.

Please add all stagerrm-related issues to the following table. In addition, link all related issues here.


| Problem (one-line description) | Corresponding tickets | Describe existing problem (current wrong behaviour) | Describe desired behaviour (what it should do to be correct) | Impact (low <-> high) |
|---|---|---|---|---|

1. List of stagerrm operator issues

1.1. Do not remove file on unresponsive FSTs/broken disks

This affects stagerrm when operators use it to remove files from an EOS filesystem that is going away (broken disk or server removal).

Corresponding tickets

  • stagerrm issues continued operations#943
    • Issue: the file is missing on the FST while the MGM thinks it is there. The FST does not delete anything and the MGM keeps referencing the replica as being on this FST
    • Workaround: manually drop replica from MGM
  • eos stagerrm refuses to remove replicas on drain+failed FSes operations#865
    • Issue: the file is on a broken disk and the MGM is somehow unable to check whether the file exists. It keeps referencing the replica
    • Workaround: manually drop replica from MGM

1.2. Eviction counter getting in the way of operators

When an operator runs stagerrm to clean up a disk (defective or going away), they should not have to run it multiple times to deal with the eviction counter. As it stands, one has to run the stagerrm command/Rundeck job once for each increment of the file's evict_counter, which can have values such as 18, 25, ...

Corresponding tickets

  • eos stagerrm refuses to remove replicas on drain+failed FSes operations#865
    • see @rbachman's comment: https://gitlab.cern.ch/cta/operations/-/issues/865#note_6155378

Desired behavior

  • The workflow triggered by the operator (Rundeck job launched through Grafana) should be idempotent, such that one execution forces the eviction of all files on the FS, irrespective of their eviction counters.

Impact

Low - Causes operator annoyance and delays in FS draining.
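
The desired behaviour above can be illustrated with a minimal Python sketch. It is a toy model only, not EOS or CTA code: `FileEntry`, `evict`, the `force` flag and `drain_filesystem` are hypothetical names, and the counter semantics (one decrement per normal call, eviction at zero) are an assumption based on the description in this section.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FileEntry:
    """Toy stand-in for a file's metadata (hypothetical, not an EOS type)."""
    path: str
    evict_counter: int    # assumed: number of outstanding stage requests
    on_disk: bool = True  # whether a disk replica is still referenced


def evict(entry: FileEntry, force: bool = False) -> bool:
    """Return True if the disk replica is gone after the call."""
    if not entry.on_disk:
        return True  # already evicted, nothing to do
    if force:
        entry.evict_counter = 0  # operator override: ignore the counter
    else:
        entry.evict_counter = max(0, entry.evict_counter - 1)
    if entry.evict_counter == 0:
        entry.on_disk = False
    return not entry.on_disk


def drain_filesystem(entries: list[FileEntry]) -> int:
    """Force-evict every file; return how many replicas remain on disk."""
    for entry in entries:
        evict(entry, force=True)
    return sum(e.on_disk for e in entries)
```

In this model a single `drain_filesystem` call clears files with counters of 18 or 25 alike, and running it a second time changes nothing, which is the idempotence asked for above.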

2. List of stagerrm gcd issues

cta-fst-gcd calls stagerrm to evict older files.

2.1. Evicting files on another filesystem

One FST holds a very old file that is no longer referenced on the MGM. At every run, cta-fst-gcd tries to stagerrm this local file, with no effect, because the MGM thinks no FS besides tape has a replica.

When a user later recalls the file from tape, it lands as a fresh replica on another FST, and cta-fst-gcd on the old FST then triggers deletion of that fresh replica 5 minutes after it has been recalled.

Corresponding tickets

  • cta-fst-gcd garbage collecting immediately because of orphaned file replicas on the FST operations#864
  • cta-fst-gcd garbage collecting wrong files again operations#958
  • Garbage Collector Issues #432
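
One possible guard against this, sketched in Python under stated assumptions: before asking for an eviction, the garbage collector checks that the MGM actually lists a replica on its own filesystem. The function name, the `local_fsid` parameter and the `mgm_locations` mapping are all hypothetical; how the replica locations would be obtained from the MGM is left out, and this is not how cta-fst-gcd is currently implemented.

```python
def split_evictable(local_fsid, local_paths, mgm_locations):
    """Split local files into (evictable, orphaned).

    local_fsid    -- id of the filesystem the garbage collector runs on
    local_paths   -- files found on that filesystem
    mgm_locations -- hypothetical mapping: path -> set of disk fsids that
                     the MGM believes hold a replica
    """
    evictable, orphaned = [], []
    for path in local_paths:
        if local_fsid in mgm_locations.get(path, set()):
            evictable.append(path)  # MGM agrees the replica lives here
        else:
            orphaned.append(path)   # local leftover: do not call stagerrm
    return evictable, orphaned
```

With such a check, the orphaned old file would be reported for local cleanup instead of triggering a stagerrm that hits the freshly recalled replica on another FST.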

3. List of stagerrm needed improvements

3.1. stagerrm dealing with 2 replica layout

As discussed in the operations meeting and reported, for example, in https://gitlab.cern.ch/cta/operations/-/issues/735#moving-spinners-to-2-replica-layout, we need to move eosctapublicdisk to 2 replicas and improve stagerrm to deal with 2 replicas.

In this case, when an operator or fst-gcd cleans up, the eviction counter should not be decreased if another disk replica exists: only the issuing FS should be cleaned up.
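
The rule above can be sketched as follows. This is a toy Python model, not EOS code: `StagedFile` and `evict_from_fs` are hypothetical names, and the behaviour for the last remaining disk replica (decrement the counter, evict at zero) is an assumption carried over from the single-replica description earlier in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class StagedFile:
    """Toy model of a file with disk replicas (hypothetical type)."""
    evict_counter: int
    disk_fsids: set = field(default_factory=set)  # FS ids holding a replica


def evict_from_fs(f: StagedFile, fsid: int) -> bool:
    """Try to drop the replica on `fsid`; return True if it was removed."""
    if fsid not in f.disk_fsids:
        return False  # nothing on the issuing FS
    if len(f.disk_fsids) > 1:
        # Another disk replica exists: clean up only the issuing FS
        # and leave the eviction counter untouched.
        f.disk_fsids.discard(fsid)
        return True
    # Last disk replica: assumed to follow the usual counter semantics.
    f.evict_counter = max(0, f.evict_counter - 1)
    if f.evict_counter == 0:
        f.disk_fsids.discard(fsid)
        return True
    return False
```

For a file with replicas on FS 1 and FS 2 and a counter of 3, evicting from FS 1 removes that replica and leaves the counter at 3; only an eviction against the last replica touches the counter.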

Edited Jan 17, 2023 by Joao Afonso