Closed
Open
Issue created Dec 17, 2020 by Cedric Caffy (@ccaffy, Maintainer)

Revise the cleanup logic of the tapeserver

In the issue "Possible SCSI errors on tpsrv311 I3601444", we saw that a tape took more than 60 seconds to mount, hit the timeout, and therefore failed to be mounted.

The jobs were popped from the queue and failed, but the cleaner of the TapeWriteSingleThread did not eject the tape:

[1608174151.288414000] Dec 17 04:02:31.288414 tpsrv311 cta-taped: LVL="INFO" PID="16883" TID="11837" MSG="TapeReadSingleThread: No tape to unload" thread="TapeWrite" tapeDrive="I3601444" tapeVid="I70358" mountId="45752"

(Note: the log message contains a typo: TapeReadSingleThread should be TapeWriteSingleThread.)

Here is the code of the TapeWriteSingleThread::TapeCleaning::~TapeCleaning() method:

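    // Wait for the drive to become ready; a timeout is caught and ignored.
    try {
      m_this.m_drive.waitUntilReady(waitMediaInDriveTimeout);
    } catch (cta::exception::TimeOut &) {}
    // If the drive reports no tape, the unload is skipped entirely.
    // (The log string below carries the TapeReadSingleThread typo noted above.)
    if (!m_this.m_drive.hasTapeInPlace()) {
      m_this.m_logContext.log(cta::log::INFO, "TapeReadSingleThread: No tape to unload");
      goto done;
    }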

However, the problematic drive from the issue "Possible SCSI errors on tpsrv311 I3601444" still had the tape in it:

[root@tpsrv311 ~]# mt -f /dev/nst0 status
SCSI 2 tape drive:
File number=0, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x57 (no translation).
Soft error count since last status=0
General status bits on (41010000):
 BOT ONLINE IM_REP_EN
[root@tpsrv311 ~]# 

So the drive did not report to the tapeserver that the tape was still in place.
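For reference, the status word 41010000 printed by mt above can be decoded by hand. The sketch below is only an illustration: the masks are copied from the GMT_* macros in the Linux header <linux/mtio.h>, while the helper names (decodeStatus, tapeLooksLoaded) are hypothetical and not part of CTA. Treating "ONLINE set and DR_OPEN clear" as "a tape is loaded" is our reading of the flags, not a rule taken from the tapeserver code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Masks mirroring the GMT_* macros in <linux/mtio.h>.
constexpr std::uint32_t kBot     = 0x40000000;  // GMT_BOT: at beginning of tape
constexpr std::uint32_t kOnline  = 0x01000000;  // GMT_ONLINE: drive is online
constexpr std::uint32_t kDrOpen  = 0x00040000;  // GMT_DR_OPEN: door open / no tape
constexpr std::uint32_t kImRepEn = 0x00010000;  // GMT_IM_REP_EN: immediate report mode

// Hypothetical helper: names of the flags (from the subset above) set in gstat.
std::vector<std::string> decodeStatus(std::uint32_t gstat) {
  struct Flag { std::uint32_t mask; const char* name; };
  const Flag flags[] = {
    {kBot, "BOT"}, {kOnline, "ONLINE"}, {kDrOpen, "DR_OPEN"}, {kImRepEn, "IM_REP_EN"},
  };
  std::vector<std::string> names;
  for (const Flag& f : flags) {
    if (gstat & f.mask) names.push_back(f.name);
  }
  return names;
}

// Hypothetical helper: assumes ONLINE without DR_OPEN means a tape is loaded.
bool tapeLooksLoaded(std::uint32_t gstat) {
  return (gstat & kOnline) != 0 && (gstat & kDrOpen) == 0;
}
```

Calling decodeStatus(0x41010000) yields BOT, ONLINE and IM_REP_EN, matching the mt output, and tapeLooksLoaded(0x41010000) returns true under the stated assumption.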

The tapeserver CleanerSession did not kick in either.

The cleanup logic of the tapeserver has to be revised.

The following algorithm could be used:

  1. Ask the library what's in the drive.
  2. Ask the drive for its state.
  3. If the library says there's a tape and the drive says it's empty, then a tape may still be in the process of being loaded, so wait 5 minutes.
  4. If the tape drive is not empty then eject.
  5. If the eject fails then put the drive Down and DISABLE the tape.
  6. Query the library for what might be hanging out of the door of the drive.
  7. If there is something hanging out of the drive then ask the library to put it back in its storage slot.
  8. If returning the tape to its storage slot fails then put the drive Down and DISABLE the tape.
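
The steps above could be sketched roughly as follows. This is only an outline under assumptions, not the tapeserver implementation: Library, Drive, CleanupResult and cleanDrive are hypothetical names, the library and drive queries are modelled as plain callbacks, and the wait is injected so that the 5-minute pause can be replaced in a test. Re-querying the drive after the wait is also our assumption; the list does not say so explicitly.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// All types and names here are hypothetical; none of them are CTA APIs.
enum class CleanupResult { Clean, DriveDownTapeDisabled };

struct Library {
  std::function<bool()> reportsTapeInDrive;     // step 1
  std::function<bool()> tapeHangingOutOfDrive;  // step 6
  std::function<bool()> returnTapeToSlot;       // step 7, returns false on failure
};

struct Drive {
  std::function<bool()> hasTapeInPlace;  // step 2
  std::function<bool()> eject;           // step 4, returns false on failure
};

CleanupResult cleanDrive(const Library& library, const Drive& drive,
                         const std::function<void(std::chrono::minutes)>& wait) {
  const bool librarySeesTape = library.reportsTapeInDrive();  // step 1
  bool driveHasTape = drive.hasTapeInPlace();                 // step 2

  // Step 3: the tape may still be loading, so give it 5 minutes.
  if (librarySeesTape && !driveHasTape) {
    wait(std::chrono::minutes(5));
    driveHasTape = drive.hasTapeInPlace();  // assumption: ask the drive again
  }

  // Steps 4 and 5: eject; on failure the caller puts the drive Down
  // and DISABLEs the tape.
  if (driveHasTape && !drive.eject()) {
    return CleanupResult::DriveDownTapeDisabled;
  }

  // Steps 6 to 8: return anything left in the drive door to its slot.
  if (library.tapeHangingOutOfDrive() && !library.returnTapeToSlot()) {
    return CleanupResult::DriveDownTapeDisabled;
  }
  return CleanupResult::Clean;
}
```

Returning a result instead of changing the drive and tape states directly keeps the sketch free of any queueing or catalogue calls; the caller would map DriveDownTapeDisabled onto whatever mechanism the tapeserver uses for that.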