Revise the cleanup logic of the tapeserver
In the issue "Possible SCSI errors on tpsrv311 I3601444", we saw that a tape took more than 60 seconds to time out and therefore failed to be mounted.
The jobs from the queue were popped and failed, but the cleaner of the TapeWriteSingleThread did not eject the tape:
[1608174151.288414000] Dec 17 04:02:31.288414 tpsrv311 cta-taped: LVL="INFO" PID="16883" TID="11837" MSG="TapeReadSingleThread: No tape to unload" thread="TapeWrite" tapeDrive="I3601444" tapeVid="I70358" mountId="45752"
(Note: there is a typo in the logs: TapeReadSingleThread should be TapeWriteSingleThread.)
Here is the code of the TapeWriteSingleThread::TapeCleaning::~TapeCleaning() method:
    try {
      m_this.m_drive.waitUntilReady(waitMediaInDriveTimeout);
    } catch (cta::exception::TimeOut &) {}
    if (!m_this.m_drive.hasTapeInPlace()) {
      m_this.m_logContext.log(cta::log::INFO, "TapeReadSingleThread: No tape to unload");
      goto done;
    }
We saw that the problematic drive from the issue "Possible SCSI errors on tpsrv311 I3601444" still had the tape in it:
[root@tpsrv311 ~]# mt -f /dev/nst0 status
SCSI 2 tape drive:
File number=0, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x57 (no translation).
Soft error count since last status=0
General status bits on (41010000):
BOT ONLINE IM_REP_EN
[root@tpsrv311 ~]#
So the drive did not report to the tapeserver that the tape was still in place!
The tapeserver CleanerSession did not kick in either.
The cleanup logic of the tapeserver has to be revised.
The following algorithm could be used:
- Ask the library what's in the drive.
- Ask the drive for its state.
- If the library says there's a tape and the drive says it's empty, then a tape may be in the process of being loaded so wait 5 minutes.
- If the tape drive is not empty then eject.
- If eject fails then put drive Down and DISABLE the tape.
- Query the library for what might be hanging out the door of the drive.
- If there is something hanging out the drive then ask the library to put it back in its storage slot.
- If returning the tape to its storage slot fails then put drive Down and DISABLE the tape.
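
The steps above could be sketched roughly as follows. This is only an illustration of the proposed control flow, not existing CTA code: CleanupHooks, CleanupResult, cleanDrive and every callback name are hypothetical stand-ins for the real library and drive interfaces. The sketch also assumes that the 5-minute wait is done by polling the drive at a fixed interval, and that if the drive still reports empty after the wait, the cleanup simply goes on to the library query. Neither point is specified in the list above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical outcome of one cleanup pass (not a CTA type).
enum class CleanupResult { DriveEmpty, TapeReturned, DriveDownTapeDisabled };

// Hypothetical hooks standing in for the real library and drive interfaces.
struct CleanupHooks {
  std::function<bool()> libraryReportsTapeInDrive;  // what the library thinks is in the drive
  std::function<bool()> driveReportsTape;           // what the drive itself reports
  std::function<bool()> ejectTape;                  // false if the eject failed
  std::function<bool()> tapeHangingOutOfDrive;      // library view of the drive door
  std::function<bool()> returnTapeToSlot;           // false if the move failed
  std::function<void()> putDriveDownAndDisableTape; // last-resort action
  std::function<void(std::chrono::seconds)> sleepFor;
};

CleanupResult cleanDrive(const CleanupHooks& h,
                         std::chrono::seconds loadGrace = std::chrono::minutes(5),
                         std::chrono::seconds pollInterval = std::chrono::seconds(10)) {
  assert(pollInterval.count() > 0);  // a zero interval would never reach loadGrace

  // Ask the library what is in the drive, then ask the drive for its state.
  const bool libraryView = h.libraryReportsTapeInDrive();
  bool driveView = h.driveReportsTape();

  // Library says "tape", drive says "empty": a load may be in progress.
  // Assumption: poll the drive until it sees the tape or loadGrace elapses.
  std::chrono::seconds waited{0};
  while (libraryView && !driveView && waited < loadGrace) {
    h.sleepFor(pollInterval);
    waited += pollInterval;
    driveView = h.driveReportsTape();
  }

  // If the drive is not empty, eject; on failure give up on drive and tape.
  if (driveView && !h.ejectTape()) {
    h.putDriveDownAndDisableTape();
    return CleanupResult::DriveDownTapeDisabled;
  }

  // Ask the library whether something is hanging out of the drive door.
  if (h.tapeHangingOutOfDrive()) {
    if (!h.returnTapeToSlot()) {
      h.putDriveDownAndDisableTape();
      return CleanupResult::DriveDownTapeDisabled;
    }
    return CleanupResult::TapeReturned;
  }
  return CleanupResult::DriveEmpty;
}
```

Passing the sleep in as a callback keeps the sketch testable without real delays. In the tapeserver the hooks would presumably be bound to the drive object and the library (mount/dismount) interface.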