Cataloging reasons for tape session failures

During Recall/Migration tape sessions, many errors can occur. The current error handling logic of the tape sessions is not ideal: in most cases it simply stops the tape session in progress, even when there are better alternatives for handling the error. For example, in https://gitlab.cern.ch/cta/CTA/-/issues/1096#note_5142172, a mismatch between the checksum of the file and the checksum calculated in the session stops the session in progress, instead of reporting the error to the user and continuing the session. We should separate the errors we can simply report from those that warrant a canceled tape session (if any).

Please update this ticket description with any errors you have come across and relevant links.

Classification of errors

Errors reading file from disk

disk system unavailable or network issues - retryable
file does not exist (e.g. disk write failed/FST executes delete on close) - not retryable
bad file on disk - wrong size - not retryable??
bad file on disk - bad checksum - retryable
I/O error during transfer (disk file is OK but transfer was interrupted) - retry
I/O error during transfer (disk file is OK but data was corrupted in-flight) - retry
Error while reading a file: memory block not filled up, but the file is not fully read yet - retry

Errors writing file to tape

Wrong checksum of tape file written (i.e. if the checksum of the file written to tape is different from the one read from disk). This is detected during a tape flush. A batch of jobs is validated and added to the catalogue. If one job has a wrong checksum, the whole batch is failed. - retry/don't unmount (done)
Failed to start tape write session - not a part of retry logic
Drive is write protected. - not a part of retry logic
Aborting write session in presence of critical tape alerts - not a part of retry logic
Drive encryption could not be enabled for this mount - not a part of retry logic
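
The flush-time checksum validation described in the first item of this list can be illustrated with a minimal Python sketch. This is not CTA code: the `Job` class, the `flush_batch` function and the list standing in for the catalogue are all hypothetical names chosen for illustration. It only shows the all-or-nothing behaviour described above, where one bad checksum fails the whole batch.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical archive job; field names are assumptions for this sketch."""
    name: str
    disk_checksum: int  # checksum computed while reading the file from disk
    tape_checksum: int  # checksum computed while writing the file to tape


def flush_batch(jobs, catalogue):
    """Validate a batch at flush time and add it to `catalogue` all-or-nothing.

    Returns (ok, bad_names). If any job has a mismatching checksum, nothing
    is added to the catalogue and the whole batch is reported as failed.
    """
    bad = [job.name for job in jobs if job.disk_checksum != job.tape_checksum]
    if bad:
        # Whole batch fails; the caller may requeue the jobs for retry
        # without unmounting the tape.
        return False, bad
    catalogue.extend(job.name for job in jobs)
    return True, []
```

In this sketch the caller decides what to do with a failed batch; the "retry/don't unmount" behaviour noted above would live in that caller.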

Errors reading file from tape

#630 Files stuck in queue after tape state change - race condition where file is queued on a tape which is OK but tape changes to disabled before the tape mount. The request is neither failed nor retried, it gets stuck.
Tape's label is either missing or not valid - fatal error, not retryable
Drive encryption could not be enabled for this mount - outside retry logic
Session corrupted (positioning error) - retry

Errors writing file to disk

desynchronization between tape read and disk write - retry
failed to open disk file for writing - retry
failed to write payload to file - retry
failed to close the file - retry
Mismatch between expected and received fileid or blockid - retry

Any errors that don't fall into the above categories

Exception when unmounting/unloading the tape - session finished, nothing to retry
Drive is not found - tape server process should not be restarted if configuration is bad
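
One possible way to record the classification above is a simple lookup table. The Python sketch below is illustrative only: the error keys, the `Action` enum and the `classify` function are hypothetical names, not CTA identifiers, and only a subset of the errors listed above is included. The fallback for unlisted errors is an assumption, not something decided in this ticket.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    NOT_RETRYABLE = "not retryable"
    OUTSIDE_RETRY_LOGIC = "not a part of retry logic"


# Hypothetical error keys; the actions mirror the lists in this ticket.
ERROR_ACTIONS = {
    # Errors reading file from disk
    "disk_system_unavailable": Action.RETRY,
    "disk_file_missing": Action.NOT_RETRYABLE,
    "disk_file_bad_checksum": Action.RETRY,
    "disk_transfer_io_error": Action.RETRY,
    # Errors writing file to tape
    "tape_write_bad_checksum": Action.RETRY,
    "drive_write_protected": Action.OUTSIDE_RETRY_LOGIC,
    "encryption_not_enabled": Action.OUTSIDE_RETRY_LOGIC,
    # Errors reading file from tape
    "tape_label_invalid": Action.NOT_RETRYABLE,
    "tape_positioning_error": Action.RETRY,
    # Errors writing file to disk
    "disk_open_for_write_failed": Action.RETRY,
    "disk_close_failed": Action.RETRY,
}


def classify(error_key):
    """Return the action for a known error key.

    Assumption for this sketch: unlisted errors are treated as outside the
    retry logic, so they are never retried silently.
    """
    return ERROR_ACTIONS.get(error_key, Action.OUTSIDE_RETRY_LOGIC)
```

Keeping the table in one place would make it easy to update as new errors are added to this ticket.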

Specific issues encountered in production

From https://gitlab.cern.ch/cta/CTA/-/issues/1096

Wrong file checksum in archive/retrieve sessions causes mount to be canceled
File size mismatch in archive/retrieve sessions causes mount to be canceled. (Done)

For both of these cases, the best option is to report the error to the user and continue the mount.
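
The "report and continue" behaviour could look roughly like the following Python sketch. It is not taken from the CTA code base: `run_session` and the tuple layout of its input are hypothetical, and Adler-32 via `zlib.adler32` is used here only as an example checksum.

```python
import zlib


def run_session(files):
    """Process every file in the session, reporting bad files instead of aborting.

    `files` is an iterable of (name, payload, expected_size, expected_checksum)
    tuples; this layout is an assumption made for the sketch.
    Returns (done, reported), where `reported` holds (name, reason) pairs.
    """
    done, reported = [], []
    for name, payload, expected_size, expected_checksum in files:
        if len(payload) != expected_size:
            reported.append((name, "file size mismatch"))
            continue  # report to the user, keep the mount going
        if zlib.adler32(payload) != expected_checksum:
            reported.append((name, "wrong file checksum"))
            continue  # report to the user, keep the mount going
        done.append(name)
    return done, reported
```

The key point is that a per-file problem ends only that file's job, so the remaining files on the mounted tape are still processed.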

From https://gitlab.cern.ch/cta/CTA/-/issues/1076

For retrieve sessions, if disk configuration for a request is removed, the request is processed as if it had no disk system (added in https://gitlab.cern.ch/cta/CTA/-/commit/537e47c1d48ce0f206e875ac0f26aeddf1a7214e)

From https://gitlab.cern.ch/cta/CTA/-/issues/1054

When disk system reservations are filled during a retrieve session, or if the disk system is unreachable, the mount is canceled and the queue is put to sleep (for now, only in the case of a full disk system)

From https://gitlab.cern.ch/cta/operations/-/issues/665

Retrieve requests are lost when a tape is disabled

Edited Mar 30, 2022 by Michael Davis