Cataloging reasons for tape session failures
During Recall/Migration tape sessions, many errors can occur. The current error handling logic of the tape sessions is not ideal: in most cases it simply stops the tape session in progress, even when better alternatives exist for handling the error. For example (https://gitlab.cern.ch/cta/CTA/-/issues/1096#note_5142172), when the checksum of the file does not match the checksum calculated in the session, the session is stopped instead of reporting the error to the user and continuing. We should separate the errors we can simply report from those that warrant a canceled tape session (if any).
Please update this ticket description with any errors you have come across and relevant links.
Classification of errors
Errors reading file from disk
- disk system unavailable or network issues - retryable
- file does not exist (e.g. disk write failed/FST executes delete on close) - not retryable
- bad file on disk - wrong size - not retryable (to be confirmed)
- bad file on disk - bad checksum - retryable
- I/O error during transfer (disk file is OK but transfer was interrupted) - retry
- I/O error during transfer (disk file is OK but data was corrupted in-flight) - retry
- Error while reading a file: memory block not filled up, but the file is not fully read yet - retry
Errors writing file to tape
- Wrong checksum of tape file written (i.e. if the checksum of the file written to tape is different from the one read from disk). This is detected during a tape flush. A batch of jobs is validated and added to the catalogue. If one job has a wrong checksum, the whole batch is failed. - retry/don't unmount (done)
- Failed to start tape write session - not a part of retry logic
- Drive is write protected. - not a part of retry logic
- Aborting write session in presence of critical tape alerts - not a part of retry logic
- Drive encryption could not be enabled for this mount - not a part of retry logic
Errors reading file from tape
- #630 Files stuck in queue after tape state change - race condition where a file is queued on a tape that is OK, but the tape changes to disabled before the tape is mounted. The request is neither failed nor retried; it gets stuck.
- Tape's label is either missing or not valid - fatal error, not retryable
- Drive encryption could not be enabled for this mount - outside retry logic
- Session corrupted (positioning error) - retry
Errors writing file to disk
- desynchronization between tape read and disk write - retry
- failed to open disk file for writing - retry
- failed to write payload to file - retry
- failed to close the file - retry
- Mismatch between expected and received fileid or blockid - retry
Any errors that don't fall into the above categories
- Exception when unmounting/unloading the tape - session finished, nothing to retry
- Drive is not found - tape server process should not be restarted if configuration is bad
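The classification above could be captured as a lookup table that the session consults when an error occurs. The following is only a minimal sketch: the error keys, the `Action` enum and the `action_for` helper are hypothetical names, not existing CTA identifiers, and falling back to ending the session for unknown errors is an assumption that mirrors the current conservative behaviour.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"              # requeue the job, keep the session running
    FAIL_JOB = "fail_job"        # report to the user, continue with the next job
    END_SESSION = "end_session"  # outside the retry logic, the session cannot go on


# Hypothetical error keys; the actions follow the classification in this ticket.
ERROR_ACTIONS = {
    "disk_unavailable": Action.RETRY,
    "disk_file_missing": Action.FAIL_JOB,
    "disk_file_bad_checksum": Action.RETRY,
    "transfer_io_error": Action.RETRY,
    "tape_write_checksum_mismatch": Action.RETRY,
    "drive_write_protected": Action.END_SESSION,
    "encryption_not_enabled": Action.END_SESSION,
    "tape_label_invalid": Action.END_SESSION,
    "tape_positioning_error": Action.RETRY,
    "disk_write_failed": Action.RETRY,
}


def action_for(error_key: str) -> Action:
    """Return the action for an error; unknown errors end the session (assumption)."""
    return ERROR_ACTIONS.get(error_key, Action.END_SESSION)
```

Keeping the mapping in one table would make it easy to review each entry against this ticket and to change a single error from "end session" to "report and continue" without touching the session code.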
Specific issues encountered in production
From https://gitlab.cern.ch/cta/CTA/-/issues/1096
- Wrong File checksum in archive/retrieve sessions causes mount to be canceled
- File size mismatch in archive/retrieve sessions causes mount to be canceled. (Done)
For both of these cases, the best option is to report the error to the user and continue the mount.
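The "report and continue" behaviour could look roughly like the sketch below. It is an illustration only: `run_session` and the job tuple layout are hypothetical, and Adler-32 (via Python's `zlib.adler32`) is assumed as the checksum type purely for the example.

```python
import zlib


def run_session(jobs):
    """Process jobs given as (name, payload_bytes, expected_adler32) tuples.

    A checksum mismatch fails only the affected job; the session carries on
    with the remaining jobs instead of being canceled.
    """
    done, failed = [], []
    for name, payload, expected in jobs:
        actual = zlib.adler32(payload)
        if actual != expected:
            # Report the error for this job and move on to the next one.
            failed.append(
                (name, f"checksum mismatch: expected {expected:#010x}, got {actual:#010x}")
            )
            continue
        done.append(name)
    return done, failed
```

With three jobs where only the second has a wrong checksum, the first and third would still complete and the second would be returned in the `failed` list for reporting.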
From https://gitlab.cern.ch/cta/CTA/-/issues/1076
- For retrieve sessions, if disk configuration for a request is removed, the request is processed as if it had no disk system (added in https://gitlab.cern.ch/cta/CTA/-/commit/537e47c1d48ce0f206e875ac0f26aeddf1a7214e)
From https://gitlab.cern.ch/cta/CTA/-/issues/1054
- When disk system reservations are filled during a retrieve session, or if the disk system is unreachable, the mount is canceled and the queue is put to sleep (for now this is done only in the full disk system case)
From https://gitlab.cern.ch/cta/operations/-/issues/665
- Retrieve requests are lost when a tape is disabled