Disk Write errors don't stop session
Update: See comment below
We discovered a new issue at Fermilab. It may only affect Enstore tapes (for reasons explained below). We see this a couple of times a day.
What's happening is this. We have an Enstore tape which we are happily reading, skipping by file marks to read the correct files. At some point we tell the drive to skip forward N file marks and it looks like it skipped N-1 file marks. It then reads a file, but the write fails the checksum test because it's the wrong file.
Next it skips a number of file marks and reads that file, again causing a write error because it's writing the wrong file. This may repeat for hundreds of files.
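To illustrate why one short skip can make many later files fail, here is a toy model. It assumes each later positioning step is a relative skip computed from where the software believes the head is. The function name, its parameters, and the layout (file i at position i) are made up for illustration and are not CTA code.

```python
def read_files(wanted, short_skip_at=None):
    """Toy model of relative file-mark skips; file i sits at tape position i.

    wanted: ascending file indices to read.
    short_skip_at: index into `wanted` where the drive skips one mark too few
                   (hypothetical fault injection).
    Returns a list of (wanted_file, file_actually_read) pairs.
    """
    believed = 0  # where the software thinks the head is
    actual = 0    # where the head really is
    out = []
    for n, target in enumerate(wanted):
        skip = target - believed
        # Inject the fault: the drive moves one file mark fewer than asked.
        moved = skip - 1 if n == short_skip_at else skip
        actual += moved
        out.append((target, actual))
        # Reading a file advances the head past it.
        actual += 1
        believed = target + 1
    return out


if __name__ == "__main__":
    for wanted_file, got in read_files([2, 5, 9], short_skip_at=1):
        status = "ok" if wanted_file == got else "WRONG FILE"
        print(f"wanted {wanted_file}, read {got}: {status}")
```

In this model every file after the short skip is off by one, which matches the pattern of repeated checksum failures until the tape is repositioned from a known point.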
We think this is a tape drive problem as when I use cta-readtp to read the last good file and then the first failed file, it gets the right files. And if the tape is mounted again, the failed files are read just fine.
Ideally one of two things would happen. We could recover this completely if a rewind and then a forward seek to the first failed file were issued. But that wouldn't work for the more general case. We'd recover more quickly if, after some number of failed writes, the session was ended, the tape ejected, and the remaining work put back in the queue. It seems like this is what already happens with tape read errors.
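The second option could look roughly like the sketch below. This is not CTA code: the class, the function, and the threshold of 5 are all hypothetical, and the per-file checksum results are passed in as a plain list of booleans.

```python
from dataclasses import dataclass


@dataclass
class SessionGuard:
    """Hypothetical helper: counts consecutive failed disk writes in a session."""
    max_consecutive_failures: int = 5  # assumed threshold, not a real CTA setting
    consecutive_failures: int = 0

    def record(self, checksum_ok: bool) -> bool:
        """Record one file result; return True if the session should end."""
        if checksum_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures


def run_session(results, guard):
    """Walk per-file checksum results in order.

    Returns (kept, requeued): indices handled in this session, and indices
    to put back in the queue for a later mount.
    """
    for i, ok in enumerate(results):
        if guard.record(ok):
            # Requeue the whole failed run plus everything not yet attempted.
            first_failed = i - guard.consecutive_failures + 1
            return list(range(first_failed)), list(range(first_failed, len(results)))
    return list(range(len(results))), []


if __name__ == "__main__":
    guard = SessionGuard(max_consecutive_failures=3)
    kept, requeued = run_session([True, True, False, False, False, True], guard)
    print("kept:", kept)
    print("requeued:", requeued)
```

Counting only consecutive failures means an isolated bad file does not end the session, while a run of wrong files after a mispositioned skip does.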
I think recovery on CTA tapes, were the same thing to happen, would be much faster, because we'd try to read a file header, fail (because it would be a file or a file trailer), and retry.