As discussed on #124 (closed), we need to add a new column in the catalogue to mark files that should be ignored during a user/repack retrieve.
The requests for these files should fail gracefully and not be performed.
In user retrieves:
The retrieve should fail and an error should be returned to the user.
Something like: File exists but is unavailable.
In repack retrieves:
The retrieve should fail and an error should be returned to the repack queue.
The failed files counter should be increased.
EDIT: As discussed in #218 (comment 6292807), we will go for a simplified approach that just requires removing all retries from repack requests. It will be up to the operator to decide what to do with the failing tape files (either retry manually or other option).
In both cases, if the file has a dual tape copy, the request should be enqueued to the working tape copy.
As mentioned on #124 (comment 6248551), the INCLUDING DEFAULTS flag had to be added to the creation of the temporary table TEMP_TAPE_FILE_INSERTION_BATCH.
This is a pre-requisite for the catalogue schema change that adds the IS_ACCESSIBLE column to the TAPE_FILE table in postgres.
The INCLUDING DEFAULTS flag should be removed once the catalogue change has been deployed and merged to main. It can be replaced by directly referencing the newly added column IS_ACCESSIBLE.
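For reference, the temporary-table creation and the eventual schema change described above would look roughly like the following. This is a sketch only; the exact DDL in the CTA catalogue code (column types, default values) may differ:

```sql
-- Sketch only; the actual CTA DDL may differ.
-- INCLUDING DEFAULTS copies column defaults from TAPE_FILE, so a newly
-- added column with a default does not break the batch insertion.
CREATE TEMPORARY TABLE TEMP_TAPE_FILE_INSERTION_BATCH
  (LIKE TAPE_FILE INCLUDING DEFAULTS);

-- The catalogue schema change this prepares for (illustrative type/default):
-- ALTER TABLE TAPE_FILE ADD COLUMN IS_ACCESSIBLE CHAR(1) DEFAULT '1';
```

Once the IS_ACCESSIBLE column exists in the deployed schema, the temporary table can list its columns explicitly instead of relying on INCLUDING DEFAULTS.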
Pass a reference to the catalogue (&m_scheduler.getCatalogue()) as part of the constructor arguments of the RecallTaskInjector object. The catalogue is currently passed as an argument to RecallTaskInjector::initRAO but, by making it a class field, we can make it available to all methods.
Extend the cta::RetrieveJob class by adding a new field is_unavailable. This field would be set inside RecallTaskInjector::injectBulkRecalls (TBC?).
Fail the task in DiskWriteTask::execute(...).
Advantages:
Only checks the file at the moment we are about to read it.
Disadvantage:
More complex.
Possible strategy 2
Check the catalogue immediately before the file is queued. If its is_unavailable flag is set, fail the job immediately.
Advantages:
Simpler to implement.
Disadvantages:
Only checks the file at the time it is enqueued, which may be hours or days before the actual retrieve.
Open questions
How to avoid doing too many queries to the DB? Ideally, we should do a single query for all files in a single VID.
Some notes:
Retrieve requests (recalls) are managed and ordered by the tapeserver here: RecallTaskInjector::injectBulkRecalls()
Reads are executed here: castor::tape::tapeserver::daemon::DataTransferSession::executeRead()
Tasks are injected asynchronously by the task injector, triggered in TapeReadSingleThread::popAndRequestMoreJobs().
Jobs are popped by the tapeserver with the line auto jobsList = m_retrieveMount.getNextJobBatch(reqFiles, reqSize, m_lc); in RecallTaskInjector.cpp:322.
As discussed in last week's CTA development meeting, we need to properly outline all possible scenarios for handling a tape with problematic areas.
In case of a tape with problematic areas, the highest priority is to quickly move as many files as possible away onto another tape (= repack). The reason is that there is usually a user waiting for the majority of the files on that tape, and the repack process shouldn't block the user more than necessary (i.e. one tape read + one tape write).
Currently, cta-taped is designed to perform multiple retries when retrieving a file from a problematic tape: multiple retries within the mount, and then at least 2 different mounts. These retries block the repack operation, introducing a significant delay before the user can quickly retrieve at least some portion of the good files. (Not to forget that retries also put stress on the drive and on the tape media, which hopefully still contains a lot of good files.)
In order to shorten this time, there are two possible options:
Option 1 (historical - as done in CASTOR): Manually disable some files on a problematic tape.
With this option, the operator can manually disable certain files on a tape which he/she knows are problematic. cta-taped will then skip over those files, and the already queued requests for them will be failed. The idea is that the operator starts with a large range and then shrinks it, so that in the end only the bad files remain disabled.
Example: A tape has fSEQs from 1 to 10000 and the bad area is around files 6500-7000. The operator can disable fSEQs 6500-7000 and re-launch the repack. After all other files are read, the operator re-enables files 6800-7000 and retries the repack. He/she repeats these operations multiple times until it is identified that, for example, files 6500-6510 and 6700-6750 are impossible to retrieve (= lost).
Option 2 (new - for CTA): Do not retry when repacking.
If a tape is in the REPACKING state, cta-taped will ignore the retry logic and simply skip to the next file. This means that with one quick repack retrieve pass, most of the readable files are quickly read. Those files that couldn't be retrieved in one pass will be retrieved in a subsequent repack operation that the operator will try using a different tape drive. This is to be repeated until, for example, 3 different drives have failed to read the remaining files.
At the end of either option 1 or option 2, the operator has to change the approach and use a different script to try to extract the problematic files (at CERN this is done using the tape-extract script), but the hope is that the number of those problematic files is low, as this is a manual operation.
The above mentioned options should also be reviewed by external sites, in particular by @timkrtch from DESY as expressed in the ticket #124 (closed).
While it was initially suggested that option 1 be implemented, it is clear that option 2 is a lot simpler. That is why the preferred solution at CERN is option 2.
We discussed these two options in the dev meeting of 16/12/2022, with the following conclusions:
The two options are not mutually exclusive.
However, option #2 (do not retry when repacking) is much simpler to implement and operate, while option #1 (manually disable some files on a problematic tape) is more complex and requires changing the catalogue.
Therefore, we will implement option #2, but will keep discussing with our external collaborators if option #1 is also necessary.
If we are repacking a problematic tape, we want the data to be safe, i.e. to end up on a good tape. If the new tape has a problem during write, we want to spot this as soon as possible.
So YES, removing the retry during the repack archive is fine. If we see massive problems with this (for example, if we are unable to repack data because several new tapes are bad), we will have to revisit it.
Ideally, all those parameters (the number of retries during the same mount for user and repack archive/retrieve) would be configurable in cta-taped.conf, but that is a different story.
In this case I will keep most of the logic as it is, which will result in the repack archive having the same number of retries as the repack retrieve (zero).
In case this becomes problematic (or if we need to configure it in cta-taped.conf), we can create a new issue to cover that new use case.