As discussed on #124 (closed), we need to add a new column in the catalogue to mark files that should be ignored during a user/repack retrieve.
The requests for these files should fail gracefully and not be performed.
In user retrieves:
The retrieve should fail and an error should be returned to the user.
Something like: File exists but is unavailable.
In repack retrieves:
The retrieve should fail and an error should be returned to the repack queue.
The failed files counter should be increased.
EDIT: As discussed in #218 (comment 6292807), we will go for a simplified approach that just requires removing all retries from repack requests. It will be up to the operator to decide what to do with the failing tape files (either retry manually or other option).
In both cases, if the file has a dual tape copy, the request should be enqueued to the working tape copy.
As mentioned on #124 (comment 6248551), the INCLUDING DEFAULTS flag had to be added to the creation of the temporary table TEMP_TAPE_FILE_INSERTION_BATCH.
This is a pre-requisite for the catalogue schema change that adds the IS_ACCESSIBLE column to the TAPE_FILE table in postgres.
The INCLUDING DEFAULTS flag should be removed once the catalogue change has been deployed and merged to main. It can be replaced by directly referencing the newly added column IS_ACCESSIBLE.
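For reference, the temporary-table creation and the eventual schema change described above would look roughly like the following. This is a sketch only; the exact DDL in the CTA catalogue code (column types, default values) may differ:

```sql
-- Sketch only; the actual CTA DDL may differ.
-- INCLUDING DEFAULTS copies column defaults from TAPE_FILE, so a newly
-- added column with a default does not break the batch insertion.
CREATE TEMPORARY TABLE TEMP_TAPE_FILE_INSERTION_BATCH
  (LIKE TAPE_FILE INCLUDING DEFAULTS);

-- The catalogue schema change this prepares for (illustrative type/default):
-- ALTER TABLE TAPE_FILE ADD COLUMN IS_ACCESSIBLE CHAR(1) DEFAULT '1';
```

Once the IS_ACCESSIBLE column exists in the deployed schema, the temporary table can list its columns explicitly instead of relying on INCLUDING DEFAULTS.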
Pass a reference to the catalogue (&m_scheduler.getCatalogue()) as part of the constructor arguments of the RecallTaskInjector object. The catalogue is currently passed as an argument to RecallTaskInjector::initRAO but, by making it a class field, we can make it available to all methods.
Extend the cta::RetrieveJob class by adding a new field is_unavailable. This field would be set inside RecallTaskInjector::injectBulkRecalls (TBC?).
Fail the task in DiskWriteTask::execute(...).
Advantages:
Only checks the file at the moment we are about to read it.
Disadvantage:
More complex.
Possible strategy 2
Check the catalogue immediately before the file is queued. If its is_unavailable flag is set, fail the job immediately.
Advantages:
Simpler to implement.
Disadvantages:
Only checks the file at the time it is enqueued, which may be hours or days before the actual retrieve.
Open questions
How to avoid doing too many queries to the DB? Ideally, we should do a single query for all files in a single VID.
Some notes:
Retrieve requests (recalls) are managed and ordered by the tapeserver here: RecallTaskInjector::injectBulkRecalls()
Reads are executed here: castor::tape::tapeserver::daemon::DataTransferSession::executeRead()
Tasks are injected asynchronously by the task injector, triggered in TapeReadSingleThread::popAndRequestMoreJobs().
Jobs are popped by the tapeserver with the line auto jobsList = m_retrieveMount.getNextJobBatch(reqFiles, reqSize, m_lc); in RecallTaskInjector.cpp:322.
As discussed in last week's CTA development meeting, we need to properly outline all possible scenarios for handling a tape with problematic areas.
In case of a tape with problematic areas, the highest priority is to quickly move as many files as possible away onto another tape (= repack). The reason is that there is usually a user waiting for the majority of the files on that tape, and the repack process shouldn't block the user more than necessary (i.e. one tape read + one tape write).
Currently, cta-taped is designed to perform multiple retries when retrieving a file from a problematic tape: multiple retries within the mount, and then at least 2 different mounts. These retries block the repack operation, introducing a significant delay before the user can quickly retrieve at least some portion of the good files. (Not to forget that retries also put stress on the drive and on the tape media, which hopefully still contains a lot of good files.)
In order to shorten this time, there are two possible options:
Option 1 (historical - as done in CASTOR): Manually disable some files on a problematic tape.
With this option, the operator can manually disable certain files on a tape which he/she knows are problematic. cta-taped will then skip over those files, and the already queued requests for them will be failed. The idea is that the operator starts with a large range and then shrinks it, so that in the end only the bad files remain disabled.
Example: A tape has fSEQs from 1 to 10000 and the bad area is around files 6500-7000. The operator can disable fSEQs 6500-7000 and re-launch the repack. After all other files are read, the operator re-enables files 6800-7000 and retries the repack. He/she repeats these operations multiple times until it is identified that, for example, files 6500-6510 and 6700-6750 are impossible to retrieve (= lost).
Option 2 (new - for CTA): Do not retry when repacking.
If a tape is in the REPACKING state, cta-taped will ignore the retry logic and simply skip to the next file. This means that with one quick repack retrieve pass, most of the readable files are quickly read. Those files that couldn't be retrieved in one pass will be retrieved in a subsequent repack operation that the operator will try using a different tape drive. This is to be repeated until, for example, 3 different drives have failed to read the remaining files.
At the end of either option 1 or option 2, the operator has to change the approach and use a different script to try to extract the problematic files (at CERN this is done using the tape-extract script), but the hope is that the number of those problematic files is low, as this is a manual operation.
The above mentioned options should also be reviewed by external sites, in particular by @timkrtch from DESY as expressed in the ticket #124 (closed).
While it was initially suggested that option 1 be implemented, it is clear that option 2 is a lot simpler. That is why the preferred solution at CERN is option 2.
We discussed these two options in the dev meeting of 16/12/2022, with the following conclusions:
The two options are not mutually exclusive.
However, option #2 (do not retry when repacking) is much simpler to implement and operate, while option #1 (manually disable some files on a problematic tape) is more complex and requires changing the catalogue.
Therefore, we will implement option #2, but will keep discussing with our external collaborators if option #1 is also necessary.
If we are repacking a problematic tape, we want the data to be safe, i.e. to end up on a good tape. If the new tape has a problem during write, we want to spot this as soon as possible.
So YES, removing the retry during the repack archive is fine. If we see massive problems with this (for example, if we are unable to repack data because several new tapes are bad), we will have to revisit it.
Ideally, all those parameters (the number of retries during the same mount for user and repack archive/retrieve) would be configurable in cta-taped.conf, but that is a different story.
In this case I will keep most of the logic as it is, which will result in the repack archive having the same number of retries as the repack retrieve (zero).
In case this becomes problematic (or if we need to configure it in cta-taped.conf), we can create a new issue to cover that new use case.