Do not queue file for retrieve if EOS size/checksum do not match CTA Catalogue
Summary
CTA does not check the consistency between EOS metadata and CTA catalogue metadata when queuing a file for retrieve.
When the file is written back to EOS disk, the inconsistent metadata triggers a delete-on-close on the EOS side, so the file is never written to disk.
Even worse: the tape containing this file will accumulate failed sessions and errors because of this file. The tape will then be repacked without complaint, and the repacked tape will exhibit the same issue...
Verification would run into a similar issue, as that process is also a tape-side-only check.
The CTA software needs to perform more cross-checks between EOS metadata and the CTA catalogue, and fail much earlier...
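The cross-check suggested above could be as simple as comparing the size and checksum that EOS reports with the catalogue entry before the retrieve request is queued. A minimal sketch in Python, assuming hypothetical names (`FileMetadata`, `check_retrieve_consistency`) that are illustrative only and not CTA's real API:

```python
from dataclasses import dataclass

@dataclass
class FileMetadata:
    size: int      # file size in bytes
    adler32: str   # hex checksum as reported by each system

def check_retrieve_consistency(eos: FileMetadata, catalogue: FileMetadata) -> None:
    """Fail at queueing time instead of much later on the tape drive."""
    if eos.size != catalogue.size:
        raise ValueError(
            f"size mismatch: EOS={eos.size} catalogue={catalogue.size}")
    if eos.adler32.lower() != catalogue.adler32.lower():
        raise ValueError(
            f"checksum mismatch: EOS={eos.adler32} catalogue={catalogue.adler32}")
```

With such a check, the corrupted-checksum scenario below would be rejected immediately with an explicit error instead of producing an endless retry loop on the drive.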
Steps to reproduce
@jleduc reproduced this in CI:
- archive a file
- change its checksum in EOS using `eos-ns-inspect change-fid ... --new-checksum 000000 ...`
- clear the EOS namespace cache with `eos ns cache drop -f` and `eos ns cache drop -d`
- recall it...
What is the current bug behavior?
The file is retrieved and injected into EOS, where it is deleted on close.
The tape drive stays in the DrainToDisk state for close to 15 minutes, and the tape session fails with the following information:
Jul 30 13:34:56.286819 tpsrv01 cta-taped: LVL="INFO" PID="394" TID="394" MSG="Tape session finished"
tapeVid="V02001" mountType="Retrieve" mountId="25" tapeDrive="VDSTK21" vendor="vendor" volReqId="25" vo="vo"
mediaType="T10K500G" tapePool="ctasystest" logicalLibrary="VDSTK21" capacityInBytes="500000000000"
stillOpenFileForThread0="root://ctaeos.toto.svc.cluster.local//eos/ctaeos/preprod/test1?eos.lfn=fxid:100002719&
eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=394"
wasTapeMounted="1" mountTime="0.932477" positionTime="0.006280" waitInstructionsTime="0.155515"
waitFreeMemoryTime="0.000006" waitDataTime="0.000000" waitReportingTime="0.001812" checksumingTime="0.000000"
readWriteTime="0.004242" flushTime="0.000000" unloadTime="0.007018" unmountTime="1.009817"
encryptionControlTime="0.003925" transferTime="0.161575" totalTime="1.968544" deliveryTime="886.480280"
drainingTime="884.511736" dataVolume="394" filesCount="1" headerVolume="480" payloadTransferSpeedMBps="0.000200"
driveTransferSpeedMBps="0.000444" repackFilesCount="0" userFilesCount="1" verifiedFilesCount="0"
repackBytesCount="0" userBytesCount="394" verifiedBytesCount="0" Error_sessionKilled="1" killSignal="9"
status="failure"
As the drive process was killed, the garbage collector requeues the request without increasing its retry counter, so it is requeued for another round, forever.
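The expected behavior is a bounded retry policy: the counter is incremented even when the session is killed, and after a limited number of attempts the request lands in a failed-requests queue instead of looping. A sketch under that assumption, with all names (`requeue_or_fail`, `MAX_RETRIES`) purely illustrative:

```python
MAX_RETRIES = 3  # illustrative bound, not a real CTA setting

def requeue_or_fail(request: dict, retrieve_queue: list, failed_queue: list) -> None:
    # Count every attempt, including sessions that ended in a kill.
    request["retries"] = request.get("retries", 0) + 1
    if request["retries"] >= MAX_RETRIES:
        failed_queue.append(request)    # surfaced for operator analysis
    else:
        retrieve_queue.append(request)  # bounded retry, not an endless loop
```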
No failed request is ever created for these files.
What is the expected correct behavior?
It should not loop endlessly; the request should end up in the failed requests for analysis.
Injection into the EOS diskserver should work even if the injected checksum is wrong: we should only check that the checksum of the content injected into the FST matches what is in the CTA namespace.
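In other words, the check on the FST side should validate the bytes actually received against the checksum recorded in CTA, ignoring the (possibly corrupted) checksum injected into the EOS namespace. A minimal sketch, using Python's `zlib.adler32` as a stand-in for the EOS adler checksum; the function name is hypothetical:

```python
import zlib

def content_matches_cta(payload: bytes, cta_adler32_hex: str) -> bool:
    """Compare the adler32 of the received bytes with the CTA-recorded value."""
    actual = format(zlib.adler32(payload) & 0xFFFFFFFF, "08x")
    return actual == cta_adler32_hex.lower()
```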
EOS-side consistency checking is part of EOS operations.
Relevant logs and/or screenshots
See operations#428