Tape server can only deal with archival errors that occur before writing the first block
Summary
When an EOS read fails on the tape server, the retry logic differs depending on the type of error encountered.
Steps to reproduce
Bitflips are the worst case: only 1 corrupted file is detected per batch, and then the full batch is requeued. This means that if 5 files have bitflips (same size, same location, but different checksum), the full batch is requeued 5 times (multiplied by the number of mount retries).
The behaviour is different if the file size differs from the expected one or if the disk replica does not exist. In these cases, all files in error are listed within 1 or 2 mounts.
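The difference between the two behaviours can be illustrated with a toy model. This is only a sketch, not the tape server's code: `count_requeues`, `stop_at_first_error` and the file names are hypothetical, and it assumes that a file is dropped from the batch once it has been reported in error. Per-file and per-mount retries are ignored, so the real numbers would be higher.

```python
def count_requeues(batch, bad_files, stop_at_first_error):
    """Count how often the batch is requeued before no bad file is left.

    Toy model: a file reported in error is removed from the batch,
    and everything else is requeued together.
    """
    pending = list(batch)
    requeues = 0
    while True:
        failed = [f for f in pending if f in bad_files]
        if stop_at_first_error:
            # Bitflip-like behaviour: only the first bad file is reported.
            failed = failed[:1]
        if not failed:
            return requeues
        pending = [f for f in pending if f not in failed]
        requeues += 1


# Hypothetical batch of 10 files, 5 of which are corrupted.
batch = [f"file{i:02d}" for i in range(10)]
bad_files = {"file01", "file03", "file04", "file07", "file09"}

one_per_pass = count_requeues(batch, bad_files, stop_at_first_error=True)
all_per_pass = count_requeues(batch, bad_files, stop_at_first_error=False)
print(one_per_pass, all_per_pass)
```

Under these assumptions, the number of requeues grows with the number of corrupted files when only one error is reported per pass, while it stays constant when all errors are reported together.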
Corrupting 10 bytes of a file (the 11th to the 20th) by overwriting them with zeros:
dd if=/dev/zero of=/data1113/00bbbbb/aaaaaaa bs=1 seek=10 count=10 conv=notrunc
The file keeps the same size but has a different checksum.
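The effect of this corruption can be checked on a scratch file with a short Python sketch. The helpers `adler32_of` and `zero_bytes` are hypothetical names, and Adler-32 is used here only as an assumed example of a checksum type; the checksum configured on the instance may differ.

```python
import os
import tempfile
import zlib


def adler32_of(path):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def zero_bytes(path, offset=10, count=10):
    """Overwrite `count` bytes at `offset` in place, like the dd command."""
    with open(path, "r+b") as f:  # r+b: no truncation, like conv=notrunc
        f.seek(offset)
        f.write(b"\x00" * count)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "replica")
    with open(path, "wb") as f:
        f.write(bytes(range(1, 256)) * 16)  # deterministic, non-zero content

    size_before, sum_before = os.path.getsize(path), adler32_of(path)
    zero_bytes(path)
    size_after, sum_after = os.path.getsize(path), adler32_of(path)

print(size_before == size_after, sum_before != sum_after)
```

Note that the checksum only changes if the overwritten bytes were not already zero, which is why the scratch file is filled with non-zero data.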
What is the current bug behavior?
Too many requeues of far too many files. I had to truncate the file to limit the number of retries and cut down the initial test expectations, as the requeuing was taking far too long.
What is the expected correct behavior?
All read errors upon archival should behave the same way: wrong size, missing file, and bitflip should share the same retry mechanism, which is 3 tries per session and 2 mount sessions, if I remember correctly.
Relevant logs and/or screenshots
https://meter-cta.web.cern.ch/d/ZxPmpXOWk/errors?orgId=1&from=1613469585371&to=1613472747600
See also the output of cta-admin --json fr ls -l for the archivetest files in the atlas instance:
[
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001421.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._000600.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._000607.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001410.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001752.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001862.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001054.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001706.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001751.pool.root.1",
"/eos/ctaatlas/archivetest/data15_13TeV/physics_EnhancedBias/00267638/data15_13TeV.00267638.physics_EnhancedBias.recon.HIST.r6855_tid05763091_00/HIST.05763091._001257.pool.root.1"
]