Investigate an archiveretrieve-eos5 failure
Hi. I happened to notice that 1 retrieve out of 10000 didn't happen during the archiveretrieve-eos5 test of pipeline 4385568:
e.g. from tpsrv01-taped.log
Aug 19 13:06:14.212340 tpsrv01 cta-taped: LVL="ERROR" PID="9220" TID="10326" MSG="File writing to disk failed" thread="DiskWrite" tapeDrive="VDSTK31" tapeVid="V03001" mountId="9" threadCount="10" threadID="4" fileId="4294971556" dstURL="root://ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local//eos/ctaeos/preprod/b6d8db5c-2a5b-4323-94cd-79fe84e2e830/0/test00004250?eos.lfn=fxid:10ad&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=15360" fSeq="4257" errorMessage="In XrootWriteFile::XrootWriteFile failed XrdCl::File::Open() on root://ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local//eos/ctaeos/preprod/b6d8db5c-2a5b-4323-94cd-79fe84e2e830/0/test00004250?eos.lfn=fxid:10ad&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=15360 [FATAL] Redirect limit has been reached code:306 errNo:0 status:3" readWriteTime="0.000000" checksumingTime="0.000000" waitDataTime="0.129317" waitReportingTime="0.138888" checkingErrorTime="0.000002" openingTime="0.000000" closingTime="0.000000" transferTime="0.000000" totalTime="0.000000" dataVolume="0" globalPayloadTransferSpeedMBps="0.000000" diskPerformanceMBps="0.000000" openRWCloseToTransferTimeRatio="0.000000"
Using the pipeline artificats (logs) and comparing entries from fst/xrdlog.fst mentioning that ID (0x10ad) with those containing some other ID (e.g. 0x10ac):
Fid 10ac GOOD (only start of each log line shown)
220819 13:02:43 time=1660906963.312406 func=open level=INFO logid=73060f38-1fae-11ed-a6aa-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=XrdFstOfsFile:135
220819 13:02:43 time=1660906963.312569 func=ProcessCapOpaque level=INFO logid=73060f38-1fae-11ed-a6aa-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=XrdFstOfsFile:2645
220819 13:02:43 time=1660906963.312606 func=open level=INFO logid=73060f38-1fae-11ed-a6aa-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=XrdFstOfsFile:214
220819 13:02:43 time=1660906963.312749 func=open level=INFO logid=73060f38-1fae-11ed-a6aa-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=XrdFstOfsFile:534
220819 13:02:43 time=1660906963.312776 func=fileOpen level=INFO logid=7309851e-1fae-11ed-8d16-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=LocalIo:70
220819 13:02:43 time=1660906963.312984 func=open level=INFO logid=73060f38-1fae-11ed-a6aa-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=XrdFstOfsFile:710
220819 13:02:43 time=1660906963.312998 func=stat level=INFO logid=73060f38-1fae-11ed-a6aa-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8426bf5700 source=XrdFstOfsFile:1222
220819 13:02:43 time=1660906963.630957 func=Report level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84397fd700 source=Report:49
220819 13:03:32 time=1660907012.305211 func=open level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:135
220819 13:03:32 time=1660907012.305349 func=ProcessCapOpaque level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:2645
220819 13:03:32 time=1660907012.305397 func=open level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:214
220819 13:03:32 time=1660907012.305551 func=open level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:534
220819 13:03:32 time=1660907012.305579 func=fileOpen level=INFO logid=903d3a86-1fae-11ed-b7da-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=LocalIo:70
220819 13:03:32 time=1660907012.305761 func=open level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:710
220819 13:03:32 time=1660907012.305784 func=stat level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:1222
220819 13:03:32 time=1660907012.352892 func=VerifyChecksum level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:3516
220819 13:03:32 time=1660907012.352954 func=VerifyChecksum level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:3516
220819 13:03:32 time=1660907012.353914 func=VerifyChecksum level=INFO logid=903cc8e4-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:3516
220819 13:03:32 time=1660907012.875123 func=Report level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84397fd700 source=Report:49
220819 13:03:43 time=1660907023.317153 func=MgmSyncer level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8436bfe700 source=MgmSyncer:94
220819 13:03:45 time=1660907025.203553 func=_rem level=INFO logid=109f33b0-1fae-11ed-9194-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84399fe700 source=XrdFstOfs:1242
220819 13:06:14 time=1660907174.049148 func=open level=INFO logid=f0a4dfaa-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:135
220819 13:06:14 time=1660907174.049319 func=ProcessCapOpaque level=INFO logid=f0a4dfaa-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:2645
220819 13:06:14 time=1660907174.049357 func=open level=INFO logid=f0a4dfaa-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:214
220819 13:06:14 time=1660907174.049568 func=open level=INFO logid=f0a4dfaa-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:534
220819 13:06:14 time=1660907174.049599 func=fileOpen level=INFO logid=f0a5683a-1fae-11ed-8632-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=LocalIo:70
220819 13:06:14 time=1660907174.049816 func=open level=INFO logid=f0a4dfaa-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:710
220819 13:06:14 time=1660907174.049830 func=stat level=INFO logid=f0a4dfaa-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:1222
220819 13:06:14 time=1660907174.236681 func=Report level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84397fd700 source=Report:49
220819 13:07:14 time=1660907234.644407 func=MgmSyncer level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8436bfe700 source=MgmSyncer:94
[ these are all the lines ]
with 10ad BAD (only start of each log line shown)
220819 13:02:43 time=1660906963.317464 func=open level=INFO logid=73074420-1fae-11ed-8bd0-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=XrdFstOfsFile:135
220819 13:02:43 time=1660906963.317641 func=ProcessCapOpaque level=INFO logid=73074420-1fae-11ed-8bd0-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=XrdFstOfsFile:2645
220819 13:02:43 time=1660906963.317674 func=open level=INFO logid=73074420-1fae-11ed-8bd0-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=XrdFstOfsFile:214
220819 13:02:43 time=1660906963.317837 func=open level=INFO logid=73074420-1fae-11ed-8bd0-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=XrdFstOfsFile:534
220819 13:02:43 time=1660906963.317862 func=fileOpen level=INFO logid=730a4bca-1fae-11ed-a4f7-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=LocalIo:70
220819 13:02:43 time=1660906963.318056 func=open level=INFO logid=73074420-1fae-11ed-8bd0-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=XrdFstOfsFile:710
220819 13:02:43 time=1660906963.318070 func=stat level=INFO logid=73074420-1fae-11ed-8bd0-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8425dee700 source=XrdFstOfsFile:1222
220819 13:02:43 time=1660906963.655045 func=Report level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84397fd700 source=Report:49
220819 13:03:32 time=1660907012.315962 func=open level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:135
220819 13:03:32 time=1660907012.316164 func=ProcessCapOpaque level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:2645
220819 13:03:32 time=1660907012.316207 func=open level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:214
220819 13:03:32 time=1660907012.316335 func=open level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:534
220819 13:03:32 time=1660907012.316361 func=fileOpen level=INFO logid=903edfb2-1fae-11ed-b7da-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=LocalIo:70
220819 13:03:32 time=1660907012.316612 func=open level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:710
220819 13:03:32 time=1660907012.316638 func=stat level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:1222
220819 13:03:32 time=1660907012.375104 func=VerifyChecksum level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:3516
220819 13:03:32 time=1660907012.375172 func=VerifyChecksum level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:3516
220819 13:03:32 time=1660907012.376028 func=VerifyChecksum level=INFO logid=903e83a0-1fae-11ed-a7b4-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8452ffd700 source=XrdFstOfsFile:3516
220819 13:03:32 time=1660907012.876884 func=Report level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84397fd700 source=Report:49
220819 13:03:43 time=1660907023.320115 func=MgmSyncer level=INFO logid=static.............................. unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f8436bfe700 source=MgmSyncer:94
220819 13:03:43 time=1660907023.321539 func=_rem level=INFO logid=109f33b0-1fae-11ed-9194-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84399fe700 source=XrdFstOfs:1242
220819 13:06:14 time=1660907174.076761 func=open level=INFO logid=f0a91be2-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:135
220819 13:06:14 time=1660907174.076876 func=ProcessCapOpaque level=INFO logid=f0a91be2-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:2645
220819 13:06:14 time=1660907174.076903 func=open level=INFO logid=f0a91be2-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:214
220819 13:06:14 time=1660907174.076990 func=LocalGetFmd level=CRIT logid=CommonFmdDbMapHandler unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=FmdDbMap:411
220819 13:06:14 time=1660907174.079113 func=LocalGetFmd level=CRIT logid=CommonFmdDbMapHandler unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=FmdDbMap:411
220819 13:06:14 time=1660907174.079180 func=LocalGetFmd level=CRIT logid=CommonFmdDbMapHandler unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=FmdDbMap:411
220819 13:06:14 time=1660907174.079227 func=ResyncDisk level=ERROR logid=CommonFmdDbMapHandler unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=FmdDbMap:893
220819 13:06:14 time=1660907174.079240 func=open level=ERROR logid=f0a91be2-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:423
220819 13:06:14 time=1660907174.079245 func=open level=ERROR logid=f0a91be2-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:433
220819 13:06:14 time=1660907174.083127 func=open level=INFO logid=f0aa39f0-1fae-11ed-9d59-02420afe3b07 unit=fst@ctaeos.ar-eos5-4385568git70a4df35-zugs.svc.cluster.local:1095 tid=00007f84201c0700 source=XrdFstOfsFile:135
[... several hundred more not shown ...]
the two sets start with similar entries, until CRIT level message from LocalGetFmd for the BAD file. I think it may be a race condition inside EOS fst between removing a file (_rem) and the periodic synchronisation of filemetadata synchronisation with mgm (MgmSyncer). Will try to understand if this is the case and show the sequence or try to propose a change on EOS tracker.
I'm not sure if this is actually important - or so rare and unlikely-to happen in production that it is noteworthy or not. But might be helpful to understand or fix it can be (relatively quickly) understood.