Add URL of disk in disk write thread messages
Following operations#1476 (which has a cryptic title), I volunteered to open this dev ticket.
In that operations ticket, one file system in eoscta was in a fishy state: healthy enough for the MGM to route disk writes to it, but not healthy enough for those writes to succeed.
We need the URL of the failing disk (eosctafst0110.cern.ch:1110 in this case) in the logs, so that we can correlate failed writes with eoscta file systems in bad states.
Here is the current misleading error we are getting:
{
"epoch_time": 1723355903.439226260,
"local_time": "2024-08-11T07:58:23+0200",
"hostname": "tpsrv474",
"program": "cta-taped",
"log_level": "ERROR",
"pid": "3695675",
"tid": "3695953",
"message": "File writing to disk failed",
"drive_name": "IBMLIB4-TS1170-F02C2R4",
"instance": "ctaproduction",
"sched_backend": "cephProductionUser",
"thread": "DiskWrite",
"tapeDrive": "IBMLIB4-TS1170-F02C2R4",
"tapeVid": "I00942",
"mountId": "1412895",
"vo": "ATLAS",
"tapePool": "r_atlas_raw",
"threadCount": 10,
"threadID": 0,
"fileId": 4585091197,
"dstURL": "root://eosctaatlas.cern.ch//eos/ctaatlas/archive/grid/atlas/rucio/raw/data23_13p6TeV/physics_Main/00453858/data23_13p6TeV.00453858.physics_Main.daq.RAW/data23_13p6TeV.00453858.physics_Main.daq.RAW._lb0281._SFO-19._0004.data?eos.lfn=fxid:100daabb0&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=retrieve&oss.asize=798624572",
"fSeq": 15530,
"errorMessage": "In XrootWriteFile::XrootWriteFile failed XrdCl::File::Open() on root://eosctaatlas.cern.ch//eos/ctaatlas/archive/grid/atlas/rucio/raw/data23_13p6TeV/physics_Main/00453858/data23_13p6TeV.00453858.physics_Main.daq.RAW/data23_13p6TeV.00453858.physics_Main.daq.RAW._lb0281._SFO-19._0004.data?eos.lfn=fxid:100daabb0&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=retrieve&oss.asize=798624572 [ERROR] Server responded with an error: [3010] Opening relative path 'puli/Ri/zb+xEjwDfAbiyvsi/KJ0rF7v9DYXsvN+VijJZm30l+41yYFi8PrvB65nEDX0cn+1MW6fAflEx/I2vy/ShOIdRZwAt1sPRwksHZMruYEiGm3xGbcAWf+fOcd9VjpaPlsSR2lng7hn+1MANyut/08UYmgtnLvHqIsiso9Ut04ZHUq+eqQucflEdQYOt0cTKrBVLcDtxUNCjBGyvOxWKe7Vz2/55VcFD5Xjnzh5NbpCBv8mDMO4d7vgCqygcA9RoQuM1cZe6AsiieilCgXxz+JJAu6ifS5HsAsVQIngCz6TJywwNNearJbaZ9TzWiZI9DoNAxtygBoRFSBiC8tssNtE8jwdEDA4FejpoVtrFnyhHOQs8lWits69Gi0S2rYK/G5KBXbGlPzukMUMYexPJB4zCrGnWq+Bhn1HdB2e+rEs6GyLpO54xEQ6ym7WT3BmHIb1IuqoVZbqPvRDgo7r1aqN2uCsKO27Mx+zFuhMpPiyqPPln+ptGmaR01H2m+XUXomcvQanO0kgQzf1twjmJ/oE/wencOLv61HeJrmwNkGiM2LZw0fQJwXD5TOWBzgLbqQZrVji7+8XLgG0GingadoO17defCUfNEf9UjkADTwszqfb6UD2C6TfnvLBBv+WYJCOx0w==&mgm.logid=b6cc00f6-57a6-11ef-a74f-b8599f55d950&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=100daabb0&mgm.event=sync::closew&mgm.workflow=retrieve_written&mgm.instance=eosctaatlas&mgm.owner_uid=10763&mgm.owner_gid=1307&mgm.requestor=root&mgm.requestorgroup=root&mgm.attributes=Q1RBX1N0b3JhZ2VDbGFzcz1taWdyYXRpb247OztDVEFfVGFwZUZzSWQ9NjU1MzU7OztzeXMuYXJjaGl2ZS5lcnJvcj07OztzeXMuYXJjaGl2ZS5maWxlX2lkPTQ1ODUwOTExOTc7OztzeXMuYXJjaGl2ZS5zdG9yYWdlX2NsYXNzPWF0bGFzX3Jhdzs7O3N5cy5jdGEuYXJjaGl2ZS5vYmplY3RzdG9yZS5pZD07OztzeXMuY3RhLm9iamVjdHN0b3JlLmlkPVJldHJpZXZlUmVxdWVzdC1Gcm9udGVuZC1jdGFwcm9kdWN0aW9uZnJvbnRlbmQxMi5jZXJuLmNoLTEyMzE0Ni0yMDI0MDgwMy0xOTo1ODo0NS0wLTQyNjMyOTY7OztzeXMuZW9zLmJ0aW1lPTE2ODYwOTU0OTIuNzU3MzA0MzM2&eos.clientinfo=zbase64:MDAwMDAwNTR4nBXIQQqAMAwF0at4AjELN4UcxrY/GJC2xBTx9prdm+kDLZkzrbQv2lN+WJuE7nKi8hb0d4DzFIGhxhimnSvkmJdH/1sNx
TkR0fYB2IQdQA==:1110/?eos.injection=1&eos.lfn=fxid:100daabb0&eos.rgid=0&eos.ruid=0&eos.space=retrieve&eos.workflow=retrieve_written&oss.asize=798624572&tried=&cap.sym=D1Zd8NdzN2H1jxyTB26rG3S7WzM=&cap.msg=BR6RfvSd1Syr9MdJK4Xf5180BN7tLQ5wX3lNApSKfq&triedrc=srverr' is disallowed. code:400 errNo:3010 status:1",
"readWriteTime": 0.0,
"checksumingTime": 0.0,
"waitDataTime": 86.113383,
"waitReportingTime": 1.986817,
"checkingErrorTime": 0.000008,
"openingTime": 0.0,
"closingTime": 0.0,
"transferTime": 0.0,
"totalTime": 0.0,
"dataVolume": 0,
"globalPayloadTransferSpeedMBps": 0.0,
"diskPerformanceMBps": 0.0,
"openRWCloseToTransferTimeRatio": 0.0
}
The server error says "Opening relative path '...' is disallowed",
but the injection is using eos.lfn=fxid:100daabb0,
so there is no path at all, which means this error is plain wrong and misleading.
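The fxid claim can be checked directly against the dstURL in the log record above. This is a minimal Python sketch using only the standard library; the helper name eos_lfn is made up for illustration and is not part of CTA or EOS.

```python
from urllib.parse import parse_qs, urlsplit

# dstURL copied from the log record above, split over several literals for readability.
DST_URL = (
    "root://eosctaatlas.cern.ch//eos/ctaatlas/archive/grid/atlas/rucio/raw/"
    "data23_13p6TeV/physics_Main/00453858/"
    "data23_13p6TeV.00453858.physics_Main.daq.RAW/"
    "data23_13p6TeV.00453858.physics_Main.daq.RAW._lb0281._SFO-19._0004.data"
    "?eos.lfn=fxid:100daabb0&eos.ruid=0&eos.rgid=0&eos.injection=1"
    "&eos.workflow=retrieve_written&eos.space=retrieve&oss.asize=798624572"
)


def eos_lfn(url):
    """Return the eos.lfn query parameter of an XRootD URL, or None if absent.

    Hypothetical helper, for illustration only.
    """
    query = urlsplit(url).query
    values = parse_qs(query).get("eos.lfn")
    return values[0] if values else None


print(eos_lfn(DST_URL))  # fxid:100daabb0
```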
Operations needs to be able to correlate failed disk writes when they happen, and for this we need the disk write URL.
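To illustrate the kind of correlation this would enable, here is a rough Python sketch that counts failed disk writes per disk. It assumes one JSON record per log line, and the field name diskURL is purely hypothetical: it stands in for whatever dedicated field this ticket ends up adding. The "thread" and "message" values are taken from the record above.

```python
import json
from collections import Counter

FAILED_WRITE_MESSAGE = "File writing to disk failed"  # "message" value in the record above
DISK_URL_FIELD = "diskURL"  # HYPOTHETICAL name for the dedicated field requested here


def failures_per_disk(log_lines):
    """Count failed DiskWrite records per disk URL.

    Assumes one JSON object per line; lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        if record.get("thread") != "DiskWrite":
            continue
        if record.get("message") != FAILED_WRITE_MESSAGE:
            continue
        # Records written before the field exists are grouped under "<unknown>".
        counts[record.get(DISK_URL_FIELD, "<unknown>")] += 1
    return counts
```

With such a field in place, a file system in a bad state would show up as one disk URL collecting most of the failures.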
This string is in the message: eos.clientinfo=zbase64:MDAwMDAwNTR4nBXIQQqAMAwF0at4AjELN4UcxrY/GJC2xBTx9prdm+kDLZkzrbQv2lN+WJuE7nKi8hb0d4DzFIGhxhimnSvkmJdH/1sNxTkR0fYB2IQdQA==:1110
It seems to be correlated with the FS URL, as it carries the same port (1110), but I have no idea how it was encoded...
Maybe it already contains what we need, but we need the disk URL in clear text in a dedicated field in the logs anyway.
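One hint about the encoding: the first 16 base64 characters of the payload (MDAwMDAwNTR4nBXI) decode to the ASCII text 00000054 followed by the bytes 0x78 0x9c, which is a common zlib header. That suggests, as an assumption only, a layout of 8 hex digits giving the uncompressed length followed by a zlib stream, all base64-encoded. The Python sketch below decodes under that assumed layout; it has not been checked against the EOS sources, and all function names are made up for illustration.

```python
import base64
import re
import zlib

# Token as it appears in the errorMessage: eos.clientinfo=zbase64:<base64>:<port>
CLIENTINFO_RE = re.compile(r"eos\.clientinfo=zbase64:([A-Za-z0-9+/=]+):(\d+)")


def find_clientinfo(error_message):
    """Return (base64_payload, port) from an error message, or None if absent."""
    match = CLIENTINFO_RE.search(error_message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def zbase64_decode(payload_b64):
    """Decode a payload under the ASSUMED layout: base64(8 hex digits + zlib stream)."""
    raw = base64.b64decode(payload_b64)
    expected_len = int(raw[:8].decode("ascii"), 16)
    data = zlib.decompress(raw[8:])
    if len(data) != expected_len:
        raise ValueError("length prefix does not match decompressed size")
    return data


def zbase64_encode(data):
    """Inverse of zbase64_decode under the same assumption (only for self-checks)."""
    raw = f"{len(data):08x}".encode("ascii") + zlib.compress(data)
    return base64.b64encode(raw).decode("ascii")


# The visible prefix of the real payload supports the assumed layout.
print(base64.b64decode("MDAwMDAwNTR4nBXI")[:10])  # b'00000054x\x9c'
```

If the assumption holds, passing the full payload from the message to zbase64_decode should yield 0x54 = 84 bytes of client information. Whether those bytes contain the FST host name is not something this ticket can confirm, which is one more reason to log the disk URL in clear text.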