Investigation of the eviction test failures
Link to JIRA ticket: https://its.cern.ch/jira/projects/EOS/issues/EOS-6112
Summary
Recently another test started failing frequently; in combination with the failure in retrieve_queue_cleanup.sh
#616, it can take 6 or more executions of the client systest to complete successfully...
Some context: when refactoring the system tests I was using -I and -n X at the same time in xargs.
-I replaces exactly one input argument, while -n X allows the command to take up to X input arguments at a time. These flags are mutually exclusive, so xargs
was ignoring the -n option and issuing one request to the MGM per file. It was failing from time to time (although, if I am not mistaken, this was also failing before the change, but I cannot prove it; it just happened in CI with the previous configuration link to job). Recently, without any other changes to the code involved, the number of failures in this part of the test has become outrageous. For now, we will move back to 40 files per request, as in the past, and see whether this fixes or improves the problem.
Problem description
The problem appears during the eviction stage. The command itself does not return an error, but the request fails on the EOS side.
This is what we can see in the logs.
WFE (note that the "Failed to issue evict" message is logged as INFO, not as ERROR):
240402 13:54:14 INFO WFE:1620 default SYNC::EVICT_PREPARE /eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359 ctafrontend:10955 fxid=0000111d mgm.reqid="*"
240402 13:54:14 INFO WFE:2182 fxid=111d file=/eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359 msg="Failed to issue evict for evict_prepare event"
The same WFE message can also be seen for other files, but those appear as evicted in the MGM:
[poliverc@hp-eb-poliverc mgm]$ cat WFE.log | grep 'Failed to issue'
240402 13:22:07 INFO WFE:2182 fxid=15b file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/00331 msg="Failed to issue evict for evict_prepare event"
240402 13:22:11 INFO WFE:2182 fxid=856 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/02129 msg="Failed to issue evict for evict_prepare event"
240402 13:22:11 INFO WFE:2182 fxid=8e9 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/02239 msg="Failed to issue evict for evict_prepare event"
240402 13:22:20 INFO WFE:2182 fxid=1ad4 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/06847 msg="Failed to issue evict for evict_prepare event"
240402 13:22:21 INFO WFE:2182 fxid=1bd6 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/07093 msg="Failed to issue evict for evict_prepare event"
240402 13:22:22 INFO WFE:2182 fxid=1ec6 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/07859 msg="Failed to issue evict for evict_prepare event"
240402 13:22:24 INFO WFE:2182 fxid=2212 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/08712 msg="Failed to issue evict for evict_prepare event"
xrdlog.mgm
240402 13:54:14 ERROR [12001/01200] poweruser1 ::triggerPrepareWorkflow Unable to prepare - synchronous prepare workflow error /eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359; Unable to trigger workflow - synchronous workflow failed /eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359; Resource temporarily unavailable
I managed to reproduce the error on my dev machine; this is what I can see for the file that failed:
240411 10:59:07 INFO WFE:2183 fxid=18f6 file=/eos/ctaeos/preprod/8055aa8c-515e-4e1a-9eef-91e4b4a9c5b4/0/06367 msg="Failed to issue evict for evict_prepare event"
[root@ctaeos /]# eos fileinfo /eos/ctaeos/preprod/8055aa8c-515e-4e1a-9eef-91e4b4a9c5b4/0/06367
File: '/eos/ctaeos/preprod/8055aa8c-515e-4e1a-9eef-91e4b4a9c5b4/0/06367' Flags: 0644
Size: 15360
Status: healthy
Modify: Thu Apr 11 10:28:57 2024 Timestamp: 1712824137.511384000
Change: Thu Apr 11 10:57:28 2024 Timestamp: 1712825848.391826037
Access: Thu Apr 11 10:28:57 2024 Timestamp: 1712824137.432184252
Birth: Thu Apr 11 10:28:57 2024 Timestamp: 1712824137.432183827
CUid: 11001 CGid: 1100 Fxid: 000018f6 Fid: 6390 Pid: 36 Pxid: 00000024
XStype: adler XS: c1 a8 fa 58 ETAGs: "1715302563840:c1a8fa58"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d1::t1
#Rep: 2
TapeID: 4294973737 StorageClass: ctaStorageClass
no.  fs-id  host                          schedgroup  path             boot    configstatus  drain    active  geotag
0    65535  localhost                     tape.0      /does_not_exist          off           nodrain
1    1      ctaeos.cta.svc.cluster.local  default.0   /fst             booted  rw            nodrain  online  flat
*******
The file is still on the FST:
[root@ctaeos /]# ls -l /fst/00000000/000018f6
-rw-------. 1 daemon daemon 15360 Apr 11 10:58 /fst/00000000/000018f6
The root cause of this problem seems to be in _dropallstripes.
The debug logs did not show the debug error messages in this part of the code because something further down the stack is modifying the value of the errno
variable.
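The errno pitfall has a direct analogue in shell: the exit status $?, like errno, is overwritten by the very next command, so it must be captured immediately after the call that failed. A minimal illustration of the same class of bug:

```shell
#!/bin/sh
# Analogy to the errno problem: $? (like errno) is reset by the
# very next command, so reading it "too late" loses the real error.

false                                 # the "failing call" (exit status 1)
echo "doing other work" > /dev/null   # unrelated command resets $?
echo "status read too late: $?"       # prints 0 -- the error is lost

false                                 # fail again
rc=$?                                 # capture immediately
echo "status captured in time: $rc"   # prints 1
```

In the C++ code path the fix is analogous: save errno into a local variable right after the failing call, before any logging or cleanup code runs.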