Investigation of the eviction test failures
Link to JIRA ticket: https://its.cern.ch/jira/projects/EOS/issues/EOS-6112
Summary
Recently another test started failing frequently; in combination with the failure in retrieve_queue_cleanup.sh
#616, it can take 6 or more executions of the client systest to complete successfully...
Some context: when refactoring the system tests I was using -I and -n X at the same time in xargs.
-I replaces exactly one input argument, while -n X allows the command to take up to X input arguments at a time. These flags are mutually exclusive, so xargs
was ignoring the -n option and issuing one request to the MGM per file. It was failing from time to time (although, if I am not mistaken, this was also failing before the change, but I cannot prove it; it just happened in CI with the previous configuration link to job). Recently, without any other changes to the code involved, the number of failures in this part of the test has become outrageous. For now, we will move back to 40 files per request, as in the past, and see whether this fixes or improves the problem.
Problem description
The problem appears during the eviction stage. The command itself does not return an error, but the request fails on the EOS side.
This is what we can see in the logs.
WFE (note that the "Failed to issue evict" message is logged as INFO, not as ERROR):
240402 13:54:14 INFO WFE:1620 default SYNC::EVICT_PREPARE /eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359 ctafrontend:10955 fxid=0000111d mgm.reqid="*"
240402 13:54:14 INFO WFE:2182 fxid=111d file=/eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359 msg="Failed to issue evict for evict_prepare event"
The same WFE message can also be seen for other files, but those appear as evicted in the MGM:
[poliverc@hp-eb-poliverc mgm]$ cat WFE.log | grep 'Failed to issue'
240402 13:22:07 INFO WFE:2182 fxid=15b file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/00331 msg="Failed to issue evict for evict_prepare event"
240402 13:22:11 INFO WFE:2182 fxid=856 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/02129 msg="Failed to issue evict for evict_prepare event"
240402 13:22:11 INFO WFE:2182 fxid=8e9 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/02239 msg="Failed to issue evict for evict_prepare event"
240402 13:22:20 INFO WFE:2182 fxid=1ad4 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/06847 msg="Failed to issue evict for evict_prepare event"
240402 13:22:21 INFO WFE:2182 fxid=1bd6 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/07093 msg="Failed to issue evict for evict_prepare event"
240402 13:22:22 INFO WFE:2182 fxid=1ec6 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/07859 msg="Failed to issue evict for evict_prepare event"
240402 13:22:24 INFO WFE:2182 fxid=2212 file=/eos/ctaeos/preprod/e6a1e2ce-ee56-4f1e-a559-ed8076ff167e/0/08712 msg="Failed to issue evict for evict_prepare event"
xrdlog.mgm
240402 13:54:14 ERROR [12001/01200] poweruser1 ::triggerPrepareWorkflow Unable to prepare - synchronous prepare workflow error /eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359; Unable to trigger workflow - synchronous workflow failed /eos/ctaeos/preprod/1cf1bf0a-9bc5-4aa5-a7f8-41ec31f370a8/0/04359; Resource temporarily unavailable
I managed to reproduce the error on my dev machine; this is what I can see for the file that failed:
240411 10:59:07 INFO WFE:2183 fxid=18f6 file=/eos/ctaeos/preprod/8055aa8c-515e-4e1a-9eef-91e4b4a9c5b4/0/06367 msg="Failed to issue evict for evict_prepare event"
[root@ctaeos /]# eos fileinfo /eos/ctaeos/preprod/8055aa8c-515e-4e1a-9eef-91e4b4a9c5b4/0/06367
File: '/eos/ctaeos/preprod/8055aa8c-515e-4e1a-9eef-91e4b4a9c5b4/0/06367' Flags: 0644
Size: 15360
Status: healthy
Modify: Thu Apr 11 10:28:57 2024 Timestamp: 1712824137.511384000
Change: Thu Apr 11 10:57:28 2024 Timestamp: 1712825848.391826037
Access: Thu Apr 11 10:28:57 2024 Timestamp: 1712824137.432184252
Birth: Thu Apr 11 10:28:57 2024 Timestamp: 1712824137.432183827
CUid: 11001 CGid: 1100 Fxid: 000018f6 Fid: 6390 Pid: 36 Pxid: 00000024
XStype: adler XS: c1 a8 fa 58 ETAGs: "1715302563840:c1a8fa58"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d1::t1
#Rep: 2
TapeID: 4294973737 StorageClass: ctaStorageClass
no.  fs-id  host                          schedgroup  path             boot    configstatus  drain    active  geotag
0    65535  localhost                     tape.0      /does_not_exist          off           nodrain
1    1      ctaeos.cta.svc.cluster.local  default.0   /fst             booted  rw            nodrain  online  flat
*******
The file is still on the FST:
[root@ctaeos /]# ls -l /fst/00000000/000018f6
-rw-------. 1 daemon daemon 15360 Apr 11 10:58 /fst/00000000/000018f6
The root cause of this problem seems to be in _dropallstripes.
The debug logs did not show the debug error messages in this part of the code because something further down the stack is modifying the value of the errno
variable.
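The errno pitfall has a direct analogue in shell: the exit status $?, like errno, is overwritten by the very next command, so it must be captured immediately after the call that failed. A minimal illustration of the same class of bug:

```shell
#!/bin/sh
# Analogy to the errno problem: $? (like errno) is reset by the
# very next command, so reading it "too late" loses the real error.

false                                 # the "failing call" (exit status 1)
echo "doing other work" > /dev/null   # unrelated command resets $?
echo "status read too late: $?"       # prints 0 -- the error is lost

false                                 # fail again
rc=$?                                 # capture immediately
echo "status captured in time: $rc"   # prints 1
```

In the C++ code path the fix is analogous: save errno into a local variable right after the failing call, before any logging or cleanup code runs.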