gfal-copy fails to complete the data transfer but returns exit code 0
Hello,
in the CMS central production system, over the last few months we have seen a few cases where files are not completely written to the site storage, yet the gfal-copy
command still returns exit status 0.
The environment in which we have seen it (not necessarily the only one) is:
- using a Singularity container based on RHEL8
- gfal version is:
Singularity> gfal2_version
GFAL-client-2.21.4
- the problem shows up at multiple sites
- the command executed is:
gfal-copy -t 2400 -T 2400 -p -K adler32 sourcePFN targetPFN
The error we get in the logs is:
gfal-copy error: 256 (Unknown error 256) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 500 : Unexpected server error: 500 (destination)
while the gfal-copy command itself reports that everything is okay (exit status 0).
I wonder whether this is actually an issue with gfal-utils, or whether it happens just because I am not using the --abort-on-failure
option mentioned in the help pages. I would deeply appreciate any advice that you might give on this.
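As a stopgap while this is investigated, a wrapper could treat the error line in the captured output as a failure regardless of the exit status. The following is only a minimal sketch: the function name copy_succeeded and the log file copy.log are hypothetical, and it assumes that a failed transfer always prints a line containing "gfal-copy error:" as in the log excerpt above, which I have not verified for every failure mode.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a gfal-copy run really succeeded.
#   $1 = exit status reported by gfal-copy
#   $2 = file holding the captured stdout/stderr of that run
copy_succeeded() {
    status="$1"
    logfile="$2"

    # A non-zero exit status is always treated as a failure.
    [ "$status" -eq 0 ] || return 1

    # Assumption: a failed transfer logs a line containing "gfal-copy error:",
    # even when the exit status is 0 (the behaviour described in this report).
    if grep -q "gfal-copy error:" "$logfile"; then
        return 1
    fi

    return 0
}

# Possible usage (sourcePFN/targetPFN are placeholders, as in the command above):
#   gfal-copy -t 2400 -T 2400 -p -K adler32 "$sourcePFN" "$targetPFN" > copy.log 2>&1
#   copy_succeeded "$?" copy.log || echo "transfer failed" >&2
```

This only detects the symptom in the logs; it does not explain why the exit status is 0 in the first place.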
Thanks, Alan.
PS: it's a long discussion, but in case you need further details and better context, most of it has been covered in this GH issue: https://github.com/dmwm/WMCore/issues/11556#issuecomment-1561740807