
fix: resolve "Send failures" and improve logging for Remote Disconnected errors

This MR resolves the "(55, 'Send failure')" errors, which typically appear as:

│ /usr/local/venv/lib/python3.10/site-packages/itkdb/eos.py:64 in put │
│ │
│ 61 │ curl.setopt(curl.HEADERFUNCTION, buffer_header.write) │
│ 62 │ curl.setopt(curl.WRITEFUNCTION, buffer_body.write) │
│ 63 │ curl.setopt(curl.SEEKFUNCTION, fpointer.seek) │
│ ❱ 64 │ curl.perform() │
│ 65 │ curl.close() │
│ 66 │ │
│ 67 │ resp_header = buffer_header.getvalue().decode() │

error: (55, 'Send failure: Connection reset by peer')

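Fundamentally, cURL (and libcurl / pycurl) does the right thing here, following the documentation for Expect: 100-continue. When a request carries the "Expect: 100-continue" header, cURL sends only the request headers, waits for the server's interim "100 Continue" response, and only then streams the body. However, there are cases where this handshake never happens. In the failing log below, the PUT goes out without an "Expect: 100-continue" header (recent libcurl versions only add it automatically for larger request bodies, and this one is only 710409 bytes). cURL takes that to mean it should just send the data to the load-balancer, which immediately aborts the connection ("hangs up the phone").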
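The handshake can be modelled as a small decision the client makes after sending the request headers. The sketch below is illustrative only: it is not how cURL is implemented, and status_code / next_step are hypothetical names. It follows the Expect: 100-continue rules from RFC 9110 and shows why, in the successful log further down, the 307 redirect from the load-balancer arrives before any body bytes are sent.

```python
from __future__ import annotations


def status_code(status_line: str) -> int:
    """Return the numeric code from an HTTP/1.x status line such as 'HTTP/1.1 100 Continue'."""
    parts = status_line.strip().split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return int(parts[1])


def next_step(first_response_line: str | None) -> str:
    """Decide what a client that sent 'Expect: 100-continue' does next.

    first_response_line is the first status line received from the server, or
    None if nothing arrived before the client's wait expired (libcurl waits
    1 second by default, configurable via CURLOPT_EXPECT_100_TIMEOUT_MS).
    """
    if first_response_line is None:
        # No answer in time: libcurl falls back to sending the body anyway.
        return "send body"
    code = status_code(first_response_line)
    if code == 100:
        # Interim "100 Continue": the server is ready for the body.
        return "send body"
    if 300 <= code < 400:
        # e.g. the 307 from EOS in the successful log: repeat the request
        # (headers first) at the new location instead of sending the body here.
        return "follow redirect"
    # Any other final status (401, 413, 5xx, ...): the body is never sent.
    return "stop"
```

Without the Expect header there is no such decision point at all: the body follows the headers immediately, which is exactly what the failing log shows.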

This fix tells cURL to always send "Expect: 100-continue" and wait for the interim "100 Continue" response before sending the body, regardless of the upload size. This should be safe here because cURL is only used for EOS, and the EOS endpoint is always a load-balancer that should respond with "100 Continue".
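A minimal sketch of this kind of fix with pycurl, assuming an upload routine shaped like the put() in itkdb/eos.py from the traceback above. The helper names eos_upload_headers and eos_put, and their parameters, are hypothetical, and the actual change in this MR may differ. The relevant part is passing an explicit "Expect: 100-continue" through HTTPHEADER, which libcurl honors for any body size.

```python
from __future__ import annotations

import io
from typing import BinaryIO


def eos_upload_headers(token: str, content_type: str = "application/zip") -> list[str]:
    """Request headers for an EOS PUT (hypothetical helper)."""
    return [
        f"Authorization: Bearer {token}",
        f"Content-Type: {content_type}",
        # Always ask for the 100-continue handshake. libcurl only adds this
        # header on its own for larger bodies, so small uploads would
        # otherwise start streaming before the load-balancer has answered.
        "Expect: 100-continue",
    ]


def eos_put(url: str, fpointer: BinaryIO, size: int, token: str,
            cafile: str | None = None) -> tuple[int, str]:
    """PUT fpointer to url with pycurl; return (HTTP status, response headers)."""
    import pycurl  # imported here so the header helper above works without pycurl

    buffer_header = io.BytesIO()
    buffer_body = io.BytesIO()  # response body is collected but not returned

    def rewind(offset: int, origin: int) -> int:
        # libcurl seeks back when it has to re-send the body (e.g. after the
        # 307 redirect); origin uses the same values as io.SEEK_SET/CUR/END.
        fpointer.seek(offset, origin)
        return 0  # CURL_SEEKFUNC_OK

    curl = pycurl.Curl()
    try:
        curl.setopt(curl.URL, url)
        curl.setopt(curl.UPLOAD, 1)  # an HTTP upload is sent as a PUT
        curl.setopt(curl.READDATA, fpointer)
        curl.setopt(curl.INFILESIZE_LARGE, size)  # becomes Content-Length
        curl.setopt(curl.FOLLOWLOCATION, True)  # follow the 307 from the load-balancer
        curl.setopt(curl.HTTPHEADER, eos_upload_headers(token))
        curl.setopt(curl.HEADERFUNCTION, buffer_header.write)
        curl.setopt(curl.WRITEFUNCTION, buffer_body.write)
        curl.setopt(curl.SEEKFUNCTION, rewind)
        if cafile is not None:
            curl.setopt(curl.CAINFO, cafile)
        curl.perform()
        status = curl.getinfo(curl.RESPONSE_CODE)
    finally:
        curl.close()
    return status, buffer_header.getvalue().decode()
```

FOLLOWLOCATION is included because the successful log shows a 307 before the 100 Continue. Note that libcurl does not forward a custom Authorization header to a different host on redirect unless CURLOPT_UNRESTRICTED_AUTH is set, so how EOS authenticates the redirected request is an assumption in this sketch.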

The failing example looks like:

* Host eosatlas.cern.ch:443 was resolved.
* IPv6: 2001:1458:301:1b::100:1c
* IPv4: 128.142.52.31
* Trying [2001:1458:301:1b::100:1c]:443...
* Immediate connect fail for 2001:1458:301:1b::100:1c: Network is unreachable
* Trying 128.142.52.31:443...
* ALPN: curl offers h2,http/1.1
* CAfile: /usr/local/venv/lib/python3.10/site-packages/itkdb/data/CERN_chain.pem
* CApath: none
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server did not agree on a protocol. Uses default.
* Server certificate:
* subject: DC=ch; DC=cern; OU=computers; CN=eosatlas-ns-ip700.cern.ch
* start date: Mar 24 06:59:30 2025 GMT
* expire date: Apr 28 06:59:30 2026 GMT
* subjectAltName: host "eosatlas.cern.ch" matched cert's "eosatlas.cern.ch"
* issuer: DC=ch; DC=cern; CN=CERN Grid Certification Authority
* SSL certificate verify ok.
* Certificate level 0: Public key type RSA (4096/152 Bits/secBits), signed using sha512WithRSAEncryption
* Certificate level 1: Public key type RSA (4096/152 Bits/secBits), signed using sha512WithRSAEncryption
* Connected to eosatlas.cern.ch (128.142.52.31) port 443
* using HTTP/1.x
> PUT /eos/atlas/atlascerngroupdisk/det-itk/prod-db/f/b/9/fb906cf72494ca90fdad4aa47b644123 HTTP/1.1
Host: eosatlas.cern.ch
Accept: */*
Authorization: Bearer zteos64:MDAwMDAyZjh4nO2Rr0sEQRTHvYNTMIkWEcNDDJa9nZmbm7196bJRTJ4XZnZmz-V0Z5mdRRAMFv8Ai_-GTTCJwSoYbBZBxB9FMAiCeIrBYDb5wgsfvt8v7_GdvK1P1t321Mnh09n4XCPzw0Iv1HeKpRpbDY0tQ-k35fdOjMsHzlaFzsphqI0PRvKwcFYHWoVpqMI4TFVMRJJGjMc8kTFJtdRcSh4pwTllreVFpgQVSYcEbdqKA0pTEiiqo0BylRIaxSJi8cpFrbdaGViucmAcGEPGkbSAEdbuQ5VppJyKDu193duHwSdpkai3U_TBZ9rkHp21vslHgVQIjDjt9hDT0WCbNikVzbjTpKQPudwy-BUDOkcYfeNx6FQbZFEgbNjS4-9GbbdkluMPMjAWoay0RQLeycQg2FyZDbmZ4jScz1-uz0xcH7jZu6PTWoPtPpZXb8e2uz_2fr_nXtbm_pv4kybg-eb1ofEBPRntgw%3d%3d
User-Agent: itkdb/0.6.15
Content-Type: application/zip
Content-Length: 710409

* Send failure: Connection reset by peer
* OpenSSL SSL_write: OpenSSL/3.4.1: error:80000068:system library::Connection reset by peer, errno 104
* closing connection #0

A successful upload, by contrast, looks like:

$ python try_eos_upload.py
* Host eosatlas.cern.ch:443 was resolved.
* IPv6: 2001:1458:301:1b::100:1c
* IPv4: 128.142.52.31
*   Trying [2001:1458:301:1b::100:1c]:443...
* Connected to eosatlas.cern.ch (2001:1458:301:1b::100:1c) port 443
* ALPN: curl offers h2,http/1.1
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* ALPN: server did not agree on a protocol. Uses default.
* Server certificate: eosatlas-ns-ip700.cern.ch
* Server certificate: CERN Grid Certification Authority
* using HTTP/1.x
> PUT //eos/atlas/atlascerngroupdisk/det-itk/prod-db/f/b/9/fb906cf72494ca90fdad4aa47b644123 HTTP/1.1
Host: eosatlas.cern.ch
Accept: */*
Authorization: Bearer zteos64:MDAwMDAyZmF4nO2RvUoDQRSFScCAKUSsJNVFLCzc7Mzu7M_cysbCNEIQREKQ2ZnZZEmys2wmCHkF38B3sBbETpt0PoCFhVgrWqSwMAYLC2srT3GLj3Mu93Lqz9V6tTxbn11e3dYaK5kdFGqrOi12Kt6Rq83YFXYovqfUZd4rzaRQ2XjgKm2dhd0tSqMclbipm7jcTRNOQplGHuNMCk5SJRQTgkVJyBj1_Na2jGNBQ6GdQIbSoTQlDqeSO4IlKaERDyOPt-8rnWOtoDUZAnAgDH2K1AePeEEXJplCymgY087y3i70vohPos606ILNlM4tlsbYpr8QDTwMQhbvdRDThTCgTUrDJo-blHQhFyONyz2gcoTFOxYHZRKAKAqEvhlb_D2ozEhkOf4gPW0QxhNlkIAthdQIJk90XwxT3IAT57ScvT7uvmu-dv0Qy_nT5l07O7hZ5S_783h-2Piv4m-qgIu384_aJyHg7gE%3d
User-Agent: python/3.12
Content-Type: application/zip
Content-Length: 8076819
Expect: 100-continue

< HTTP/1.1 307 TEMPORARY_REDIRECT
< Connection: Keep-Alive
< Server: XrootD/5.8.3
...
...
...
< HTTP/1.1 100 Continue
< Connection: Close
< Server: XrootD/5.8.3
< Date: Wed, 09 Jul 2025 02:31:15 GMT
<
* upload completely sent off: 8076819 bytes
< HTTP/1.1 201 CREATED
< Connection: Keep-Alive
< Server: XrootD/5.8.3
< Content-Length: 0
< ETag: "1036634903216652288:cc7595f5"
<
* Closing connection

Additionally, when RemoteDisconnected connection errors occur, the resulting traceback is not very helpful. Since requests.RemoteDisconnected errors always carry the requests.Request (or requests.PreparedRequest) object that triggered them, this MR uses itkdb.utils.pretty_print to print information about the failed request to aid further debugging. These errors typically come from PDB.
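As an illustration of the kind of summary this enables, here is a minimal sketch. It does not reproduce itkdb.utils.pretty_print, whose output format is not shown here; summarize_request and log_failed_request are hypothetical helpers that rely only on the method, url, headers and body attributes of a requests.PreparedRequest, and they redact the Authorization header so bearer tokens do not end up in logs.

```python
from __future__ import annotations

import logging
from typing import Any

logger = logging.getLogger(__name__)

# Headers whose values should never be written to logs.
SENSITIVE_HEADERS = {"authorization", "cookie", "proxy-authorization"}


def summarize_request(request: Any) -> str:
    """Describe a requests.PreparedRequest-like object in a log-safe way."""
    if request is None:
        return "<no request attached to exception>"

    method = getattr(request, "method", None) or "?"
    url = getattr(request, "url", None) or "?"
    lines = [f"{method} {url}"]

    headers = getattr(request, "headers", None) or {}
    for name, value in headers.items():
        shown = "<redacted>" if name.lower() in SENSITIVE_HEADERS else value
        lines.append(f"  {name}: {shown}")

    body = getattr(request, "body", None)
    if body is None:
        lines.append("  <no body>")
    elif isinstance(body, (bytes, str)):
        size = len(body.encode() if isinstance(body, str) else body)
        lines.append(f"  <body: {size} bytes>")
    else:
        lines.append("  <streamed body>")  # e.g. a file object or generator
    return "\n".join(lines)


def log_failed_request(exc: BaseException) -> str:
    """Log the request attached to a connection error and return the summary."""
    summary = summarize_request(getattr(exc, "request", None))
    logger.error("%s while sending:\n%s", type(exc).__name__, summary)
    return summary
```

In practice one would call log_failed_request(exc) inside an "except requests.exceptions.ConnectionError as exc:" block and then re-raise, since requests attaches the PreparedRequest to that exception as exc.request.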


Edited by Giordon Holtsberg Stark
