CI environment explodes if `cta-frontend` crashes or produces too many errors
This ticket is about solving the root cause of CTA#1030.
When instantiating CTA@4907767543b7860cf64eece3d315dfa32235b3e5 with:

```
./run_systemtest.sh -n toto -s tests/archive_retrieve.sh -d internal_postgres.yaml
```

running `cta-admin dr ls` crashes the CTA frontend:

```
[root@ctadevmiguel orchestration]# kubectl -n toto exec ctacli -- cta-admin dr ls
210907 11:18:30 604 ssi_Pb::Request: pid:591 tid:140200244033280 ProcessResponseData(): fatal error from XRootD framework
[ERROR] Socket error
library drive host desired request status since vid tapepool vo files data MB/s session priority activity age reason disk_system_name reserved_bytes
210907 11:18:30 604 ssi_Shutdown: Unprovision: /ctafrontend@ctafrontend.toto.svc.cluster.local:10955 error; 14 [ERROR] Invalid session
```
I reproduced it on two different Kubernetes boxes on the same commit.
The frontend crashes and automatically generates core dumps and backtraces in the logs persistent volume allocated for the namespace. This is not a small amount of data, and the logs PV has a max size of 2 GiB.
Later on, all `xrdcp` commands from the client pod will fail, and each of the 10k client archives and 10k client retrieves will redirect Dump-level logs to `ERROR_DIR`, defined like this:

```
# Create directory for xrootd error reports
ERROR_DIR="/dev/shm/$(basename ${EOS_DIR})"
```

Its location is reported in the logs of the run/create instance command: `echo "$(date +%s): ERROR_DIR=${ERROR_DIR}"`. The second issue is that `/dev/shm` is in memory: Dump-level messages can eat quite a lot of space there, especially if the frontend crashes early. The earlier the crash, the more error dump messages and the more space consumed in `/dev/shm`.
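A minimal sketch of where those reports land and how full the tmpfs gets; the `EOS_DIR` value here is a placeholder, in the real test it is set by the instance scripts:

```shell
# Placeholder value: in the system test EOS_DIR comes from the instance setup
EOS_DIR="/eos/ctaeos/cta"
# Same definition as in the test scripts: error reports go under /dev/shm
ERROR_DIR="/dev/shm/$(basename ${EOS_DIR})"
echo "${ERROR_DIR}"
# Check how full the tmpfs backing /dev/shm currently is (GNU df)
df --output=pcent /dev/shm
```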
At this point things are quite bad, but Kubernetes makes no difference here: the frontend has crashed, and logging this aggressively would kill any CI environment...
We could fix `client_ar.sh` to stop sending too many ERROR messages to `/dev/shm`: its goal is to capture and save client-side error messages on the fly, but if all messages fail because something is broken that badly... there is no point going further...
Reproduce and test
Kill the `ctafrontend` pod (or the `cta-frontend` process in the pod) before the `client_ar.sh` loop is started.
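A sketch of that reproduction step, assuming the namespace and pod names from the session above (`toto`, `ctafrontend`); actual names may differ per instance:

```shell
# Sketch: kill the frontend before the client loop starts.
# Namespace and pod names are taken from the logs above.
kill_frontend() {
  local ns="${1:-toto}"
  # Option 1: delete the whole pod
  kubectl -n "${ns}" delete pod ctafrontend
  # Option 2: kill only the cta-frontend process inside the pod
  # kubectl -n "${ns}" exec ctafrontend -- pkill -9 cta-frontend
}
```

Run `kill_frontend toto` right after the instance is created, then watch the client archive/retrieve loop flood `ERROR_DIR`.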
What would make sense
`client_ar.sh` is filling `/dev/shm` with plenty of large, verbose XRootD debug messages. There is no point continuing if we have too many ERROR reports.
The best solution would be to kill `client_ar` if more than 10% of the total files of the test fail (that is 1k failures for 10k files) or if more than 1/3 of `/dev/shm` is filled with logs.
The first can be implemented in the `client_ar.sh` loop (not so easy given the parallel nature of the ERROR counter/cleanup, but doable using an xargs loop id per job); the second is easier as a companion process in the client container.
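A sketch of the second option as a companion watchdog in the client container; the 1/3 threshold matches the proposal above, while the function and variable names here are made up:

```shell
SHM_LIMIT_PCT=33      # abort when /dev/shm is more than ~1/3 full

shm_usage_pct() {
  # GNU df prints a header line then e.g. " 12%"; keep only the digits
  df --output=pcent /dev/shm | tail -1 | tr -dc '0-9'
}

shm_watchdog() {
  while sleep 10; do
    if [ "$(shm_usage_pct)" -gt "${SHM_LIMIT_PCT}" ]; then
      echo "$(date +%s): /dev/shm over ${SHM_LIMIT_PCT}% full, aborting client_ar.sh"
      pkill -f client_ar.sh   # stop the archive/retrieve loop
      return 1
    fi
  done
}
```

Started in the background (`shm_watchdog &`) alongside `client_ar.sh`, this needs no coordination with the parallel xargs jobs, which is what makes it easier than the per-job failure counter.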