Running the stress test manually is currently quite painful. We should streamline this process by making the stress test a (manually triggered) CI pipeline job. Additionally, we should think about how we handle the monitoring chart...
Sadly, the current stress test setup is running into some XRootD-related errors. This is most likely due to a misconfiguration on the EOS side. For now, to get the stress test running, I have checked out an older version of our repo so that we can use the old setup (but with the latest image).
In the meantime, I am looking into why this is failing.
In theory, the current stress test should work (provided we don't run into issues with how the machine is configured):
- We are now using the alma9 version of EOS 5.2.27
- Fixed a number of bugs in the stress test-related scripts
- Added some missing settings to the EOS MGM configuration
Let's see how this goes now. In any case, for the next release we can already deploy to preproduction, as this version is only needed for the repack dual-copy fix.
It seems like it stopped about halfway through for some reason. cta-admin sq does not show anything, so nothing is being queued anymore, and htop also shows very little activity. At first glance, the logs are not providing any more information.
EOS is showing this though, which is a bit suspicious:
From the monitoring plot it seems that the queueing stopped at 1.2M files and the drives then simply processed what was left in the queue. I would check the space configuration of the eos mgm and fst charts and make sure there is enough space to store the files we need.
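As a quick sanity check, the space situation could be inspected directly on the MGM. This is only a sketch: the pod name is taken from the prompts further down, and the exact kubectl invocation (namespace, context) is an assumption.

```bash
# List the configured EOS spaces and the per-filesystem usage to confirm
# there is enough capacity left for the files the stress test still has to write.
kubectl exec -it eos-mgm-0 -- eos space ls
kubectl exec -it eos-mgm-0 -- eos fs ls
```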
Another observation: in the eos-fst-0 pod there is this line reporting in a loop:
```
Events:
  Type     Reason                           Age                   From     Message
  ----     ------                           ----                  ----     -------
  Warning  FailedToRetrieveImagePullSecret  90s (x349 over 7h7m)  kubelet  Unable to retrieve some image pull secrets (reg-ctageneric); attempting to pull the image may not succeed.
```
This FailedToRetrieveImagePullSecret is to be expected. We provide multiple Kubernetes secrets to pull the image (as we did not have a consistent naming standard a while back), but not all of them necessarily exist. Provided one of them works, it is okay. This only affects image pulling; it does not affect runtime behaviour.
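If we want to confirm which of the configured pull secrets actually exist, a quick check could look like the sketch below (the secret name prefix is taken from the warning above; everything else is an assumption about the setup):

```bash
# Show which pull secrets the pod references, then which of them exist in the namespace.
kubectl get pod eos-fst-0 -o jsonpath='{.spec.imagePullSecrets[*].name}'; echo
kubectl get secrets | grep reg-
```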
There is a bug in client_ar.sh: apparently, in the last test it stopped after creating only directories 0 through 11:
```
[root@eos-mgm-0 /]# eos ls -ly /eos/ctaeos/preprod/850444a8-4f36-4bb6-9328-83984a950ee6/
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:29 0
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:34 1
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12900000 Jan 30 14:21 10
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12900000 Jan 30 15:14 11
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:39 2
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:44 3
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:49 4
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:55 5
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:00 6
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:05 7
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:10 8
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:16 9
```
After filling the 11th one, it tried to put the drives up a second time (the bug: `if (( (subdir + 1) > 10 )); then` should be `if (( subdir == 10 )); then`) and possibly hung there while the drives processed the backlog, which is why the script did not output any more log lines. We could simply fix this and relaunch the script from subdir=12 to confirm the hypothesis.
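For reference, a minimal sketch of the corrected guard; the surrounding loop structure and helper names are assumptions, not the actual client_ar.sh code:

```bash
for ((subdir = 0; subdir < NB_DIRS; subdir++)); do
  queue_files_for_subdir "${subdir}"   # hypothetical helper launching the parallel xrdcp jobs
  if (( subdir == 10 )); then          # was: if (( (subdir + 1) > 10 )); then
    put_drives_up                      # hypothetical helper; must run exactly once
  fi
done
```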
After relaunching from subdir=12 with the drives down and putting them up at subdir == 13, the queueing stopped again after reaching subdir == 14. The new suspects are the admin_cta kinit commands in the for loop. I removed these and relaunched again from subdir=15.
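Roughly what the loop looks like after this change (a sketch only; the helper name is a placeholder), with the output of the relaunch following below:

```bash
for ((subdir = 15; subdir < NB_DIRS; subdir++)); do
  # The per-iteration kinit for admin_cta that used to be here was removed,
  # since re-authenticating on every iteration was suspected of stalling the loop.
  queue_files_for_subdir "${subdir}"   # hypothetical helper launching the parallel xrdcp jobs
done
```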
```
Copying files to /eos/ctaeos/preprod/c02fd7a0-19d3-45c6-8f84-55f73686cdd5/15 using 40 processes...
Starting to queue the files for the subdir.
Done.
Copying files to /eos/ctaeos/preprod/c02fd7a0-19d3-45c6-8f84-55f73686cdd5/16 using 40 processes...
Starting to queue the files for the subdir.
Putting drives up
Drive VDSTK01: set Up.
Drive VDSTK02: set Up.
Done.
Copying files to /eos/ctaeos/preprod/c02fd7a0-19d3-45c6-8f84-55f73686cdd5/17 using 40 processes...
Starting to queue the files for the subdir.
+ exit 0
```
What puzzles me is that the script should print "Done." after it finishes queueing the files for dir 17, and I would not expect it to exit cleanly with 0.
The problem was that the test file creation did not produce enough data. We (@nbugel) have improved and simplified this logic considerably, and the stress test is now running with no issues.
Taking that back: the problem was none of the above and it is still present. The remaining suspicion is that it has to do with the ObjectStore global lock, which prevents all queueing while the queue is being digested, combined with the xrdcp stream timing out 10 minutes after the last time any data was moved. I am adding XRD_STREAMTIMEOUT=7200 to the xrdcp command to see if this helps.
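Concretely, the change is just prefixing the copy with the XRootD client environment variable; the source file, destination URL and variable names in this sketch are illustrative, not the actual script:

```bash
# Allow up to 2 hours without data movement before the client gives up on the stream,
# instead of the default stream timeout.
XRD_STREAMTIMEOUT=7200 xrdcp "${LOCAL_TEST_FILE}" \
  "root://${EOS_MGM_HOST}//eos/ctaeos/preprod/${TEST_DIR}/${subdir}/${file_name}"
```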
```
[...]
Before check of subdir number 19
Putting drives up
Drive VDSTK01: set Up.
Drive VDSTK02: set Up.
Done.
eos mkdir exit status: 0 \n
Copying files to /eos/ctaeos/preprod/dddb18c8-2091-4dfd-9afe-8a6da9020590/20 using 40 processes...\n
test20
Starting to queue the 50000 files for the subdir 20/40.
+ exit 0
[root@eos-mgm-0 /]# eos ls -ly /eos/ctaeos/preprod/dddb18c8-2091-4dfd-9afe-8a6da9020590
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:14 0
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:17 1
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:40 10
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:42 11
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:45 12
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:48 13
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:50 14
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:53 15
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:55 16
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:58 17
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 02:01 18
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 02:03 19
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:19 2
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 02:27 20
d1::t0 dr-xr-xr-+ 1 user1 eosusers 0 Feb 1 02:27 21
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:22 3
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:24 4
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:27 5
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:29 6
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:32 7
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:35 8
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:37 9
```
I will try to clean up the parameter expansion in the xargs call a bit, add more logging, and also clean up the millions of log files created in the client-0 pod by the xrdcp processes (!), then rerun the queueing.
If this does not help, I will fall back to a while loop and manage the parallel processes without xargs.
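A minimal sketch of that fallback, capping the number of concurrent copies with plain background jobs instead of xargs (all variable names here are placeholders):

```bash
NB_PROCS=40
for file in "${FILE_LIST[@]}"; do
  # Throttle: wait until at least one of the running copies finishes
  # before starting the next one.
  while (( $(jobs -rp | wc -l) >= NB_PROCS )); do
    wait -n
  done
  XRD_STREAMTIMEOUT=7200 xrdcp "${file}" \
    "root://${EOS_MGM_HOST}//${EOS_DIR}/$(basename "${file}")" &
done
wait   # wait for the remaining copies to finish
```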
OStoreDB was never stress tested with pre-queueing of more than 1M files. Pre-queueing has the effect that, as soon as one drive takes the global scheduling lock for processing, further queueing is stalled until the queueing script eventually crashes.
I tried to mitigate this by putting the queueing to sleep whenever I detect a job taking more than 5 minutes, and by increasing the XRootD stream timeout on top of that.
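The back-off could look roughly like the sketch below; the detection logic and thresholds are assumptions, not the exact code in the script:

```bash
# Pause queueing while any xrdcp job has been running for more than 5 minutes,
# on the assumption that the global scheduling lock is being held by a drive.
while :; do
  longest=$(ps -o etimes= -C xrdcp 2>/dev/null | sort -rn | head -1)
  (( ${longest:-0} > 300 )) || break
  echo "A copy has been running for ${longest}s, pausing the queueing..."
  sleep 60
done
```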
For each file, 1 or 2 log files were created in the client-0 pod, all in a single directory; this became unmanageable for the stress test script.
The client-0 pod has only 64 MB of space in the /dev/shm directory where all the logs are stored, so logging fills it up quickly, especially since the logging of the xrdcp and xrdfs processes was set to Dump.
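One possible mitigation, sketched under the assumption that the standard XRootD client environment variables are used (the log path and level chosen here are illustrative):

```bash
# Lower the XRootD client verbosity from Dump and keep the client log off /dev/shm.
export XRD_LOGLEVEL=Warning
export XRD_LOGFILE=/var/log/xrdcp-stress.log
```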
I am trying to mitigate these issues and will run another test.