Running the stress test manually is currently quite painful. We should streamline this process by making the stress test a (manually triggered) CI pipeline job. Additionally, we should think about how we handle the monitoring chart...
Sadly, the current stress test setup is running into some XRootD-related errors. This is most likely due to a misconfiguration on the EOS side. For now, to get the stress test running, I have checked out an older version of our repo so that we can use the old setup (but with the latest image).
In the meantime, I am looking into why this is failing.
In theory, the current stress test should work (provided we don't run into issues with how the machine is configured):
- We are now using the alma9 version of EOS 5.2.27
- Fixed a number of bugs in the stress test-related scripts
- Added some missing settings to the EOS MGM configuration
Let's see how this goes now. In any case, for the next release we can already deploy to preproduction, as this version is only needed for the repack dual-copy fix.
It seems like it stopped about halfway through for some reason. cta-admin sq does not show anything, so nothing is being queued anymore, and htop also shows very little activity. At first glance, the logs are not providing any more information.
EOS is showing this though, which is a bit suspicious:
From the monitoring plot it seems that the queueing stopped at 1.2M files and the drives then simply processed what was left in the queue. I would check the space configuration of the eos mgm and fst charts and make sure there is enough space to store the files we need.
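As a quick sanity check, the space situation could be inspected directly on the MGM. This is only a sketch: the pod name is taken from the prompts further down, and the exact kubectl invocation (namespace, context) is an assumption.

```bash
# List the configured EOS spaces and the per-filesystem usage to confirm
# there is enough capacity left for the files the stress test still has to write.
kubectl exec -it eos-mgm-0 -- eos space ls
kubectl exec -it eos-mgm-0 -- eos fs ls
```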
Another observation: in the eos-fst-0 pod there is this line reporting in a loop:
```
Events:
  Type     Reason                           Age                   From     Message
  ----     ------                           ----                  ----     -------
  Warning  FailedToRetrieveImagePullSecret  90s (x349 over 7h7m)  kubelet  Unable to retrieve some image pull secrets (reg-ctageneric); attempting to pull the image may not succeed.
```
This FailedToRetrieveImagePullSecret is to be expected. We provide multiple Kubernetes secrets to pull the image (as we did not have a consistent naming standard a while back), but not all of them necessarily exist. Provided one of them works, it is okay. This only affects image pulling; it does not affect runtime behaviour.
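If we want to confirm which of the configured pull secrets actually exist, a quick check could look like the sketch below (the secret name prefix is taken from the warning above; everything else is an assumption about the setup):

```bash
# Show which pull secrets the pod references, then which of them exist in the namespace.
kubectl get pod eos-fst-0 -o jsonpath='{.spec.imagePullSecrets[*].name}'; echo
kubectl get secrets | grep reg-
```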
There is a bug in client_ar.sh: apparently, in the last test it stopped after creating only directories 0 through 11:
```
[root@eos-mgm-0 /]# eos ls -ly /eos/ctaeos/preprod/850444a8-4f36-4bb6-9328-83984a950ee6/
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:29 0
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:34 1
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12900000 Jan 30 14:21 10
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12900000 Jan 30 15:14 11
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:39 2
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:44 3
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:49 4
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 13:55 5
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:00 6
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:05 7
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:10 8
d1::t0 dr-xr-xr-+ 1 user1 eosusers 12800000 Jan 30 14:16 9
```
After filling the 11th one, it tried to put the drives up a second time (the bug: `if (( (subdir + 1) > 10 )); then` should be `if (( subdir == 10 )); then`) and possibly hung there while the drives processed the backlog, which is why the script did not output any more log lines. We could simply fix this and relaunch the script from subdir=12 to confirm the hypothesis.
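For reference, a minimal sketch of the corrected guard; the surrounding loop structure and helper names are assumptions, not the actual client_ar.sh code:

```bash
for ((subdir = 0; subdir < NB_DIRS; subdir++)); do
  queue_files_for_subdir "${subdir}"   # hypothetical helper launching the parallel xrdcp jobs
  if (( subdir == 10 )); then          # was: if (( (subdir + 1) > 10 )); then
    put_drives_up                      # hypothetical helper; must run exactly once
  fi
done
```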
After relaunching from subdir=12 with the drives down and putting them up at subdir == 13, the queueing stopped again after reaching subdir == 14. The new suspects are the admin_cta kinit commands in the for loop. I removed these and relaunched again from subdir=15.
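Roughly what the loop looks like after this change (a sketch only; the helper name is a placeholder), with the output of the relaunch following below:

```bash
for ((subdir = 15; subdir < NB_DIRS; subdir++)); do
  # The per-iteration kinit for admin_cta that used to be here was removed,
  # since re-authenticating on every iteration was suspected of stalling the loop.
  queue_files_for_subdir "${subdir}"   # hypothetical helper launching the parallel xrdcp jobs
done
```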
```
Copying files to /eos/ctaeos/preprod/c02fd7a0-19d3-45c6-8f84-55f73686cdd5/15 using 40 processes...
Starting to queue the files for the subdir.
Done.
Copying files to /eos/ctaeos/preprod/c02fd7a0-19d3-45c6-8f84-55f73686cdd5/16 using 40 processes...
Starting to queue the files for the subdir.
Putting drives up
Drive VDSTK01: set Up.
Drive VDSTK02: set Up.
Done.
Copying files to /eos/ctaeos/preprod/c02fd7a0-19d3-45c6-8f84-55f73686cdd5/17 using 40 processes...
Starting to queue the files for the subdir.
+ exit 0
```
What puzzles me is that the script should print "Done." after it finishes queueing the files for dir 17, and I would not expect it to exit cleanly with 0.
The problem was that the test file creation did not produce enough data. We (@nbugel) have improved and simplified this logic considerably, and the stress test is now running with no issues.
Taking that back: the problem was none of the above and it is still present. The remaining suspicion is that it has to do with the ObjectStore global lock, which prevents all queueing while the queue is being digested, combined with the xrdcp stream timing out 10 minutes after the last time any data was moved. I am adding XRD_STREAMTIMEOUT=7200 to the xrdcp command to see if this helps.
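Concretely, the change is just prefixing the copy with the XRootD client environment variable; the source file, destination URL and variable names in this sketch are illustrative, not the actual script:

```bash
# Allow up to 2 hours without data movement before the client gives up on the stream,
# instead of the default stream timeout.
XRD_STREAMTIMEOUT=7200 xrdcp "${LOCAL_TEST_FILE}" \
  "root://${EOS_MGM_HOST}//eos/ctaeos/preprod/${TEST_DIR}/${subdir}/${file_name}"
```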
```
[...]
Before check of subdir number 19
Putting drives up
Drive VDSTK01: set Up.
Drive VDSTK02: set Up.
Done.
eos mkdir exit status: 0 \n
Copying files to /eos/ctaeos/preprod/dddb18c8-2091-4dfd-9afe-8a6da9020590/20 using 40 processes...\n
test20
Starting to queue the 50000 files for the subdir 20/40.
+ exit 0
[root@eos-mgm-0 /]# eos ls -ly /eos/ctaeos/preprod/dddb18c8-2091-4dfd-9afe-8a6da9020590
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:14 0
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:17 1
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:40 10
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:42 11
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:45 12
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:48 13
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:50 14
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:53 15
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:55 16
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 01:58 17
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 02:01 18
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 02:03 19
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:19 2
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6450000 Feb 1 02:27 20
d1::t0 dr-xr-xr-+ 1 user1 eosusers 0 Feb 1 02:27 21
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:22 3
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:24 4
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:27 5
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:29 6
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:32 7
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:35 8
d1::t0 dr-xr-xr-+ 1 user1 eosusers 6400000 Feb 1 01:37 9
```
I will try to clean up the parameter expansion in the xargs call a bit, add more logging, and also clean up the millions of log files created in the client-0 pod by the xrdcp processes (!), then rerun the queueing.
If this does not help, I will fall back to a while loop and manage the parallel processes without xargs.
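A minimal sketch of that fallback, capping the number of concurrent copies with plain background jobs instead of xargs (all variable names here are placeholders):

```bash
NB_PROCS=40
for file in "${FILE_LIST[@]}"; do
  # Throttle: wait until at least one of the running copies finishes
  # before starting the next one.
  while (( $(jobs -rp | wc -l) >= NB_PROCS )); do
    wait -n
  done
  XRD_STREAMTIMEOUT=7200 xrdcp "${file}" \
    "root://${EOS_MGM_HOST}//${EOS_DIR}/$(basename "${file}")" &
done
wait   # wait for the remaining copies to finish
```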
OStoreDB was never stress tested with pre-queueing of more than 1M files. Pre-queueing has the effect that, as soon as one drive takes the global scheduling lock for processing, further queueing is stalled until the queueing script eventually crashes.
I tried to mitigate this by putting the queueing to sleep whenever I detect a job taking more than 5 minutes, and by increasing the XRootD stream timeout on top of that.
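The back-off could look roughly like the sketch below; the detection logic and thresholds are assumptions, not the exact code in the script:

```bash
# Pause queueing while any xrdcp job has been running for more than 5 minutes,
# on the assumption that the global scheduling lock is being held by a drive.
while :; do
  longest=$(ps -o etimes= -C xrdcp 2>/dev/null | sort -rn | head -1)
  (( ${longest:-0} > 300 )) || break
  echo "A copy has been running for ${longest}s, pausing the queueing..."
  sleep 60
done
```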
For each file, 1 or 2 log files were created in the client-0 pod, all in a single directory; this became unmanageable for the stress test script.
The client-0 pod has only 64 MB of space in the /dev/shm directory where all the logs are stored, so logging fills it up quickly, especially since the logging of the xrdcp and xrdfs processes was set to Dump.
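One possible mitigation, sketched under the assumption that the standard XRootD client environment variables are used (the log path and level chosen here are illustrative):

```bash
# Lower the XRootD client verbosity from Dump and keep the client log off /dev/shm.
export XRD_LOGLEVEL=Warning
export XRD_LOGFILE=/var/log/xrdcp-stress.log
```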
I am trying to mitigate these issues and will run another test.