In #884 we tried to run the "stress" test and boost the performance of the containerised deployment with many adjustments. Some of these might be useful for the general CI run to improve the speed at which our pipeline is running. We will collect them here.
We have 8 cores, 14 GB of memory and only 1 disk of 80 GB available.
Minikube resource parameters
These are configured in the start_minikube.sh script. Considering that we run it on the more powerful stress-test and dev VMs as well, we might want to make them configurable.
For the setup of the CI runner I would keep the 6 cores:
local minikube_cores=6
In case we decide to mount e.g. the MHVTL or EOS data disk in memory (for faster performance), we could allocate more than the current default of 8 GB and use e.g. 10 GB to have space for these:
local minikube_memory=10000
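A minimal sketch of how this could be made configurable in start_minikube.sh, assuming we simply let environment variables override the hard-coded defaults (the MINIKUBE_CORES/MINIKUBE_MEMORY names are made up for illustration, they are not existing options):

# Inside the existing function in start_minikube.sh: fall back to the current
# defaults, but let stress-test / dev VMs override them from the environment.
local minikube_cores=${MINIKUBE_CORES:-6}
local minikube_memory=${MINIKUBE_MEMORY:-10000}

# minikube accepts these directly, e.g.:
# minikube start --cpus "${minikube_cores}" --memory "${minikube_memory}"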
MHVTL
Depending on how many files we will put on the tapes and how big they are, we should think about scaling this deployment accordingly, i.e. the size of the tapes, the number of tapes and the size of the mount of the /opt/mhvtl directory.
Parameters to consider:
We write about 512 B per file mark, and we write 3 file marks per file (1 for the file data, 1 for the header data and 1 for the trailer data). The header and trailer add 240 B each on top of that. So a minimal file size of 1 B will generate 2.017 kB of data written to disk.
Below is an example of writing one 128 B file to tape: except for Blk No. 0 (the tape label), you will see 3 blocks of 80 bytes for the header, 1 block for the file data and 3 blocks for the trailer. The sz: field shows the compression rate; as you can see, the compression was 30/128 for the data. The 128 B file content was 112 B of zeroes plus a final 16 B of unique string, to have a different checksum per file. We might also want to consider adjusting the file generation scripts. What I describe here currently exists only as a manual modification of /home/cirunner/ci_monitoring/CTA_stress_test/client_ar.sh on the eosctafst0014 machine.
For the example case, we have 389 bytes written (compressed) to the tape, and about 75231 files fit on the 30 MB tape, i.e. 225694 file marks in total.
In addition, the MHVTL tape format has a fixed 512 byte 'index' per tape block. This data is written into a separate /opt/mhvtl/indx file. As you can see in the example above, we have 9 blocks written per file. This results in 75231 × 9 × 512 B = 346 MB of index data written to disk, which is O(10) times more than the data itself that we want to write for small files. I do not know why we write the header and trailer as 3 blocks each on the tape (if this is not intentional, perhaps we could improve on this front too).
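To make the numbers above easy to re-check, the two estimates in shell arithmetic (values taken from the example above):

# Minimal per-file overhead: 3 file marks of 512 B + 240 B header + 240 B trailer + 1 B payload
echo $(( 3*512 + 2*240 + 1 ))      # 2017 B, i.e. ~2.017 kB
# Index data for the 30 MB example tape: 9 tape blocks per file, 512 B of index each
echo $(( 75231 * 9 * 512 ))        # 346664448 B, i.e. ~346 MB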
Since the test_client script seems to be configured only with:
NB_FILES=10000
FILE_SIZE_KB=15
Considering this is non-compressible data generated from /dev/random, and using an educated guess (based on the above), we should need about 17 kB of space per file in the MHVTL data file, i.e. 170 MB in total for the data.
Then, considering the default block size of 4 kB on the FS in use, the data would be written using 4 blocks instead of the 1 in the previous example, i.e. 3 more index blocks per file, i.e. 12 tape blocks per file with a 512 B index each, i.e. 62 MB of index data in total. In a minimalistic setup we should not need more than 240 MB of space for the MHVTL tapes. To stay on the safe side for reconfigurations, I would mount /opt/mhvtl on a 1 GB volume.
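The same back-of-the-envelope estimate for the CI numbers, as a quick sketch (10000 files of 15 kB plus the ~2 kB per-file overhead derived above):

# MHVTL data file: ~17 kB per file (15 kB payload + marks/header/trailer)
echo $(( 10000 * 17 * 1000 ))      # 170000000 B, i.e. ~170 MB
# MHVTL index file: 12 tape blocks per file, 512 B of index each
echo $(( 10000 * 12 * 512 ))       # 61440000 B, i.e. ~62 MB
# => roughly 230-240 MB in total; a 1 GB /opt/mhvtl volume leaves plenty of headroom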
We have seen in the "stress" tests that MHVTLs performance is much higher is MHVTL is mounted on a tmpfs volume:
# Home directory for configuration files
MHVTL_CONFIG_PATH=/etc/mhvtl
# Home directory for tape content
MHVTL_HOME_PATH=/opt/mhvtl
# Default media capacity in Mb
CAPACITY=20
# Verbosity level [0|1|2|3]
VERBOSE=0
# Set kernel module debuging [0|1]
VTL_DEBUG=0
And in /etc/mhvtl/device.conf, changing the Backoff parameter to Backoff: 50.
Note the VERBOSE=0, which prevents flooding the log with messages from the more frequent checks of the drive status.
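A minimal sketch of these changes on a runner, assuming the paths from the config above; the sed patterns and the unit name are illustrative and would need checking against the actual device.conf layout and the mhvtl packaging in use:

# Mount the MHVTL tape content directory on tmpfs (1 GB as estimated above);
# mhvtl typically runs as the 'vtl' user, so ownership may need adjusting.
mount -t tmpfs -o size=1g tmpfs /opt/mhvtl
chown vtl:vtl /opt/mhvtl

# Lower the drive polling backoff for every drive entry (assumes "Backoff: <value>" lines).
sed -i 's/Backoff:.*/Backoff: 50/' /etc/mhvtl/device.conf

# Keep VERBOSE=0 so the more frequent polling does not flood the logs.
sed -i 's/^VERBOSE=.*/VERBOSE=0/' /etc/mhvtl/mhvtl.conf

# Restart mhvtl so the changes take effect (unit name assumed; adjust to the packaging used).
systemctl restart mhvtl.target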
In the stress tests, we were killing the running cta-fst-gcd Python script as it was consuming quite a lot of CPU. I am not sure how critical this process is for the CI pipeline and whether it has to be running or not.
We were also mounting QuarkDB on a separate disk (we do not have one here) and mounting the FST disk on a RAM disk (it cannot be tmpfs, as there was a change disallowing xattr on tmpfs for user accounts).
We could try to mount the FST on a RAM disk to separate it a bit from the logging and other load on the local disk. On the other hand, I am not sure if this is really worth it. One can find how to do this in #884.
In order to mount the disks on SSDs or RAM disks, we had to make modifications in several places: the ctaeos-mgm.sh, start_quarkdb.sh, xrd-config-eos.yaml or values.yaml scripts, and rebuild the container in the CI pipeline so that the stress test deployment could pick up the new container. It would be nice to have this configuration decoupled for easier deployment.
Also, unless we need the logging info in the EOS pod for debugging purposes, we could decrease the logging level, but I think this again is not much of a gain for a small CI pipeline run:
eos debug err "*"
eos debug err /eos/ctaeos.toto.svc.cluster.local:1095/fst
Number of Threads Configuration
We have 10 disk read/write threads configured per drive and 10 for the entire VO, so half that number of threads actually running. In addition, with 2 tape servers running (or perhaps currently 3), this is still unnecessarily many processes taking CPU cycles and logging.
I would go for fewer, even just 1 disk read/write thread per drive; e.g. for 2 drives we could set NB_DRIVES=2 when we call:
cta-admin vo ch --vo vo --writemaxdrives ${NB_DRIVES} --readmaxdrives ${NB_DRIVES}
Of course, it would be great if this could be made configurable in the same place where we check the number of drives, or similar.
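A hedged sketch of what that could look like, assuming a simple environment override next to wherever the drive count is defined (the NB_DRIVES default below is just the example value from above):

# Hypothetical override: default to 2 drives, but let the deployment set it.
NB_DRIVES=${NB_DRIVES:-2}

# One disk read/write thread per drive for the whole VO.
cta-admin vo ch --vo vo --writemaxdrives ${NB_DRIVES} --readmaxdrives ${NB_DRIVES}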
The number of queueing processes (NB_PROCS=100, e.g. in test_client.sh) is also high; I achieved much less strain on the machine with 20 processes, while keeping the same rate.
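For illustration, in test_client.sh this would just be (the environment override is a hypothetical convenience, not something the script currently supports):

# Fewer concurrent queueing processes: in our tests 20 kept the same rate
# as 100 while putting much less strain on the machine.
NB_PROCS=${NB_PROCS:-20}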
I worked partially with a different set of scripts, the client_ar.sh from the ci_monitoring repo, which also includes slightly refactored scripts from the regular CI pipeline for the stress test use case.
I believe the monitoring should be independent and not mixed with the stress test use-case scripts in that repo.
In addition, CTA_stress_test contains just slightly refactored CI pipeline scripts. We could refactor what we have to make it configurable for the stress test use case without needing to keep it in a separate repo, imho.
For better concurrency testing, we already have (thanks to @nbugel) the possibility to configure multiple drives. What we also need is the flexibility to configure multiple libraries and tape pools.
The Helm setup already supports multiple libraries (at least in theory; I haven't tested this), but you have to manually provide the tape server config file for it to work properly. Of course, the mhvtl setup needs a particular configuration to allow this.
For now the backoff parameter and the VERBOSE flag for the mhvtl configuration have been updated in ctadev13 (backoff: 40 and VERBOSE: 0). I'll be monitoring the situation to see if it improves things. If so, we'll make these changes to the other cirunners as well.
The remainder of the suggestions we can investigate at a later point in time. Most likely we need to put some more work into this setup before the hands-on workshop anyway, so that might be a nice time to have a good look at things.
Thank you Niels. If the backoff parameter change does not improve the performance, it could also be that the bottleneck is the SSD mount, which degrades the performance very quickly. I only played with this parameter once I already had /opt/mhvtl mounted on a tmpfs RAM disk. If you do not see a big improvement when changing this parameter from 400 to 40 while mounted on a RAM disk, you can go as low as 10.
That is good to know, thanks. Most likely we will do any more complicated changes at a later point in time (e.g. before the workshop), as this is not very easy to update properly on the existing runners.
The problem is not the changes we need to make, the problem is integrating them into the existing runners. For a dev machine this is straightforward, but for the GitLab runners this is a bit more complex sadly (basically changing a config option is as much as we can easily do at the moment).
I'll definitely look into this at some point to understand, streamline and document the whole process, but I'll not be doing that within the next few weeks : )