In #884 we tried to run the "stress" test and boost the performance of the containerised deployment with many adjustments. Some of these might be useful for the general CI run to improve the speed at which our pipeline is running. We will collect them here.
We have 8 cores, 14 GB of memory and only 1 disk of 80 GB available.
Minikube resource parameters
These are configured in the start_minikube.sh script. Considering that we run it on the more powerful stress-test and dev VMs as well, we might want to make them configurable.
For the setup of the CI runner I would keep the 6 cores:
local minikube_cores=6
In case we decide to mount e.g. the MHVTL or EOS data disk in memory (for faster performance), we could allocate more than the current default of 8 GB and use e.g. 10 GB to have space for these:
local minikube_memory=10000
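A minimal sketch of how this could be made configurable in start_minikube.sh, assuming we simply let environment variables override the hard-coded defaults (the MINIKUBE_CORES/MINIKUBE_MEMORY names are made up for illustration, they are not existing options):

# Inside the existing function in start_minikube.sh: fall back to the current
# defaults, but let stress-test / dev VMs override them from the environment.
local minikube_cores=${MINIKUBE_CORES:-6}
local minikube_memory=${MINIKUBE_MEMORY:-10000}

# minikube accepts these directly, e.g.:
# minikube start --cpus "${minikube_cores}" --memory "${minikube_memory}"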
MHVTL
Depending on how many files we will put on the tapes and how big they are, we should think about scaling this deployment accordingly, i.e. the size of the tapes, the number of tapes and the size of the mount of the /opt/mhvtl directory.
Parameters to consider:
We write about 512 B per file mark, and we write 3 file marks per file (1 for the file data, 1 for the header data and 1 for the trailer data). The header and trailer add 240 B each on top of that. So a minimal file size of 1 B will generate 2.017 kB of data written to disk.
Below is an example of writing one 128 B file to tape: except for Blk No. 0 (the tape label), you will see 3 blocks of 80 bytes for the header, 1 block for the file data and 3 blocks for the trailer. The sz: field shows the compression rate; as you can see, the compression was 30/128 for the data. The 128 B file content was 112 B of zeroes plus a final 16 B of unique string, to have a different checksum per file. We might also want to consider adjusting the file generation scripts. What I describe here currently exists only as a manual modification of /home/cirunner/ci_monitoring/CTA_stress_test/client_ar.sh on the eosctafst0014 machine.
For the example case, we have 389 bytes written (compressed) to the tape, and about 75231 files fit on the 30 MB tape, i.e. 225694 file marks in total.
In addition, the MHVTL tape format has a fixed 512 byte 'index' per tape block. This data is written into a separate /opt/mhvtl/indx file. As you can see in the example above, we have 9 blocks written per file. This results in 75231 × 9 × 512 B = 346 MB of index data written to disk, which is O(10) times more than the data itself that we want to write for small files. I do not know why we write the header and trailer as 3 blocks each on the tape (if this is not intentional, perhaps we could improve on this front too).
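To make the numbers above easy to re-check, the two estimates in shell arithmetic (values taken from the example above):

# Minimal per-file overhead: 3 file marks of 512 B + 240 B header + 240 B trailer + 1 B payload
echo $(( 3*512 + 2*240 + 1 ))      # 2017 B, i.e. ~2.017 kB
# Index data for the 30 MB example tape: 9 tape blocks per file, 512 B of index each
echo $(( 75231 * 9 * 512 ))        # 346664448 B, i.e. ~346 MB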
Since the test_client script seems to be configured only with:
NB_FILES=10000
FILE_SIZE_KB=15
Considering this is non-compressible data generated from /dev/random, and using an educated guess (based on the above), we should need about 17 kB of space per file in the MHVTL data file, i.e. 170 MB in total for the data.
Then, considering the default block size of 4 kB on the FS in use, the data would be written using 4 blocks instead of the 1 in the previous example, i.e. 3 more index blocks per file, i.e. 12 tape blocks per file with a 512 B index each, i.e. 62 MB of index data in total. In a minimalistic setup we should not need more than 240 MB of space for the MHVTL tapes. To stay on the safe side for reconfigurations, I would mount /opt/mhvtl on a 1 GB volume.
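The same back-of-the-envelope estimate for the CI numbers, as a quick sketch (10000 files of 15 kB plus the ~2 kB per-file overhead derived above):

# MHVTL data file: ~17 kB per file (15 kB payload + marks/header/trailer)
echo $(( 10000 * 17 * 1000 ))      # 170000000 B, i.e. ~170 MB
# MHVTL index file: 12 tape blocks per file, 512 B of index each
echo $(( 10000 * 12 * 512 ))       # 61440000 B, i.e. ~62 MB
# => roughly 230-240 MB in total; a 1 GB /opt/mhvtl volume leaves plenty of headroom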
We have seen in the "stress" tests that MHVTLs performance is much higher is MHVTL is mounted on a tmpfs volume:
# Home directory for configuration files
MHVTL_CONFIG_PATH=/etc/mhvtl
# Home directory for tape content
MHVTL_HOME_PATH=/opt/mhvtl
# Default media capacity in Mb
CAPACITY=20
# Verbosity level [0|1|2|3]
VERBOSE=0
# Set kernel module debuging [0|1]
VTL_DEBUG=0
And in /etc/mhvtl/device.conf, changing the Backoff parameter to Backoff: 50.
Note the VERBOSE=0, which prevents flooding the log with messages from the more frequent checks of the drive status.
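A minimal sketch of these changes on a runner, assuming the paths from the config above; the sed patterns and the unit name are illustrative and would need checking against the actual device.conf layout and the mhvtl packaging in use:

# Mount the MHVTL tape content directory on tmpfs (1 GB as estimated above);
# mhvtl typically runs as the 'vtl' user, so ownership may need adjusting.
mount -t tmpfs -o size=1g tmpfs /opt/mhvtl
chown vtl:vtl /opt/mhvtl

# Lower the drive polling backoff for every drive entry (assumes "Backoff: <value>" lines).
sed -i 's/Backoff:.*/Backoff: 50/' /etc/mhvtl/device.conf

# Keep VERBOSE=0 so the more frequent polling does not flood the logs.
sed -i 's/^VERBOSE=.*/VERBOSE=0/' /etc/mhvtl/mhvtl.conf

# Restart mhvtl so the changes take effect (unit name assumed; adjust to the packaging used).
systemctl restart mhvtl.target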
In the stress tests, we were killing the running cta-fst-gcd Python script as it was consuming quite a lot of CPU. I am not sure how critical this process is for the CI pipeline and whether it has to be running or not.
We were also mounting QuarkDB on a separate disk (we do not have one here) and mounting the FST disk on a RAM disk (it cannot be tmpfs, as there was a change disallowing xattr on tmpfs for user accounts).
We could try to mount the FST on a RAM disk to separate it a bit from the logging and other load on the local disk. On the other hand, I am not sure if this is really worth it. One can find how to do this in #884.
In order to mount the disks on SSDs or RAM disks, we had to make modifications in several places: the ctaeos-mgm.sh, start_quarkdb.sh, xrd-config-eos.yaml or values.yaml scripts, and rebuild the container in the CI pipeline so that the stress test deployment could pick up the new container. It would be nice to have this configuration decoupled for easier deployment.
Also, unless we need the logging info in the EOS pod for debugging purposes, we could decrease the logging level, but I think this again is not much of a gain for a small CI pipeline run:
eos debug err "*"
eos debug err /eos/ctaeos.toto.svc.cluster.local:1095/fst
Number of Threads Configuration
We have 10 disk read/write threads configured per drive and 10 for the entire VO, so half that number of threads actually running. In addition, with 2 tape servers running (or perhaps currently 3), this is still unnecessarily many processes taking CPU cycles and logging.
I would go for fewer, even just 1 disk read/write thread per drive; e.g. for 2 drives we could set NB_DRIVES=2 when we call:
cta-admin vo ch --vo vo --writemaxdrives ${NB_DRIVES} --readmaxdrives ${NB_DRIVES}
Of course, it would be great if this could be made configurable in the same place where we check the number of drives, or similar.
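A hedged sketch of what that could look like, assuming a simple environment override next to wherever the drive count is defined (the NB_DRIVES default below is just the example value from above):

# Hypothetical override: default to 2 drives, but let the deployment set it.
NB_DRIVES=${NB_DRIVES:-2}

# One disk read/write thread per drive for the whole VO.
cta-admin vo ch --vo vo --writemaxdrives ${NB_DRIVES} --readmaxdrives ${NB_DRIVES}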
The number of queueing processes (NB_PROCS=100, e.g. in test_client.sh) is also high; I achieved much less strain on the machine with 20 processes, while keeping the same rate.
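For illustration, in test_client.sh this would just be (the environment override is a hypothetical convenience, not something the script currently supports):

# Fewer concurrent queueing processes: in our tests 20 kept the same rate
# as 100 while putting much less strain on the machine.
NB_PROCS=${NB_PROCS:-20}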
I worked partially with a different set of scripts, the client_ar.sh from the ci_monitoring repo, which also includes slightly refactored scripts from the regular CI pipeline for the stress test use case.
I believe the monitoring should be independent and not mixed with the stress test use-case scripts in that repo.
In addition, CTA_stress_test contains just slightly refactored CI pipeline scripts. We could refactor what we have to make it configurable for the stress test use case without needing to keep it in a separate repo, imho.
For better concurrency testing, we already have (thanks to @nbugel) the possibility to configure multiple drives. What we also need is the flexibility to configure multiple libraries and tape pools.
The Helm setup already supports multiple libraries (at least in theory; I haven't tested this), but you have to manually provide the tape server config file for it to work properly. Of course, the mhvtl setup needs a particular configuration to allow this.
For now the backoff parameter and the VERBOSE flag for the mhvtl configuration have been updated in ctadev13 (backoff: 40 and VERBOSE: 0). I'll be monitoring the situation to see if it improves things. If so, we'll make these changes to the other cirunners as well.
The remainder of the suggestions we can investigate at a later point in time. Most likely we need to put some more work into this setup before the hands-on workshop anyway, so that might be a nice time to have a good look at things.
Thank you Niels. If the backoff parameter change does not improve the performance, it could also be that the bottleneck is the SSD mount, which degrades the performance very quickly. I only played with this parameter once I already had /opt/mhvtl mounted on a tmpfs RAM disk. If you do not see a big improvement when changing this parameter from 400 to 40 while mounted on a RAM disk, you can go as low as 10.
That is good to know, thanks. Most likely we will do any more complicated changes at a later point in time (e.g. before the workshop), as this is not very easy to update properly on the existing runners.
The problem is not the changes we need to make, the problem is integrating them into the existing runners. For a dev machine this is straightforward, but for the GitLab runners this is a bit more complex sadly (basically changing a config option is as much as we can easily do at the moment).
I'll definitely look into this at some point to understand, streamline and document the whole process, but I'll not be doing that within the next few weeks : )