v5.11.2.0-1
Initial code references for release
This release should be based on <commit_id> (TBD) from branch main.
This release will be tagged as version v.5.11.2.0-1.
Additional details
This release aims to fix several of the problems found in the previous version v.5.11.1.0-1.
Stress test results
ci_monitoring
The CTA_stress_test/CTA_stress_test.sh script had the following changes made w.r.t. the main branch:
NB_FILES=20000
NB_PROCS=40
NB_DRIVES=2
The CTA_stress_test/client_ar.sh script had the following changes made w.r.t. the main branch:
DD_BS=112
NB_BATCH_PROCS=20 # number of parallel batch processes (was 500)
BATCH_SIZE=100 # number of files per batch process (was 20)
and we were generating and transferring more compressible files, which altogether added up to 128 B per file:
+done | xargs --max-procs=${NB_PROCS} -iTEST_FILE_NUM bash -c "
+ {
+ dd if=/tmp/testfile bs=${DD_BS} skip=$((${subdir} * ${NB_FILES} + TEST_FILE_NUM)) count=${FILE_KB_SIZE} 2>/dev/null ;
+ echo UNIQUE_${subdir}_TEST_FILE_NUM;
+ } | XRD_LOGLEVEL=Dump xrdcp - root://${EOSINSTANCE}/${EOS_DIR}/${subdir}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM_$(date +%s%N) 2>${ERROR_DIR}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM && rm ${ERROR_DIR}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM || echo ERROR with xrootd transfer for file ${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM, full logs in ${ERROR_DIR}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM"
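For reference, the 128 B figure can be reproduced with a minimal sketch of the per-file payload. This assumes FILE_KB_SIZE=1 (i.e. dd emits a single 112-byte block) and uses /dev/zero as a stand-in for the compressible /tmp/testfile; the subdir and file number values are made up for illustration:

# Sanity check of the per-file size: 112 B of dd payload plus the ~16 B UNIQUE marker
DD_BS=112
subdir=10
TEST_FILE_NUM=12345
{
  dd if=/dev/zero bs=${DD_BS} count=1 2>/dev/null
  echo "UNIQUE_${subdir}_${TEST_FILE_NUM}"
} | wc -c   # prints 128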
The stress test has been running on the eosctafst0014 machine, with mhvtl mounted on a RAM disk and logging disabled (VERBOSE=0) to limit the messages after setting Backoff: 10. We used 160 x 30 MB tapes.
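For context, a minimal sketch of where these knobs live in a stock mhvtl installation; the exact contents of the configuration files on eosctafst0014 are assumptions here, shown only to illustrate the VERBOSE, tape capacity and Backoff settings:

# /etc/mhvtl/mhvtl.conf -- library-wide settings (sketch)
VERBOSE=0       # silence the vtltape/vtllibrary log chatter
CAPACITY=30     # virtual tape capacity in MB, matching the 30 MB tapes

# /etc/mhvtl/device.conf -- per-drive entry (excerpt, sketch)
#   Drive: 11 CHANNEL: 00 TARGET: 00 LUN: 00
#     ...
#     Backoff: 10   # reduced drive-daemon polling backoff

# RAM disk backing the virtual tape files (default mhvtl media path is /opt/mhvtl;
# the tmpfs size is an assumption)
mount -t tmpfs -o size=8G tmpfs /opt/mhvtl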
v.5.11.2.0-1 IMAGE_TAG=9582005git4825ccca
The stress test dashboard is under construction.
- The pre-queueing of about 45 min was necessary; without it the drives starve for work, as the queueing is too slow to feed them and we see many sequential mounts. With the 1M files pre-queued, when putting a drive up, the OStoreDB scheduler keeps the queueing to a minimum and the drives start to process the files.
- archival
- retrieve
  * the retrieval requests started to accumulate for the first 40 minutes (up until 675209 files); then the drives were put UP and transfers started flowing
  * the total time of retrieve transfers was 1h18min for 1304524 files --> 279 Hz for 2 drives --> 139 Hz per drive (the rate arithmetic is spelled out in the sketch after this list)
  * the total queueing time for 1261366 files was 59 minutes; more files were counted as retrieved than were queued for retrieve (might be due to errors/retries or other reasons ... to be understood)
  * not all the files were retrieved; error messages such as the one below started appearing in the logs after a while
ERROR with xrootd xattr get for file test99019992_1733955257080487104, full logs in /dev/shm/401d30b4-c532-46db-b4e0-8e8faca1c481/XATTRGET_test99019992_1733955257080487104 [...]
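A quick back-of-the-envelope check of the rates quoted above (the durations and file counts come straight from the bullets; the shell arithmetic is only an illustration):

# Retrieve rate for v.5.11.2.0-1: 1304524 files in 1h18min (4680 s)
awk 'BEGIN { printf "%.0f Hz for 2 drives, %.0f Hz per drive\n", 1304524/4680, 1304524/4680/2 }'
# Queueing rate: 1261366 files queued in 59 min (3540 s)
awk 'BEGIN { printf "%.0f Hz queueing rate\n", 1261366/3540 }'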
The previous release test:
v.5.10.10.1-1 IMAGE_TAG=9585483gitcc2af92a
Due to the monitoring being enabled only after the stress test had finished, the queueing information from the frontend did not make it to the DB (buffer overflow error). This can be improved later. For now we will only compare the archival and retrieval rates.
- archival
  * file transfer duration total 2h27min for 2002093 files --> 227 Hz for 2 drives --> 113 Hz per drive
  * as the queue was emptied the mounts started to get shorter, as the drives were starving for jobs
- retrieve
  * the total time of retrieve transfers was 1h15min for 1322007 files --> 294 Hz for 2 drives --> 147 Hz per drive (see the comparison sketch below)
  * not all the files were retrieved; error messages such as the one below started appearing in the logs after a while
ERROR with xrootd xattr get for file test99019996_1734010903477503182, full logs in /dev/shm/a881a001-dad2-4770-a837-ad0750b1ac60/XATTRGET_test99019996_1734010903477503182 [...]
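For the rate comparison the previous paragraph calls for, a small sketch putting the two retrieve runs side by side (numbers taken from the bullets above; the archival bullet for v.5.11.2.0-1 has no figures yet, so only retrieval is compared):

# Retrieve throughput: previous release vs this release
awk 'BEGIN {
  prev = 1322007 / 4500;   # v.5.10.10.1-1: 1322007 files in 1h15min
  curr = 1304524 / 4680;   # v.5.11.2.0-1:  1304524 files in 1h18min
  printf "v.5.10.10.1-1: %.0f Hz, v.5.11.2.0-1: %.0f Hz (%.1f%% change)\n", prev, curr, (curr - prev) / prev * 100
}'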