v5.11.2.0-1
Initial code references for release
This release should be based on <commit_id> (TBD) from branch main.
This release will be tagged as version v.5.11.2.0-1.
Additional details
This release aims to fix several of the problems found in the previous version v.5.11.1.0-1.
Stress test results
ci_monitoring
The CTA_stress_test/CTA_stress_test.sh script had the following changes made w.r.t. the main branch:
NB_FILES=20000
NB_PROCS=40
NB_DRIVES=2
The CTA_stress_test/client_ar.sh script had the following changes made w.r.t. the main branch:
DD_BS=112
NB_BATCH_PROCS=20 # number of parallel batch processes (was 500)
BATCH_SIZE=100 # number of files per batch process (was 20)
and we were generating and transferring more compressible files, which altogether added up to 128 B per file:
+done | xargs --max-procs=${NB_PROCS} -iTEST_FILE_NUM bash -c "
+ {
+ dd if=/tmp/testfile bs=${DD_BS} skip=$((${subdir} * ${NB_FILES} + TEST_FILE_NUM)) count=${FILE_KB_SIZE} 2>/dev/null ;
+ echo UNIQUE_${subdir}_TEST_FILE_NUM;
+ } | XRD_LOGLEVEL=Dump xrdcp - root://${EOSINSTANCE}/${EOS_DIR}/${subdir}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM_$(date +%s%N) 2>${ERROR_DIR}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM && rm ${ERROR_DIR}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM || echo ERROR with xrootd transfer for file ${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM, full logs in ${ERROR_DIR}/${TEST_FILE_NAME_SUBDIR}TEST_FILE_NUM"
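For reference, the 128 B figure can be reproduced with a minimal sketch of the per-file payload. This assumes FILE_KB_SIZE=1 (i.e. dd emits a single 112-byte block) and uses /dev/zero as a stand-in for the compressible /tmp/testfile; the subdir and file number values are made up for illustration:

# Sanity check of the per-file size: 112 B of dd payload plus the ~16 B UNIQUE marker
DD_BS=112
subdir=10
TEST_FILE_NUM=12345
{
  dd if=/dev/zero bs=${DD_BS} count=1 2>/dev/null
  echo "UNIQUE_${subdir}_${TEST_FILE_NUM}"
} | wc -c   # prints 128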
The stress test has been running on the eosctafst0014 machine, with mhvtl mounted on a RAM disk and logging disabled (VERBOSE=0) to limit the messages after setting Backoff: 10. We used 160 x 30 MB tapes.
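For context, a minimal sketch of where these knobs live in a stock mhvtl installation; the exact contents of the configuration files on eosctafst0014 are assumptions here, shown only to illustrate the VERBOSE, tape capacity and Backoff settings:

# /etc/mhvtl/mhvtl.conf -- library-wide settings (sketch)
VERBOSE=0       # silence the vtltape/vtllibrary log chatter
CAPACITY=30     # virtual tape capacity in MB, matching the 30 MB tapes

# /etc/mhvtl/device.conf -- per-drive entry (excerpt, sketch)
#   Drive: 11 CHANNEL: 00 TARGET: 00 LUN: 00
#     ...
#     Backoff: 10   # reduced drive-daemon polling backoff

# RAM disk backing the virtual tape files (default mhvtl media path is /opt/mhvtl;
# the tmpfs size is an assumption)
mount -t tmpfs -o size=8G tmpfs /opt/mhvtl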
v.5.11.2.0-1 IMAGE_TAG=9582005git4825ccca
The stress test dashboard is under construction.
- The pre-queueing of about 45 min was necessary; without it the drives starve for work, as the queueing is too slow to feed them and we see many sequential mounts. With the 1M files pre-queued, when putting a drive up, the OStoreDB scheduler keeps the queueing to a minimum and the drives start to process the files.
- archival
- retrieve
  * the retrieval requests started to accumulate for the first 40 minutes (up until 675209 files); then the drives were put UP and transfers started flowing
  * the total time of retrieve transfers was 1h18min for 1304524 files --> 279 Hz for 2 drives --> 139 Hz per drive (the rate arithmetic is spelled out in the sketch after this list)
  * the total queueing time for 1261366 files was 59 minutes; more files were counted as retrieved than were queued for retrieve (might be due to errors/retries or other reasons ... to be understood)
  * not all the files were retrieved; error messages such as the one below started appearing in the logs after a while
ERROR with xrootd xattr get for file test99019992_1733955257080487104, full logs in /dev/shm/401d30b4-c532-46db-b4e0-8e8faca1c481/XATTRGET_test99019992_1733955257080487104 [...]
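A quick back-of-the-envelope check of the rates quoted above (the durations and file counts come straight from the bullets; the shell arithmetic is only an illustration):

# Retrieve rate for v.5.11.2.0-1: 1304524 files in 1h18min (4680 s)
awk 'BEGIN { printf "%.0f Hz for 2 drives, %.0f Hz per drive\n", 1304524/4680, 1304524/4680/2 }'
# Queueing rate: 1261366 files queued in 59 min (3540 s)
awk 'BEGIN { printf "%.0f Hz queueing rate\n", 1261366/3540 }'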
The previous release test:
v.5.10.10.1-1 IMAGE_TAG=9585483gitcc2af92a
Due to the monitoring being enabled only after the stress test had finished, the queueing information from the frontend did not make it to the DB (buffer overflow error). This can be improved later. For now we will only compare the archival and retrieval rates.
- archival
  * file transfer duration total 2h27min for 2002093 files --> 227 Hz for 2 drives --> 113 Hz per drive
  * as the queue was emptied the mounts started to get shorter, as the drives were starving for jobs
- retrieve
  * the total time of retrieve transfers was 1h15min for 1322007 files --> 294 Hz for 2 drives --> 147 Hz per drive (see the comparison sketch below)
  * not all the files were retrieved; error messages such as the one below started appearing in the logs after a while
ERROR with xrootd xattr get for file test99019996_1734010903477503182, full logs in /dev/shm/a881a001-dad2-4770-a837-ad0750b1ac60/XATTRGET_test99019996_1734010903477503182 [...]
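For the rate comparison the previous paragraph calls for, a small sketch putting the two retrieve runs side by side (numbers taken from the bullets above; the archival bullet for v.5.11.2.0-1 has no figures yet, so only retrieval is compared):

# Retrieve throughput: previous release vs this release
awk 'BEGIN {
  prev = 1322007 / 4500;   # v.5.10.10.1-1: 1322007 files in 1h15min
  curr = 1304524 / 4680;   # v.5.11.2.0-1:  1304524 files in 1h18min
  printf "v.5.10.10.1-1: %.0f Hz, v.5.11.2.0-1: %.0f Hz (%.1f%% change)\n", prev, curr, (curr - prev) / prev * 100
}'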