
Repack VO not respecting the `readmaxdrives` limit

Summary

The Repack VO `readmaxdrives` limit is not respected when repacking tapes from multiple VOs/tape pools. After an initial look at the code, all the logic seems correct.

Steps to reproduce

Submit repack requests for tapes from different VOs/tape pools so that the total number of concurrent read mounts would exceed the `readmaxdrives` limit.

What is the current bug behaviour?

The readmaxdrives value is not respected.

What is the expected correct behaviour?

The readmaxdrives value should not be violated.

Relevant logs and/or screenshots

Limit is set to 5, read mounts went over the limit:

[screenshot: drive monitoring showing read mounts exceeding the limit of 5]

Prior to the last mount for retrieve (tpsrv015), all the repacks had been running for at least 2 hours. Also, for a period of time right before this mount the monitoring was reporting 4 mounts when we actually had 5.

[root@ctaproductionfrontend02 ~]# cta-admin dr ls | grep REPACK
 IBM455 I4550832 tpsrv017      Up         Retrieve Transfer  44107 I41904        r_na62_1        REPACK 27595  13.1T 296.3  928500        0        -   7 -                                
 IBM455 I4550834 tpsrv015      Up         Retrieve Transfer    938 I42042 r_backup_zvalt1        REPACK   251 273.7G 284.2  928766        0        -  15 -                                
 IBM455 I4550843 tpsrv013      Up         Retrieve Transfer   8136 I44811     r_alice_raw        REPACK  1707   3.1T 382.6  928736        0        -   4 -                                
 IBM460 I4600244 tpsrv478      Up         Retrieve Transfer   8103 I46015    r_cms_b4run2        REPACK   853   2.0T 249.0  928738        0        -   5 -                                
 IBM460 I4601434 tpsrv004      Up         Retrieve Transfer   8085 I45006    r_cms_b4run2        REPACK   798   2.4T 293.2  928737        0        -  13 -                                
 IBM460 I4601441 tpsrv437      Up         Retrieve Transfer  10932 I40738     r_alice_raw        REPACK  6284   5.8T 527.6  928665        0        -  14 -                                

Looking into the cta-taped logs, we can see that the reported number of retrieve mounts for the REPACK VO is 2 (existingMountsBasicTypeForThisVo="2"):

[1692107539.852782000] Aug 15 15:52:19.852782 tpsrv015.cern.ch cta-taped: LVL="INFO" PID="31301" TID="31301" MSG="In Scheduler::getNextMount(): Selected next mount (retrieve)" thread="MainThread" tapeDrive="I4550834" tapePool="r_backup_zvalt1" tapeVid="I42042" vo="REPACK" mediaType="3592JC7T" labelFormat="0000" vendor="IBM" mountType="RETRIEVE" existingMountsDistinctTypeForThisTapepool="0" existingMountsBasicTypeForThisVo="2" bytesQueued="1090519040000" minBytesToWarrantMount="500000000000" filesQueued="1000" minFilesToWarrantMount="10000" oldestJobAge="3" youngestJobAge="1" minArchiveRequestAge="300" getMountInfoTime="0.116467" queueTrimingTime="0.000000" getTapeInfoTime="1.259419" candidateSortingTime="0.011519" getTapeForWriteTime="0.143288" decisionTime="0.000717" mountCreationTime="0.002645" driveStatusSetTime="0.000000" schedulerDbTime="0.119112" catalogueTime="1.402707" 

One day ago it also happened, and the logs showed something completely different: `readmaxdrives` should be 5 and the retrieve mount count is 6, yet the scheduler decides to go forward with the new mount when it should have been filtered out by the logic in the code:

[1691913169.304161000] Aug 13 09:52:49.304161 tpsrv467.cern.ch cta-taped: LVL="INFO" PID="52653" TID="52653" MSG="In Scheduler::getNextMount(): Selected next mount (retrieve)" thread="MainThread" tapeDrive="I4600233" tapePool="r_alice_raw" tapeVid="I44533" vo="REPACK" mediaType="3592JC7T" labelFormat="0000" vendor="IBM" mountType="RETRIEVE" existingMountsDistinctTypeForThisTapepool="11" existingMountsBasicTypeForThisVo="6" bytesQueued="885514969125" minBytesToWarrantMount="500000000000" filesQueued="500" minFilesToWarrantMount="10000" oldestJobAge="3" youngestJobAge="3" minArchiveRequestAge="300" getMountInfoTime="0.090650" queueTrimingTime="0.000000" getTapeInfoTime="0.866431" candidateSortingTime="0.006574" getTapeForWriteTime="0.142006" decisionTime="0.000645" mountCreationTime="0.002901" driveStatusSetTime="0.000001" schedulerDbTime="0.093552" catalogueTime="1.008437" 

Possible causes

The logic in the scheduler seems fine. So the only plausible cause is a problem with the Objectstore during the OStoreDB::fetchMountInfo call, as it is the one that feeds the Scheduler::getNextMount/Scheduler::sortAndGetTapesForMountInfo logic in the scheduler.

See comments below.

Edited by Pablo Oliver Cortes