Repack VO not respecting the `readmaxdrive` limits
Summary
Repack VO readmaxdrives limit is not respected when repacking tapes from multiple VOs/ tape pools. After an initial look at the code all the logic seems correct.
Steps to reproduce
Add repacks from different VOs/tape pools exceeding the rmd
limit.
What is the current bug behaviour?
The readmaxdrives
value is not respected.
What is the expected correct behaviour?
The readmaxdrives
value should not be violated.
Relevant logs and/or screenshots
Limit is set to 5, read mounts went over the limit:
Prior to the last mount for retrieve (tpsrv015) all the repacks had been going for at least 2 hours. Also, the monitoring was reporting 4 mounts for a period of time right before this mount when we had 5.
[root@ctaproductionfrontend02 ~]# cta-admin dr ls | grep REPACK
IBM455 I4550832 tpsrv017 Up Retrieve Transfer 44107 I41904 r_na62_1 REPACK 27595 13.1T 296.3 928500 0 - 7 -
IBM455 I4550834 tpsrv015 Up Retrieve Transfer 938 I42042 r_backup_zvalt1 REPACK 251 273.7G 284.2 928766 0 - 15 -
IBM455 I4550843 tpsrv013 Up Retrieve Transfer 8136 I44811 r_alice_raw REPACK 1707 3.1T 382.6 928736 0 - 4 -
IBM460 I4600244 tpsrv478 Up Retrieve Transfer 8103 I46015 r_cms_b4run2 REPACK 853 2.0T 249.0 928738 0 - 5 -
IBM460 I4601434 tpsrv004 Up Retrieve Transfer 8085 I45006 r_cms_b4run2 REPACK 798 2.4T 293.2 928737 0 - 13 -
IBM460 I4601441 tpsrv437 Up Retrieve Transfer 10932 I40738 r_alice_raw REPACK 6284 5.8T 527.6 928665 0 - 14 -
Looking into the cta-taped logs we can see that the reported amount of retrive mounts for the REPACK vo is 2 (existingMountsBasicTypeForThisVo="2"
)
[1692107539.852782000] Aug 15 15:52:19.852782 tpsrv015.cern.ch cta-taped: LVL="INFO" PID="31301" TID="31301" MSG="In Scheduler::getNextMount(): Selected next mount (retrieve)" thread="MainThread" tapeDrive="I4550834" tapePool="r_backup_zvalt1" tapeVid="I42042" vo="REPACK" mediaType="3592JC7T" labelFormat="0000" vendor="IBM" mountType="RETRIEVE" existingMountsDistinctTypeForThisTapepool="0" existingMountsBasicTypeForThisVo="2" bytesQueued="1090519040000" minBytesToWarrantMount="500000000000" filesQueued="1000" minFilesToWarrantMount="10000" oldestJobAge="3" youngestJobAge="1" minArchiveRequestAge="300" getMountInfoTime="0.116467" queueTrimingTime="0.000000" getTapeInfoTime="1.259419" candidateSortingTime="0.011519" getTapeForWriteTime="0.143288" decisionTime="0.000717" mountCreationTime="0.002645" driveStatusSetTime="0.000000" schedulerDbTime="0.119112" catalogueTime="1.402707"
One day ago it also happened and some logs showed something completly different, rmd should be 5, retrive mouts are 6, the scheduler decides to go forward with the new mount when it should have been filtered by the logic in the code:
[1691913169.304161000] Aug 13 09:52:49.304161 tpsrv467.cern.ch cta-taped: LVL="INFO" PID="52653" TID="52653" MSG="In Scheduler::getNextMount(): Selected next mount (retrieve)" thread="MainThread" tapeDrive="I4600233" tapePool="r_alice_raw" tapeVid="I44533" vo="REPACK" mediaType="3592JC7T" labelFormat="0000" vendor="IBM" mountType="RETRIEVE" existingMountsDistinctTypeForThisTapepool="11" existingMountsBasicTypeForThisVo="6" bytesQueued="885514969125" minBytesToWarrantMount="500000000000" filesQueued="500" minFilesToWarrantMount="10000" oldestJobAge="3" youngestJobAge="3" minArchiveRequestAge="300" getMountInfoTime="0.090650" queueTrimingTime="0.000000" getTapeInfoTime="0.866431" candidateSortingTime="0.006574" getTapeForWriteTime="0.142006" decisionTime="0.000645" mountCreationTime="0.002901" driveStatusSetTime="0.000001" schedulerDbTime="0.093552" catalogueTime="1.008437"
Possible causes
The logic in the scheduler seems fine. So the only trouble causing this should be some problems with the Objectstore during the OStoreDB::fetchMountInfo
call as it is the one that feeds the Scheduler::getNextMount
//Schduler::sortAndGetTapesForMountInfo
logic in the scheduler
See comments below.