[scheduler] Implement repack workflows for Relational DB based scheduler
Description
Implements repack workflows for Relational DB scheduler backend.
OStoreDB
- no significant changes
Scheduler
- added a few log lines and a notification about the exception being thrown in case something goes wrong during `setExpandStartedAndChangeStatus`
rdbms/postgres
- extended many queries reused for the repack use-case by targeting the `REPACK_` table names instead of the plain `ARCHIVE/RETRIEVE_PENDING/ACTIVE/FAILED_QUEUE` tables used for user data transfers
- `ArchiveJobQueue`: extended the `updateJobStatus()` call for the repack use-case. For repack, when we report a successful archival, this query handles the check of all other sibling rows/jobs with the same `archive_file_id` which needed to be archived too. If all the rows with the same `archive_file_id` are in `AJS_ToReportToRepackForSuccess` except the one currently being updated, then all of them are updated to `ReadyForDeletion`; this signals the next step that the source file can be deleted from disk. Otherwise it updates the status just to `AJS_ToReportToRepackForSuccess`. If the required update status is anything other than `AJS_ToReportToRepackForSuccess`, it simply updates to that status.
- `ArchiveJobQueue`: added an `ArchiveQueueJob` object to handle multi-copy cases, similarly to what is done in the objectstore
- `RetrieveJobQueue`: added an `insertBatch()` method for efficient queueing of the repack requests
- added `RepackRequestTracker` to handle the repack job rows
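The sibling-row check described for `updateJobStatus()` can be sketched as a small simulation. This is illustrative Python only, not the actual C++/SQL implementation; the row fields are assumptions, while the status names are taken from the description above.

```python
# Illustrative simulation of the repack-aware updateJobStatus() logic.
# A row is a dict with 'job_id', 'archive_file_id' and 'status'.

AJS_SUCCESS = "AJS_ToReportToRepackForSuccess"
READY_FOR_DELETION = "ReadyForDeletion"

def update_job_status(rows, job_id, new_status):
    job = next(r for r in rows if r["job_id"] == job_id)
    if new_status != AJS_SUCCESS:
        # Any other target status: plain update of this one job.
        job["status"] = new_status
        return
    siblings = [r for r in rows
                if r["archive_file_id"] == job["archive_file_id"]
                and r["job_id"] != job_id]
    if all(s["status"] == AJS_SUCCESS for s in siblings):
        # Every copy of this archive file is now archived:
        # mark the whole family ready for source-file deletion.
        for r in siblings + [job]:
            r["status"] = READY_FOR_DELETION
    else:
        job["status"] = AJS_SUCCESS
```

In the real backend this is a single SQL statement against the `REPACK_` queue tables; the simulation only captures the branching on the sibling statuses.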
PG schema
- added a `Cancelled` job status: wishful thinking and preparation for future garbage-collection workflows; currently not used (it was only used during development for some testing)
- created `REPACK_` + `ACTIVE/PENDING/FAILED` tables for the repack use-case
- in `ARCHIVE_ACTIVE_QUEUE`: added `IS_SLEEPING` for sleep-queue management, and `REPACK_REQUEST_ID` used for repack request management in `REPACK_ARCHIVE_ACTIVE_QUEUE` and its sibling tables
- in `RETRIEVE_ACTIVE_QUEUE`: added `IS_SLEEPING` for sleep-queue management, and `REPACK_REQUEST_ID`, `REPACK_REARCHIVE_COPY_NBS` and `REPACK_REARCHIVE_TAPE_POOLS` used for repack request management in `REPACK_RETRIEVE_ACTIVE_QUEUE` and its sibling tables
- created the `REPACK_REQUEST_TRACKING` table for tracking each repack request and its status
- created the `REPACK_REQUEST_DESTINATION_STATISTICS` table for tracking the statistics of the destination tapes onto which the files were repacked, for each repack request
- created `DISK_SYSTEM_SLEEP_TRACKING` to manage queue sleep for a particular disk system
- created the `REPACK_ARCHIVE_QUEUE_SUMMARY` and `REPACK_RETRIEVE_QUEUE_SUMMARY` tables
Archive/Retrieve Request
- the `makeJobRow()` method facilitates forming the vector of rows to be inserted when bunches are available
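The row-batching idea can be sketched as follows. This is a hypothetical Python illustration, not the project's C++ code: the field names, the `(request_id, archive_file_id, copy_nbs)` input shape and the default status are all assumptions made for the example.

```python
# Hypothetical sketch of makeJobRow()-style batching: each request contributes
# one row per tape copy, and all rows are collected into a single list so the
# insert can happen in one insertBatch()-style statement.

def make_job_row(request_id, archive_file_id, copy_nb, status="AJS_ToTransferForRepack"):
    # One queue-table row; field names are illustrative only.
    return {
        "request_id": request_id,
        "archive_file_id": archive_file_id,
        "copy_nb": copy_nb,
        "status": status,
    }

def make_batch(requests):
    """requests: iterable of (request_id, archive_file_id, copy_nbs) tuples."""
    rows = []
    for request_id, archive_file_id, copy_nbs in requests:
        for copy_nb in copy_nbs:
            rows.append(make_job_row(request_id, archive_file_id, copy_nb))
    return rows
```

Batching the rows this way is what lets the backend replace per-job inserts with one multi-row statement.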
RelationalDB
- `getDefaultRepackVo()`: gets the repack VO from the catalogue. In case the VO for the Archive/Retrieve Mount corresponds to the repack VO, the respective jobs will be queued to the `REPACK_` tables instead of the user tables (without the `REPACK_` prefix)
- `fetchRepackInfo()`: collects the info from the repack tracking table, as well as the info about the destination tape statistics
- `cancelRepack()`: cancels any repack which is not in status running and removes all the rows from the PENDING or FAILED tables, as well as from the tracking table itself. The next step is to implement the functionality to cancel ongoing repack requests gracefully. This feature was never used.
- `promotePendingRequestsForExpansion()`: prompts expansion (changes the DB state, which triggers queueing of all the retrieve jobs) for a given number of repack requests
- `getNextRepackJobToExpand()`: gets the jobs eligible for expansion
- `getNext*RepackReportBatch()` methods for reporting: successes of retrieval use the `transformJobBatchToArchive()` method to query the DB and transform and move the retrieve rows into archive table rows; successes of archival check whether all jobs were archived for that archive ID and, if so, delete the files on disk; failures of retrieve/archive move the rows into the `FAILED` table graveyard
- `updateRepackRequestsProgress()`: updates the tracking and destination statistics tables with the progress of the repack operation
- `deleteDiskFiles()`: deletes the files from the disk buffer for successful archive repack jobs (not done for failures)
- `DiskSleepEntry`, `insertOrUpdateDiskSleepEntry()`, `getDiskSystemSleepStatus()`, `removeDiskSystemSleepEntries()`, `getActiveSleepDiskSystemNamesToFilter()`: methods for the queue sleep logic based on the disk system name
- `RepackRequest::addSubrequestsAndUpdateStats()`: refactored completely to avoid the `goto` statements, while keeping the logic of the same method used in the objectstore version. We might consider moving this method out of the particular Scheduler DB implementation and keeping it at the Scheduler logic level
- several other `RepackRequest` methods (`insert()`, `failed()`, etc.) handle basic DB operations with the repack request
- `RepackJobStatus` vs `RepackRequestStatus`: both statuses were introduced in the past as a copy-paste from the objectstore, but so far I see no need for both of them; keeping them for now just to see if I did not miss anything, and we shall remove them later to have only one status type to work with.
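The queue-sleep bookkeeping (`DiskSleepEntry` and its helper methods) can be sketched as a small simulation. This is illustrative Python with an in-memory dict standing in for `DISK_SYSTEM_SLEEP_TRACKING`; the expiry rule (sleep start plus a duration) and all field names are assumptions, not the project's actual schema.

```python
# Illustrative simulation of disk-system sleep tracking: a mount can skip
# queues whose disk system is currently "sleeping" (e.g. its buffer is full).
import time

_sleep_table = {}  # disk_system_name -> (sleep_start_epoch_s, duration_s)

def insert_or_update_disk_sleep_entry(name, duration_s, now=None):
    # insertOrUpdateDiskSleepEntry() analogue: upsert a sleep entry.
    _sleep_table[name] = (now if now is not None else time.time(), duration_s)

def get_active_sleep_disk_system_names_to_filter(now=None):
    # getActiveSleepDiskSystemNamesToFilter() analogue: return the names
    # still sleeping, dropping expired entries on the way
    # (the removeDiskSystemSleepEntries() analogue).
    now = now if now is not None else time.time()
    active, expired = [], []
    for name, (start, duration) in _sleep_table.items():
        (active if now < start + duration else expired).append(name)
    for name in expired:
        del _sleep_table[name]
    return sorted(active)
```

A scheduler pass would fetch the active names once and exclude the matching queues from mount candidates for that pass.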
TapeMountDecisionInfo
- flagging the mount type as a repack mount (i.e. asking it to work with the `REPACK_` tables only) in case the default repack VO configured in the catalogue corresponds to the VO requested for this mount.
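The table-selection decision above boils down to a single VO comparison. A minimal sketch, assuming hypothetical names (the real code flags the mount object rather than computing a table name directly):

```python
# Minimal sketch: a mount is a repack mount exactly when its VO matches the
# default repack VO configured in the catalogue, and repack mounts target the
# REPACK_-prefixed queue tables.

def queue_table_for_mount(base_table, mount_vo, default_repack_vo):
    is_repack_mount = (mount_vo == default_repack_vo)
    return ("REPACK_" if is_repack_mount else "") + base_table
```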
CI stress test
- separated the `delete_files_from_eos_and_tapes` logic, just for convenience (to allow using it in different places of `client_stress_ar.sh` if needed)
- added a `-d` flag to `repack_systemtest.sh`, making the printout of the DEBUG information about all tape content optional, and enabled it everywhere in order not to change the logic of the current CI tests
- `repack_helper.sh`: added a method to list the array of VIDs with files eligible for repacking
- `stress_test.sh`: added a repack stress-test case after the archive and retrieve test is over, assuming no files will be deleted at the end of this previous step
Checklist
- Documentation reflects the changes made.
- Merge Request title is clear, concise, and suitable as a changelog entry. See this link
References
Closes #1228
Edited by Jaroslav Guenther