[scheduler] Implement repack workflows for Relational DB based scheduler

Description

Implements repack workflows for the Relational DB scheduler backend.

OStoreDB

  • no significant changes

Scheduler

  • added a few log lines and a notification about the exception thrown when something goes wrong during setExpandStartedAndChangeStatus

rdbms/postgres

  • extended many of the queries that are reused for the repack use-case so that they target the REPACK_ table names instead of the plain ARCHIVE/RETRIEVE_PENDING/ACTIVE/FAILED_QUEUE tables used for user data transfers.
  • ArchiveJobQueue: extended the updateJobStatus() call for the repack use-case. For repack, when we report a successful archival, this query checks all sibling rows/jobs with the same archive_file_id which needed to be archived too. If all rows with the same archive_file_id, except the one currently being updated, are in AJS_ToReportToRepackForSuccess, then all of them are updated to ReadyForDeletion. This signals to the next step that the source file can be deleted from disk. Otherwise the status of the current row is updated just to AJS_ToReportToRepackForSuccess. If the requested status is anything other than AJS_ToReportToRepackForSuccess, the row is simply updated to that status.
  • in ArchiveJobQueue added ArchiveQueueJob object to handle multi-copy cases similarly to what is done in objectstore
  • in RetrieveJobQueue and RetrieveJobQueue added insertBatch() method for efficient queueing of the repack requests
  • adding RepackRequestTracker to handle the repack job rows
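
The sibling check described for updateJobStatus() can be illustrated with a minimal in-memory sketch. This is not the actual SQL query: the JobRow struct, the updateJobStatusSketch() function and the in-memory vector are hypothetical stand-ins for the queue table, and only the two status names are taken from the description above.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for one row of the repack archive queue.
struct JobRow {
  long jobId;
  long archiveFileId;
  std::string status;
};

// Applies the status rule described above to the in-memory rows.
// Returns the final status of the updated job (empty if the job is unknown).
std::string updateJobStatusSketch(std::vector<JobRow>& rows, long jobId,
                                  const std::string& newStatus) {
  const std::string kSuccess = "AJS_ToReportToRepackForSuccess";
  const std::string kReadyForDeletion = "ReadyForDeletion";

  JobRow* target = nullptr;
  for (auto& row : rows) {
    if (row.jobId == jobId) { target = &row; break; }
  }
  if (target == nullptr) return std::string();

  // Any status other than the repack success status is applied as-is.
  if (newStatus != kSuccess) {
    target->status = newStatus;
    return target->status;
  }

  // Check whether every sibling (same archive file, other job) already succeeded.
  const long fileId = target->archiveFileId;
  bool allSiblingsSucceeded = true;
  for (const auto& row : rows) {
    if (row.archiveFileId == fileId && row.jobId != jobId && row.status != kSuccess) {
      allSiblingsSucceeded = false;
      break;
    }
  }

  if (allSiblingsSucceeded) {
    // All copies are archived: mark every row of this file as ready for deletion.
    for (auto& row : rows) {
      if (row.archiveFileId == fileId) row.status = kReadyForDeletion;
    }
  } else {
    target->status = kSuccess;
  }
  return target->status;
}
```

In this sketch a file with a single copy has no siblings, so its row goes straight to ReadyForDeletion; whether the real query treats that case the same way is an assumption.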

PG schema

  • added 'Cancelled' job status - wishful thinking and preparation for future Garbage collection workflows, currently not used (was used during dev for some testing only)
  • creating REPACK_ + ACTIVE/PENDING/FAILED tables for repack use-case
  • in ARCHIVE_ACTIVE_QUEUE
    • adding IS_SLEEPING for sleep queue management
    • REPACK_REQUEST_ID used for repack request management in REPACK_ARCHIVE_ACTIVE_QUEUE and sibling tables
  • in RETRIEVE_ACTIVE_QUEUE
    • adding IS_SLEEPING for sleep queue management
    • REPACK_REQUEST_ID, REPACK_REARCHIVE_COPY_NBS and REPACK_REARCHIVE_TAPE_POOLS used for repack request management in REPACK_RETRIEVE_ACTIVE_QUEUE and sibling tables
  • created REPACK_REQUEST_TRACKING table for tracking each repack request and its status
  • created REPACK_REQUEST_DESTINATION_STATISTICS table for tracking the statistics of the destination tapes where the files were repacked for each repack request
  • created DISK_SYSTEM_SLEEP_TRACKING to manage queue sleep for a particular disk system
  • created REPACK_ARCHIVE_QUEUE_SUMMARY and REPACK_RETRIEVE_QUEUE_SUMMARY tables

Archive/Retrieve Request

  • makeJobRow() method facilitates building the vector of rows to be inserted when jobs are available in batches
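
A rough sketch of the idea, under the assumption that each request can produce one row and that the rows are collected into a vector before a single batch insert. The struct names, fields and the collectRows() helper are hypothetical and do not reflect the real class layout.

```cpp
#include <string>
#include <vector>

// Hypothetical row layout; the real tables carry many more columns.
struct JobRowSketch {
  long archiveFileId;
  int copyNb;
  std::string tapePool;
};

// Hypothetical request object exposing a makeJobRow()-style method.
struct RequestSketch {
  long archiveFileId;
  int copyNb;
  std::string tapePool;

  JobRowSketch makeJobRow() const {
    return {archiveFileId, copyNb, tapePool};
  }
};

// Collects one row per request so that all rows can be passed to a single batch insert.
std::vector<JobRowSketch> collectRows(const std::vector<RequestSketch>& requests) {
  std::vector<JobRowSketch> rows;
  rows.reserve(requests.size());
  for (const auto& request : requests) {
    rows.push_back(request.makeJobRow());
  }
  return rows;
}
```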

RelationalDB

  • getDefaultRepackVo() gets the repack VO from the catalogue. In case the VO for the Archive/Retrieve Mount corresponds to the repack VO, the respective jobs will be queued to the REPACK_ tables instead of the user tables (without REPACK_ prefix)
  • fetchRepackInfo() - collects info from the repack tracking table as well as the statistics of the destination tapes
  • cancelRepack() - cancels any repack which is not in status running and removes all its rows from the PENDING or FAILED tables as well as from the tracking table itself. The next step is to implement the functionality to cancel ongoing repack requests gracefully. This feature was never used.
  • promotePendingRequestsForExpansion() - method prompts expansion (changes DB state which triggers queueing of all the retrieve jobs) for a given number of repack requests
  • getNextRepackJobToExpand() - gets the jobs eligible for expansion
  • getNext*RepackReportBatch() methods for reporting: successes of retrieval use the transformJobBatchToArchive() method to query the DB and to transform and move the retrieve rows into archive table rows; successes of archival check whether all jobs were archived for that archive ID and, if so, delete the files on disk; failures of retrieve/archive move the rows into the FAILED table graveyard
  • updateRepackRequestsProgress() - updates the tracking and destination stat tables with the progress of the repack operation
  • deleteDiskFiles() - deletes the files from the disk buffer for successful archive repack jobs (not done for failures)
  • DiskSleepEntry, insertOrUpdateDiskSleepEntry(), getDiskSystemSleepStatus(), removeDiskSystemSleepEntries(), getActiveSleepDiskSystemNamesToFilter() methods for queue sleep logic based on the disk system name
  • RepackRequest::addSubrequestsAndUpdateStats() - has been refactored completely to avoid the goto statements, while keeping the logic of the objectstore version of this method. We might actually think of moving this method out of the particular implementation of the Scheduler DB and keeping it on the Scheduler logic level
  • several other RepackRequest methods (insert(), failed(), etc.) to handle basic DB operations with the repack request
  • RepackJobStatus vs RepackRequestStatus - both statuses were introduced in the past as a copy-paste from objectstore, but so far I see no need for both of them - keeping them for now just to check that I did not miss anything; we shall remove them later and have only 1 status type to work with.
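
The queue sleep logic based on the disk system name could look roughly like the sketch below. It assumes that a sleep entry stores a start time and a duration in seconds; the field names, the DiskSleepEntrySketch struct and the activeSleepingDiskSystems() helper are hypothetical and are not the actual DISK_SYSTEM_SLEEP_TRACKING columns or method signatures.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sleep entry for one disk system.
struct DiskSleepEntrySketch {
  long sleepStartTime;  // seconds since epoch (assumed unit)
  long sleepDuration;   // seconds (assumed unit)
};

// Returns the names of the disk systems that are still sleeping at time `now`.
// Jobs targeting these disk systems would be filtered out when picking work.
std::vector<std::string> activeSleepingDiskSystems(
    const std::map<std::string, DiskSleepEntrySketch>& entries, long now) {
  std::vector<std::string> names;
  for (const auto& entry : entries) {
    const long wakeUpTime = entry.second.sleepStartTime + entry.second.sleepDuration;
    if (now < wakeUpTime) {
      names.push_back(entry.first);
    }
  }
  return names;
}
```

Entries whose wake-up time has passed are simply ignored here; in the description above they are presumably cleaned up by removeDiskSystemSleepEntries().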

TapeMountDecisionInfo

  • flagging the mount type as repack mount (i.e. asking it to work with the REPACK_ tables only) in case the default repack VO configured in the catalogue corresponds to the VO requested for this mount.
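
The VO-based routing can be summarised with a small sketch. The queueTableName() function and its parameters are hypothetical; only the REPACK_ prefix convention and the comparison against the default repack VO come from the description above.

```cpp
#include <string>

// Returns the table a job should be queued to: the REPACK_-prefixed table when the
// VO of the mount matches the default repack VO, the plain user table otherwise.
// An empty defaultRepackVo is treated as "no repack VO configured" (an assumption).
std::string queueTableName(const std::string& baseTable,
                           const std::string& mountVo,
                           const std::string& defaultRepackVo) {
  const bool isRepackMount = !defaultRepackVo.empty() && mountVo == defaultRepackVo;
  if (isRepackMount) {
    return "REPACK_" + baseTable;
  }
  return baseTable;
}
```

For example, queueTableName("ARCHIVE_ACTIVE_QUEUE", vo, repackVo) yields REPACK_ARCHIVE_ACTIVE_QUEUE only when vo equals repackVo.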

CI stress test

  • separating delete_files_from_eos_and_tapes logic just for convenience (just to be allowed to use it at different places if needed in client_stress_ar.sh)
  • adding -d flag to repack_systemtest.sh making printout of the DEBUG information about all tape content optional and enabling it everywhere in order not to change the logic of the current CI tests.
  • repack_helper.sh adding method to list array of VIDs with files eligible for repacking
  • stress_test.sh adding a repack stress test case after the archive and retrieve test is over assuming no files will be deleted at the end of this previous step

Checklist

  • Documentation reflects the changes made.
  • Merge Request title is clear, concise, and suitable as a changelog entry. See this link

References

Closes #1228

Edited by Jaroslav Guenther
