Skip to content

[PGSCHED] design scheduler DB structure for archive job queue

Currently all info from the objectstore related to any job has been put into one row of the new database schema.

I suggest to go through all the columns to understand more globally:

  1. origin of the information per column (clarifies the worlflow)
  2. availability when filled/fill order (helps prevent updates [1][2] and rather do inserts when needed)
  3. which filter fields we use to filter rows systematically for selection ? (helps understand which fields should be indexed)
  4. which we systematically select for full job execution after insertion ?
  5. systematic update ? (helps to recognise where updates are really necessary and where we can insert a new row instead [1][2])
  6. mountID vs TapePool concepts could direct the Archive and Retrieve jobs in separate tables (execution related tables rather than queue related)
  7. we design a queuing system - throughput is essential !
  8. anything else you might think of ?

These should help us design a Archive related table(s) in the new PGSCHED DB schema.

The current tables have the following form (will try to make mermaid diagrams later to make the picture clearer):

TABLE:
REPACK repack_reqid status create_time mount_policy vid is_complete is_no_recall is_move is_expand_started is_expand_finished is_add_copies create_username create_host user_provided_files user_provided_bytes total_files_to_retrieve total_bytes_to_retrieve total_files_to_archive total_bytes_to_archive retrieved_files retrieved_bytes archived_files archived_bytes repack_finished_time last_expanded_fseq buffer_url failed_to_create_archive_req failed_to_archive_files failed_to_archive_bytes failed_to_retrieve_files failed_to_retrieve_bytes subreq_pb destinfo_pb
RETRIEVE job_id retrieve_reqid status creation_time mount_policy vid mount_id starttime priority storage_class retrieve_min_req_age active_copy_nb size_in_bytes archive_file_id checksumblob requester_name requester_group dst_url disk_instance disk_file_path disk_file_id disk_file_gid disk_file_owner_uid disk_system_name repack_reqid is_repack retrieve_error_report_url failure_report_url failure_report_log activity is_failed is_verify srr_activity srr_mount_policy srr_time srr_host srr_username lt_create lt_first_selected lt_completed rr_repack retrievejob_pb info_pb
ARCHIVE job_id archive_reqid status creation_time mount_policy vid mount_id start_time priority storage_class min_archive_request_age copy_nb size_in_bytes archive_file_id checksumblob requester_name requester_group src_url disk_instance disk_file_path disk_file_id disk_file_gid disk_file_owner_uid repack_reqid is_repack archive_error_report_url archive_report_url failure_report_log failure_log repack_dest_vid is_reportdecided total_retries max_total_retries retries_within_mount max_retries_within_mount last_mount_with_failure total_report_retries max_report_retries tape_pool repack_filebuf_url repack_fseq

[1] https://malisper.me/postgres-heap-only-tuples/

[2] https://www.postgresql.org/docs/current/storage-hot.html

Related also to issue #604 (closed).

Edited by Jaroslav Guenther