[Main ticket] Split objectstore backend between repack and user instances
Situation
Over the last year, we have worked on improving our repack procedure, on both the development and operations sides.
This involved several actions, in particular:
- #83 (closed): Setup new tape state REPACKING
- #31 (closed): Allow VO override for repack
- #546 (closed): Limit the number of repack sub-requests that can be expanded at the same moment
- Link to all repack issues
All of these changes introduced a logical separation between user and repack jobs: they now belong to different domains - User and Operations - as illustrated here.
However, they are not yet fully independent. Because they share a single object store, performance issues in the repack domain can directly affect the quality of service for end users.
This has been demonstrated on several occasions in production (for example, with larger-than-acceptable locking times: https://gitlab.cern.ch/cta/operations/-/issues/1186).
Due to these performance issues, and to safeguard the quality of service provided to our users, we should separate the object store into two independent backends: one for repack, the other for user operations.
Furthermore, separating these backends will make it simpler to migrate the backend to PostgreSQL in the near future. Instead of migrating the whole backend at once, we can test it first with repack.
Task
- Create a separate objectstore for the repack queues backend.
- Separate the production tape servers into 2 different CTA instances, sharing the same CTA catalogue (see the configuration sketch below):
  - The current production instance will be used for user/experiment queued jobs, with its 2 current frontends.
  - The new productionrepack instance will be used only for repack queues, with 2 new frontends (ctaproductionrepackfrontendops and ctaproductionrepackfrontenddev).
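As a rough illustration of the split, the two frontend flavours would differ only in the objectstore they point to, while both keep using the same catalogue. This is a minimal sketch assuming the usual cta-frontend-xrootd.conf key name (cta.objectstore.backendpath); the rados URLs and Ceph user/pool/namespace names are purely illustrative.

```
# /etc/cta/cta-frontend-xrootd.conf on the current 'production' frontends (user/experiment jobs)
cta.objectstore.backendpath rados://cta-user@cta-production:cta-ns

# /etc/cta/cta-frontend-xrootd.conf on the new 'productionrepack' frontends (repack jobs only)
cta.objectstore.backendpath rados://cta-repack@cta-repack-pool:repack-ns

# Both instances keep pointing at the same CTA catalogue,
# e.g. via an identical connection string in /etc/cta/cta-catalogue.conf.
```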
Actions
Configure a separate 'repack' dedicated objectstore
The first step is to set up a new Ceph instance for the repack activities.
This backend is needed before proceeding with any other changes/migrations.
Reference issue: https://gitlab.cern.ch/cta/operations/-/issues/1195
Owner: @jleduc
Completed and tested: YES
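Once the Ceph pool/namespace exists, the new backend can be initialised and smoke-tested with the objectstore command-line tools. A sketch, assuming the cta-objectstore-initialize and cta-objectstore-list tools from the cta-objectstore-tools package; the rados URL is an illustrative placeholder, the real Ceph user/pool/namespace are defined in Ops#1195.

```
# Initialise the new, repack-only objectstore and check that it is reachable
cta-objectstore-initialize rados://cta-repack@cta-repack-pool:repack-ns
cta-objectstore-list       rados://cta-repack@cta-repack-pool:repack-ns
```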
'cta-taped' service multiple drives / schedulerDB
This will add support for multiple tape drives, where each cta-taped service runs independently with its own configuration file.
It will be the basis for supporting different objectstore configurations.
Reference issue: https://gitlab.cern.ch/cta/operations/-/issues/1254
Owner: @jleduc @poliverc
Completed and tested: YES
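To make the idea concrete, a sketch of the per-drive layout; the configuration file names and systemd unit naming below are illustrative assumptions, the actual conventions are defined in Ops#1254.

```
# One cta-taped process per drive, each reading its own configuration file (illustrative names)
/etc/cta/cta-taped-DRIVE00.conf
/etc/cta/cta-taped-DRIVE01.conf

# Each drive can then be started/stopped independently, e.g. as separate systemd units
systemctl start cta-taped@DRIVE00
systemctl start cta-taped@DRIVE01
```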
Mechanism to support multiple object stores
On the dev side, we must allow the CTA nodes (tape servers and frontends) to switch easily between multiple objectstore configurations.
Reference issue: #569 (closed)
Owner: @poliverc
Completed and tested: YES
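With per-drive configuration files in place, pointing a node (or an individual drive) at a different objectstore reduces to changing its backend path. A sketch, assuming cta-taped.conf's ObjectStore BackendPath option; the rados URLs are the same illustrative placeholders as above.

```
# /etc/cta/cta-taped-DRIVE00.conf - drive working against the user objectstore
ObjectStore BackendPath rados://cta-user@cta-production:cta-ns

# /etc/cta/cta-taped-DRIVE01.conf - drive working against the repack objectstore
ObjectStore BackendPath rados://cta-repack@cta-repack-pool:repack-ns
```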
Read current TPCONFIG configuration from cta-taped.conf
TPCONFIG is going away, and it has been decided to move those configuration lines into cta-taped's config file.
Reference issue: #576 (closed)
Owner: @poliverc
Completed and tested: YES
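For reference, a before/after sketch of the same drive definition. The TPCONFIG columns (drive name, logical library, device file, library slot) follow the current format; the cta-taped.conf key names shown are assumptions for illustration only, the final syntax is the one defined in CTA#576.

```
# Before: one line per drive in /etc/cta/TPCONFIG
# DriveName  LogicalLibrary  DeviceFile  LibrarySlot
DRIVE00      LIB00           /dev/nst0   smc0

# After (sketch): the same information carried as cta-taped.conf options
taped DriveName           DRIVE00
taped DriveLogicalLibrary LIB00
taped DriveDevice         /dev/nst0
taped DriveControlPath    smc0
```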
Add configuration on CTA frontend to block user and/or repack operations
Do not allow repack operations on the user objectstore backend, and vice versa.
This should be controlled by a configuration option on the CTA frontend.
Reference issue: #573 (closed)
Owner: @poliverc
Completed and tested: YES
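A sketch of what such a frontend toggle could look like; the option names below are hypothetical placeholders, the real option names are those introduced in CTA#573.

```
# User/experiment frontend: accept user requests, reject repack requests (placeholder option names)
cta.user_requests.allowed   on
cta.repack_requests.allowed off

# Repack frontend: the opposite
cta.user_requests.allowed   off
cta.repack_requests.allowed on
```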
Include 'instanceName' and 'schedulerBackendName' in CTA logs
Reference issue: #588 (closed)
Owner: @poliverc
Completed and tested: YES
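For illustration only, the kind of key-value pairs this adds to each log record (values here are made up); once the two objectstores coexist, this lets us filter logs per instance and per scheduler backend.

```
# Fragment of a CTA log record (illustrative values)
... instanceName="productionrepack" schedulerBackendName="repack" ...
```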
Pre-production deployment of the CTA repack instance
In order to update and/or configure all the operational and monitoring tools for the new repack instance, we first need to deploy this new configuration in pre-production.
This will also allow us to test everything properly in a safe environment.
Reference issue: TBD
Owner: TBD
Completed and tested: YES
Configure operation/monitoring tools for the new repack instance
This is a prerequisite before the production deployment is set up.
Reference issue: TBD
Owner: TBD
Completed and tested: YES
Production deployment of the CTA repack instance
Final deployment in production
Reference issue: TBD
Owner: TBD
Completed and tested: YES
Action dependency graph
```mermaid
graph LR
  REP_A[Ops#1195 \n Configure a separate 'repack' dedicated objectstore]
  REP_B[CTA#569 \n Mechanism to support multiple object stores]
  REP_C[Pre-production deployment \n of the CTA repack instance]
  REP_D[Configure operation/monitoring \n tools for the new repack instance]
  REP_E[Production deployment of \n the CTA repack instance]
  REP_F[CTA#573 \n Add configuration on CTA frontend to \n block user and/or repack operations]
  REP_G[Ops#1254 \n 'cta-taped' service multiple drives / schedulerDB]
  REP_H[CTA#576 \n Read current TPCONFIG configuration from cta-taped.conf]
  REP_I[CTA#588 \n Include 'instanceName' and 'schedulerBackendName' in CTA logs]
  REP_A --> REP_C
  REP_G --> REP_B
  REP_B --> REP_C
  REP_F --> REP_B
  REP_C --> REP_E
  REP_D --> REP_E
  REP_H --> REP_B
  REP_I --> REP_D
```