[Meta ticket] Helm CI integration Overview

Problem to solve

The work done by Damian with Helm was a good step forward: it improved our CI, brought some interesting features into it, and simplified it. However, there are still too many scripts and too much logic that could be moved into charts to simplify things even further, and we still have sharp edges around.

Now that we are done with all Alma9 migration-related changes, it is time to step up our CI game. >:)

High Level Targets // Wishlist 😄

The following list is a set of requirements that our CI should eventually be able to meet. They are not meant to be addressed all at once, but they should be taken into account from the beginning of the iterative process of improving our CI and its integration with Helm. Otherwise we risk implementing X by doing Y, only to find later that Y is a problem for the integration of Z and has to be thrown away. Please feel free to add any other requirements.

- Cluster Configuration: the Helm charts should be able to (re)configure a deployed cluster in a way that is trivial from the developer/user perspective. Providing new values to the chart should be enough to swap between EOS versions, CTA versions, scheduler type, mhvtl configurations and disk buffer type (EOS/dCache).
- Secret Management: credentials should be inserted as "secrets" into the cluster when creating it (and updated when needed); no more passing credentials around through some path to a YAML file.
- Monitoring Integration: at some point it would be interesting to integrate our monitoring setup.
- CTA Public Ops: the public tools have grown in size since their conception, and relying on manual testing of the commands is not sustainable in the long run. The tools should have unit tests, and we should also run integration tests in the CTA repo to check that they work correctly (with triggered pipelines, like the EOS setup).
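
To make the first two items a bit more concrete, here is a rough sketch (not a proposal for the final design) of what "new values are enough" and "credentials as secrets" could look like from a CI helper. It only builds a `helm upgrade --install` command line with `--set` overrides and a plain Kubernetes `Secret` manifest. The release name, chart path, namespace, value keys (`scheduler.backend`, `eos.image.tag`) and secret name are all made-up placeholders, not existing names in our charts.

```python
import base64
import json


def helm_upgrade_command(release, chart, namespace, overrides):
    """Build a `helm upgrade --install` argv that applies `overrides` as --set flags."""
    cmd = ["helm", "upgrade", "--install", release, chart, "--namespace", namespace]
    # Sort the keys so the generated command is deterministic across runs.
    for key, value in sorted(overrides.items()):
        cmd += ["--set", f"{key}={value}"]
    return cmd


def secret_manifest(name, namespace, data):
    """Return a Kubernetes Secret manifest (as a dict) with base64-encoded values."""
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in data.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": encoded,
    }


if __name__ == "__main__":
    # Hypothetical value keys and names, for illustration only.
    overrides = {"scheduler.backend": "postgres", "eos.image.tag": "example-tag"}
    print(" ".join(helm_upgrade_command("cta", "./charts/cta", "dev", overrides)))
    manifest = secret_manifest("catalogue-credentials", "dev", {"password": "changeme"})
    print(json.dumps(manifest, indent=2))
```

The JSON manifest could be piped to `kubectl apply -f -`. Note that values containing commas would need escaping with `--set`, so a values file is probably the better option for anything non-trivial.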

Notes

Study and understand the capabilities and design of Helm before going forward. We should be able to get rid of much more scripting logic.
All setup configurations MUST be tested before merging into main. This includes:
- Stress test: https://tapeoperations.docs.cern.ch/dev/launch_stresstest/
- Pipelines, with all combinations of pipeline parameters, namely:
  - Default
  - Ninja build
  - No Oracle
  - PostgreSQL Scheduler (when it is configured)
  - Trigger Pipelines from EOS tags
- Dev environment
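
As an illustration of what "all combinations of pipeline parameters" means in practice, the sketch below enumerates a test matrix with `itertools.product`. The variable names and values (`BUILD_GENERATOR`, `USE_ORACLE`, `SCHEDULER`) are assumptions made up for this example and do not necessarily match the real pipeline variables.

```python
from itertools import product

# Hypothetical pipeline switches; the real variable names may differ.
PIPELINE_PARAMETERS = {
    "BUILD_GENERATOR": ["make", "ninja"],
    "USE_ORACLE": [True, False],
    "SCHEDULER": ["objectstore", "postgres"],
}


def pipeline_matrix(parameters):
    """Return one dict of variables per combination of parameter values."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    # Print every combination that would have to pass before merging.
    for variables in pipeline_matrix(PIPELINE_PARAMETERS):
        print(variables)
```

With two values per switch this already gives 8 pipelines, so we will probably want to decide which combinations run on every merge request and which only run on demand.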
Reproducibility of the development environment outside CERN should also be taken into account.

Edited Oct 17, 2024 by Pablo Oliver Cortes