The work done by Damian with Helm was a nice step forward: it improved our CI, brought some interesting features into it, and simplified it. But there is still a lot of scripting and logic that could be moved into the charts to simplify things even further; we still have sharp edges around.
Now that we are done with all the Alma9 migration related changes, it is time to step up our CI game. >:)
High Level Targets // Wishlist
The following list is a set of requirements that our CI should be able to meet. They do not all need to be addressed at the same time, but they should be taken into account from the beginning of the iterative process of improving our CI and its integration with Helm, so that we do not implement X by doing Y only to find later that Y is a problem for the integration of Z and has to be thrown away. Please feel free to add any other requirements.
Cluster Configuration: the Helm charts should be able to (re)configure a deployed cluster in a trivial way from the developer/user perspective. Providing new values to the chart should be enough to swap between EOS versions, CTA versions, scheduler type, mhVTL configurations and disk buffer type (EOS/dCache); see the sketch after this list.
Secret Management: credentials should be inserted as "secrets" into the cluster when creating it (and updated when needed); no more passing around some credentials-path .yaml file.
Monitoring Integration: at some point it would be interesting to integrate our monitoring setup as well.
CTA Public Ops: the public tools have grown in size since their conception, and relying on manual testing of the commands is not sustainable in the long run. The tools should have unit tests, and we should also run integration tests in the CTA repo to check that they work correctly (with triggered pipelines, like the EOS setup).
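To make the Cluster Configuration target concrete, here is a sketch of what such a values file could look like (all key names below are hypothetical placeholders, not the real chart values):

```yaml
# Sketch only -- every key here is a hypothetical placeholder.
global:
  ctaVersion: "x.y.z"       # CTA version to deploy
  eosVersion: "x.y.z"       # EOS version for the disk buffer
scheduler:
  backend: objectstore      # or: postgres
catalogue:
  backend: oracle           # or: postgres
diskBuffer:
  type: eos                 # or: dcache
mhvtl:
  libraryConfig: default    # which virtual library configuration to load
```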
Notes
Study and understand the capabilities and design of Helm before going forward. We should be able to get rid of much more of the scripting logic.
All setup configurations MUST be tested before merging into main.
The biggest first step is to clean up create_instance.sh, delete_instance.sh, and run_systemtest.sh
During clean-up, the biggest action points are as follows:
Stop relying on search/replace and do this properly with Helm values or --set flags where necessary (see the override sketch after this list)
Remove the kubectl exec commands and do this via the Helm chart as well. These commands are not reliable: if Kubernetes restarts a pod, everything breaks, because the exec commands are not re-run on restart. We should also try to clean up the init scripts as much as possible.
Ensure we rely as little as possible on the environment in which the setup is run. This means we shouldn't expect particular files in particular directories. Ideally, the only thing we should rely on is a small collection of secrets needed to access e.g. the registry. This will make the setup more portable.
Once this is done, it should be a lot easier to integrate our chart with EOS or dCache charts
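To illustrate the override workflow mentioned in the first action point (key names hypothetical): any deviation from the chart defaults lives in a small values file passed with -f, or in individual --set flags, so redeploying becomes a plain helm upgrade rather than sed-ing files in place:

```yaml
# overrides.yaml -- hypothetical keys, only to illustrate the workflow.
# Applied with `helm upgrade --install cta ./chart -f overrides.yaml`
# (or equivalent --set flags) instead of search/replace on rendered files.
catalogue:
  backend: postgres
tapeServers:
  count: 4
image:
  tag: "my-dev-tag"
```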
As I reported yesterday, we have various configuration files in CI that are pulled from various places and do not yet come from configmaps: they should all come from configmaps.
Helm should generate these from values plus a base file that Helm itself can include: Helm should filter out of the example file the configuration entries that are populated from values.
This will allow us to consume the same config file both for RPM inclusion (for example) and in CI.
The EOS side is OK, but there is nothing for the CTA side for now. We need to be able to set the CTA instance name and all the other parameters from Helm.
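A minimal sketch of the generation idea, assuming a hypothetical conf/cta-taped.conf.example shipped with the chart in a simple KEY VALUE format and a tapedConfig map in the values: lines whose key is overridden in values are filtered out of the example file and the value-driven entries are appended.

```yaml
# Sketch only -- file name, key format and values layout are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cta-taped-conf
data:
  cta-taped.conf: |-
    {{- $overrides := .Values.tapedConfig | default dict }}
    {{- range splitList "\n" (.Files.Get "conf/cta-taped.conf.example") }}
    {{- $key := . | splitList " " | first }}
    {{- if not (hasKey $overrides $key) }}
    {{ . }}
    {{- end }}
    {{- end }}
    {{- range $key, $value := $overrides }}
    {{ $key }} {{ $value }}
    {{- end }}
```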
Now that we are using Helm it should be easy to define how many tape servers we want to run: there is no reason to keep the count statically at 2. Increasing this would be interesting in the context of stress tests.
So here is an overview of the things we want in our Helm setup:
Must have
#901 (closed) Move the generation of the *.conf files in the startup scripts to config maps and allow the user to pass these config maps.
#890 (closed) Make the number of tape servers configurable
#888 (closed) A separate chart for the Catalogue DB (this only does something in case the postgres catalogue is used)
#900 (closed) A separate chart for the Scheduler DB (only for configmaps)
Should have
Reduce the number of kubectl exec commands to ensure we can smoothly redeploy/upgrade charts
Allow the EOS Helm charts to be used in our setup (see the dependency sketch after this list)
Allow the dCache Helm charts to be used in our setup
In general, reduce the amount of required scripting as much as possible
Stop relying on sed and overwrite any required values using --set (with sed it's not trivial to redeploy things and it makes things difficult to read)
Improve logging in the scripts (e.g. create_instance.sh) to make it more readable (it has been improved moderately, but is not yet up to standard)
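For the EOS/dCache chart items, the end state would presumably be plain chart dependencies in our umbrella chart, conceptually something like the following (repository URLs, versions and condition flags are placeholders, not the real ones):

```yaml
# Chart.yaml of the umbrella chart -- all values below are placeholders.
dependencies:
  - name: eos
    version: ">=0.1.0"
    repository: "https://example.invalid/eos-charts"      # placeholder URL
    condition: eos.enabled
  - name: dcache
    version: ">=0.1.0"
    repository: "https://example.invalid/dcache-charts"   # placeholder URL
    condition: dcache.enabled
```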
Nice to have
Integrate the monitoring Helm chart with the Helm charts in the CTA repo
#898 (closed) Simplify the create_instance.sh script so that it doesn't do as many things at once (things such as running particular tests should be separate)
Improve readability of the run_systemtest.sh script
Add proper labels to each pod
Be very explicit about what external resources create_instance.sh might be relying on by passing them as parameters throughout the entire chain (so also in run_systemtest.sh)
We can (and probably should) shift around what belongs in what category, so feel free to do that or add more things
On install or upgrade(?) of the charts, the currently available virtual hardware resources should be scanned and loaded into the cluster, to be consumed via lookup from the templating system. The user should then only need to specify the number of tape servers and the number of tape drives per server to deploy, with an error if the request exceeds the available resources.
In the future we could go for fancier stuff like configuring multiple libraries. This would allow us to reproduce production setups, which would be useful to easily test the supply logic scripts and probably other things.
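A rough sketch of the lookup idea, assuming the scan publishes its result as a ConfigMap in the release namespace before the chart is installed (the ConfigMap name, its keys and the values layout are all hypothetical):

```yaml
{{- /*
  Sketch only: "mhvtl-scanned-hardware" and its keys are made-up names.
  Note that `lookup` returns an empty map under `helm template`, so this
  check only runs on a real install/upgrade.
*/}}
{{- $hw := lookup "v1" "ConfigMap" .Release.Namespace "mhvtl-scanned-hardware" }}
{{- if $hw }}
{{- $available := int (get $hw.data "driveCount") }}
{{- $requested := int (mul .Values.tapeServers.count .Values.tapeServers.drivesPerServer) }}
{{- if gt $requested $available }}
{{- fail (printf "Requested %d drives but only %d are available" $requested $available) }}
{{- end }}
{{- end }}
```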
To follow up a little bit on my last comment: this is not something to do in this refactor, just context to be taken into account for the way we set up tape-server-related things.
The new mhVTL RPMs we are generating via the pipelines in https://gitlab.cern.ch/cta/cta-dependencies do not come with any configuration. The installation of the RPM expects some config files to already exist in /etc/mhvtl (this is on the host machine, not in the cluster). This is because we already provide them in https://gitlab.cern.ch/cta/minikube_cta_ci and they get placed in the config directory before the RPM installation.
I closed #177 (closed) as we should not couple the RPM to a specific library configuration. That is a problem to solve somewhere between the creation of the machine and the installation of the Helm charts. As of now, I am not sure how this would play out for changing the virtual tape hardware deployment on a cluster recreation.
Actually I think atm the refactor already "solves" this issue (correct me if I'm wrong). Essentially, the tape configuration can either be manually provided to the Helm installation or it can be automatically generated (similar to how the minikube setup generates it atm), in both cases in the form of a values.yaml file. This gives the flexibility of providing it yourself, but it also does not require the user to do additional setup beforehand if they don't want to.
This is an example output of the create_instance script:
Of course this still assumes the tape servers use only a single library configuration (i.e. we can't have multiple tape servers each with different configurations), but improving that is for later.
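For reference, an automatically generated (or hand-written) library values file could look roughly like the following; all names and devices below are made up for illustration:

```yaml
# Hypothetical example of a library configuration passed as values.
library:
  type: mhvtl
  name: VLSTK10              # made-up library name
  device: /dev/sg0           # placeholder changer device
  drives:
    - name: VDSTK01
      device: /dev/nst0
    - name: VDSTK02
      device: /dev/nst1
  tapes:
    - V01001
    - V01002
```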
So what do we mean by "Integrate with the Helm charts of EOS"? Is this about integrating their k8s deployment [1] into ours, or about extending and refactoring our ctaeos? The former could have the advantage (to be confirmed) that there should already be an easy option to scale the number of FSTs and mount them to different disks on the host node, which is what we need for scaling tests later as well.
This is about integrating their Helm charts into ours so that we indeed don't need to manage our own ctaeos anymore. I've reworded it now to be a bit more clear
I have checked the boxes that the new Helm setup in !660 (merged) covers. The MR is still WIP (need to iron out a few wrinkles) but the vast majority of what I want to do is there. I will try to get all the CI tests to pass properly today and then tomorrow I will work on the procedure of doing the catalogue upgrade in this dev setup (not yet with the production clone).
The only thing I foresee I might still have to do is make all the subcharts in the CTA chart top-level charts so that it is easy to individually redeploy them.
Btw, I know it's a (too) large MR, but given the deadline I had to do quite a few things fast and did not have time to wait for individual MR merges. Anyway, I provided a description of what I did and would be happy to sit down and explain everything if someone is interested.
In general, I could do a presentation during a dev meeting at some point to highlight the new structure of everything; it should be a lot more understandable now that everything has been split up in a (hopefully) better way.
Any remarks on the MR are welcome; I won't merge it yet until we (I) have ironed out the wrinkles, so that will probably take a little while still.
Alright, the MR is ready for review: !660 (merged). It should pass all the tests, but I will link specific pipelines below to show that it passes both the normal pipeline and the no-Oracle pipeline.
@poliverc @afonso can I assign you to the MR? Again, I know it's a large MR; if you have any questions or if you want to sit down and walk through it together, please let me know.
I'm thinking a bit about how I did the tape server scalability and can't help but feel that the current approach with the range loop is rather hacky: it manually does what Kubernetes should be doing natively with e.g. deployments (which would also be more intuitive and easier to maintain/extend).
I have a few questions though to clarify:
Is there always a 1:1 ratio between the number of taped and rmcd processes? I.e. does every taped process need a corresponding rmcd process?
In our setup, is a tape server physically connected to its (and only its) drive(s) somehow? In other words, does it matter where the taped/rmcd processes run (my guess is that the answer is yes, but I don't know the details)? The reason I am asking has to do with the concept of "having a single tape server be responsible for multiple drives". I want to know whether we would have a "pod responsible for multiple drives" (which would require spawning additional containers within a pod) or whether each pod can only ever be responsible for a single drive (in which case pure replication is enough to manage all drives).
When spawning the CTA chart, you provide a library configuration containing the library device, drive names, tapes etc. (this is not different from before)
The tape servers are now handled by a statefulset. This ensures we have a stable DNS entry for each of them and we can map them easily to drives. Each tpsrv pod will automatically get the name tpsrv-xx.
Based on the drives provided in the library configuration, a configmap will be generated for each drive.
For each drive, we will then have a replica in the statefulset responsible for that particular drive. It will do so by mounting the configmap(s) of the drive corresponding to its index.
This essentially accomplishes the same thing as before, but it uses Kubernetes as it is meant to be used, makes it easy to redeploy tape servers, and it is trivial to change which drives should be used now (simply update the library configuration you pass in).
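Not the literal templates, but a condensed sketch of the mechanism described above; names are simplified and, for brevity, the per-drive configmaps are bundled into a single one here:

```yaml
# Sketch only -- the real chart differs in naming and generates one configmap per drive.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tpsrv
spec:
  serviceName: tpsrv                          # stable DNS: tpsrv-0, tpsrv-1, ...
  replicas: {{ len .Values.library.drives }}  # one replica per drive in the library config
  selector:
    matchLabels:
      app: tpsrv
  template:
    metadata:
      labels:
        app: tpsrv
    spec:
      containers:
        - name: taped
          image: "{{ .Values.image }}"
          command: ["/bin/sh", "-c"]
          # Each pod derives its ordinal from its own name and picks the matching
          # drive configuration before starting cta-taped (startup command simplified).
          args:
            - >
              ORDINAL=$(hostname) && ORDINAL=${ORDINAL##*-} &&
              cp /drive-conf/drive-${ORDINAL}.conf /etc/cta/cta-taped.conf &&
              exec /usr/bin/cta-taped
          volumeMounts:
            - name: drive-conf
              mountPath: /drive-conf
      volumes:
        - name: drive-conf
          configMap:
            name: tpsrv-drive-conf            # bundles drive-0.conf, drive-1.conf, ...
```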
Is there always a 1:1 ratio between the number of taped and rmcd processes? I.e. does every taped process need a corresponding rmcd process?
No, one rmcd per tape server. A tape server needs: 1 rmcd and [1-*] taped processes.
In our setup is a tape server physically connected to its (and only its) drive(s) somehow? [...]
Physically, they are connected via a QLogic HBA card: Drive <-> optic fibre <-> PCIe card with optic fibre ports. For the virtual setup we don't care that much. The taped process must be able to connect to the rmcd; if you look into the cta-taped.conf.example file you can see that taped needs the rmc port to contact the rmc daemon (rmcd).
"having a single tape server be responsible for might drives"
Imo, if we don't want to shuffle things too much, 1 pod must follow the same logic as a tape server: 1 rmcd container and 1 or many taped containers. As long as they connect to the same rmcd, that should be enough, although I will think about this again next week.
Okay thanks for clearing that up. Then the only limitation of the current implementation is that a tape server pod always has 1 rmcd process and 1 taped process.
To solve this we should probably be able to spawn multiple taped containers in each tpsrv pod, but then the mapping of drives to pods gets a bit more complex. I.e. how do we cleanly specify/determine which drives a pod should be responsible for?
We could do this automatically if we say that all tpsrv pods get the same number of taped processes (minus some remainder for a given set of pods). For example, with 7 drives available and 2 taped processes per pod -> 4 pods, 3 of which will have 2 taped processes and 1 of which will have 1 taped process. That would work, but I'm not sure whether this is a valid constraint to have...
The more flexible alternative to this is to force the user to input a configuration with a list of drive sets. Each set of drives is then assigned to a particular pod (and 1 rmcd + x taped processes are within said pod, where x depends on how many drives were assigned).
I think this configuration is the way to go. I'm working on that now; it should actually simplify things a bit and make them more flexible and configurable (even allowing things like multiple library devices to be used).
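A rough sketch of what such a drive-set configuration could look like as values (key names purely illustrative); each entry becomes one tpsrv pod with one rmcd container and one taped container per listed drive:

```yaml
# Illustrative only -- the final key names and structure may differ.
tapeServers:
  - name: tpsrv01
    library: VLSTK10          # library device this pod talks to
    drives:
      - VDSTK01
      - VDSTK02               # -> pod tpsrv01: 1 rmcd + 2 taped containers
  - name: tpsrv02
    library: VLSTK10
    drives:
      - VDSTK03               # -> pod tpsrv02: 1 rmcd + 1 taped container
```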
Just for reference, below is a list of items that we should implement at one point or another to have a nice consistent setup. This is in no particular order, although the EOS chart has priority:
#960 (closed) Move from pod definitions to deployments/statefulsets
#1008 (closed) Improve naming consistency. Prefix all cta-related pods with cta- (ensure consistency with EOS chart and allows for easy distinction between cta pods and other pods running in the same namespace)
#933 (closed) Mount the init scripts as volumes in the relevant pods instead of baking them into the docker image
#1009 (closed) Move all of the cta/ subcharts into their own top-level charts. We keep cta as an umbrella chart, but the subcomponents can now also exist as their own charts
#1007 (closed) Remove split between registry and repository when specifying image
Less important
#1011 (closed) Define proper readiness probes instead of relying on *_READY files (see the sketch after this list)
#1012 (closed) Move the permission/ownership modifications of keytabs into init containers
Get rid of the init_pod.sh script:
Remove claimLogs volume. This requires an update to the monitoring chart in the stress test repo.
Move the setting of the kernel proc pattern to a Daemonset
Change the way in which the reverse DNS fix is done for xrootd
Stop running pods in privileged mode wherever possible. Requires the removal of the init_pod.sh script
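For the readiness-probe item above, the shape would be roughly the following (the probe command is a placeholder; the real check should probe whatever actually signals that the daemon is ready):

```yaml
# Sketch: replaces the *_READY marker files; the command below is a placeholder.
readinessProbe:
  exec:
    command: ["/bin/sh", "-c", "pgrep -x cta-taped > /dev/null"]
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 6
```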
After the EOS chart is complete, the dCache integration can also start.