CTA stress test deployment
Problem to solve
Definitions
Let's first agree on the definition of high rates and their purpose for stress testing the CTA software:
- minimum high rate = 4 times the current production peak (50 GB/s with 2 GB files = 25 Hz); the stress test must sustain 100 Hz throughput, which is equivalent to 200 GB/s in production with an average file size of 2 GB. For Run 4 we expect 125 GB/s (63 Hz), which means the new scheduler backend should scale to 250 Hz. In the real world it tends to be the other way around: file sizes are increased in order to fit into the limited metadata rates of DB-based scheduling systems like Rucio, Dirac etc.
- maximum high rate = the bottleneck of the system; we push until we see the backends reach their limit. This is the rate at which we could theoretically do repack operations.
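For reference, a minimal sketch of the arithmetic behind these targets; the 4x safety factor and the 2 GB average file size are the assumptions stated above:

```python
def required_test_rate_hz(throughput_gb_per_s: float, avg_file_size_gb: float,
                          safety_factor: float = 4.0) -> float:
    """Sustained file rate (Hz) the stress test must reach for a given production throughput."""
    production_rate_hz = throughput_gb_per_s / avg_file_size_gb
    return safety_factor * production_rate_hz

print(required_test_rate_hz(50, 2))   # current peak: 25 Hz in production -> 100 Hz test target
print(required_test_rate_hz(125, 2))  # Run 4 estimate: ~63 Hz in production -> 250 Hz test target
```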
The Problem
The current CTA stress test deployment, based on the CI minikube setup and running on 1 real HW machine, struggles to sustainably probe these two high-rate limits due to bottlenecks that are not in the CTA software itself. In particular, at this moment this is MHVTL not working as expected in the stress test on Alma 9.
In addition, the current deployment is not easily scalable, as we cannot add or remove system parts to increase the various loads coming from the CTA subsystems (EOS instances queueing, reading, writing, tape servers writing and reporting, etc.). As of now, everything runs encapsulated in 1 minikube, which comes with the following dangers:
- the CTA disk reporter running in 1 minikube cannot contact an EOS instance running in another minikube on another machine (services are not exposed to the outside network)
- EOS, MHVTL and CTA parts all use the resources of that 1 machine only (lots of IO in addition to our demand for the highest rate possible)
- it does not reflect the current production-like deployment where these services are separated
Stakeholders
Everyone will benefit.
Proposal
Gathering more information about the current state of the deployment
In particular we need:
- New RPMs for MHVTL compiled from HEAD and running properly on Alma 9.
- Understand the MHVTL slowness and why the drive speed decreases over time (currently partially hidden by running with the in-memory backend).
- Reflect on the long-term support of this project, as we rely on it heavily for testing:
- forking the repo and maintaining it ourselves
- other VTL repos out there which are more alive (with releases published regularly)
- our main dev solution for VTL
- anything else?
Solutions for scaling the current deployment:
Multi-machine setup (VMs) with a minikube per machine
- Advantages:
  - increases the overall pressure on the CTA scheduling
  - if the HW IO does not become a bottleneck, this could be good enough for now
- Problem: disk reporters do not see other EOS instances for job reporting.
- Possible solutions:
  - expose the EOS service by configuring a LoadBalancer/NodePort/similar service per minikube and opening the appropriate port per machine; this comes naturally with a reconfiguration of all the minikube EOS services (a minimal NodePort sketch follows this list)
  - constrained reporting (we could configure a separate root entry for the objectstore, we can be creative...); all in all this is a change in the CTA scheduling codebase and makes disk reporters report only jobs from their logical library or tape pool to the EOS service within their minikube. This code change would take us further from the production codebase and could make the comparison between the CTA Scheduler Backends unfair. For example, the fact that the Objectstore does not have an easy way to fetch jobs/reports/objects based on multiple indices could make the logical library selection an optimisation challenge in a part of the code we do not want to develop further. In addition, what we are facing is originally a test deployment problem, which should be addressed rather than changing the codebase.
  - anything else?
- Dangers:
  - there could still be IO contention since all the services run on the same machine (EOS, CTA, MHVTL) and often use the same disk and memory
  - it does not allow us to easily scale the various system services separately if needed (Helm might help, but everything would still run within the same VM and would make the IO problem even worse)
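To make the service exposure option above concrete, below is a minimal NodePort sketch using the Kubernetes Python client. The service name, pod selector, namespace and node port are assumptions for illustration only (1094 is the usual XRootD port of the EOS MGM); the actual CI configuration may differ, and a LoadBalancer-type service would be configured analogously where an external IP pool is available.

```python
from kubernetes import client, config

config.load_kube_config()  # kubeconfig of the minikube instance hosting the EOS service
v1 = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="eos-mgm-external"),  # hypothetical service name
    spec=client.V1ServiceSpec(
        type="NodePort",
        selector={"app": "eos-mgm"},  # hypothetical label of the EOS MGM pod
        ports=[client.V1ServicePort(
            port=1094,         # XRootD port of the EOS MGM
            target_port=1094,
            node_port=31094,   # must fall within the NodePort range (default 30000-32767)
        )],
    ),
)
v1.create_namespaced_service(namespace="default", body=service)
```

The corresponding port on the host machine then needs to be opened in the firewall so that disk reporters in the other minikubes can reach the node IP directly.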
Proper Kubernetes cluster
- Advantages:
  - services separated per pod, running on several VMs, no IO contention
  - highly scalable per system sub-service
- Problem nr. 1: lots of work, we do not have such a setup yet
- Possible solutions:
  - longer-term dev effort; we are on a good track with the HELM charts being introduced
  - as of now, we could increase the number of tape servers in the minikube using our current HELM charts; if the HW IO can sustain it, this could be a medium-term solution which is good enough and remains useful later once we have the Kubernetes cluster setup (see the sketch after this list)
- Problem nr. 2: unknown behaviour of MHVTL in such a setup
- Possible solutions:
  - evaluate MHVTL in such a setup (once we manage to have it)
  - remove MHVTL from the cluster and run the cluster with MHVTL outside of Kubernetes
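As a sketch of the interim tape-server scaling mentioned above: the release name, chart path and values key below are purely hypothetical (our actual HELM charts may expose the replica count differently, or not yet at all); the point is only that the pod count would be bumped through a normal `helm upgrade`.

```python
import subprocess

# Hypothetical release name, chart path and values key; adjust to the actual
# structure of our HELM charts. This bumps the number of tape-server pods while
# keeping all other deployed values unchanged.
subprocess.run(
    [
        "helm", "upgrade", "cta", "./charts/cta",
        "--reuse-values",
        "--set", "tpsrv.replicaCount=4",
    ],
    check=True,
)
```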
Proposed immediate term action plan (before CHEP)
- produce MHVTL RPMs and run the minikube stress test on both backends to see if we manage to hit the CTA SW bottleneck or stay limited by the drive speed
- investigate how to use the media type NULL feature and, if the throughput increases, use it for the stress tests; result: not very useful, throughput still slow
- solve the problem of not being able to mount SSD resources within the podman pod inside minikube (a hostPath sketch follows this plan)
  - mount the EOS data disk and quarkdb to separate SSDs
  - solved
- try to increase the number of tape servers in the current minikube setup
  - if the previous point does not work due to high IO on the underlying HW, we go for the multi-machine setup with EOS service exposure
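For completeness, a minimal sketch of the separate SSD mounts mentioned in the plan above, expressed with the Kubernetes Python client; all names, images and paths are hypothetical and only illustrate EOS data and quarkdb each sitting on their own SSD via separate hostPath volumes (the actual CI pod definitions differ).

```python
from kubernetes import client

# Hypothetical host SSD mount points and container paths.
volumes = [
    client.V1Volume(name="eos-data",
                    host_path=client.V1HostPathVolumeSource(path="/mnt/ssd1/eos-data")),
    client.V1Volume(name="quarkdb",
                    host_path=client.V1HostPathVolumeSource(path="/mnt/ssd2/quarkdb")),
]
volume_mounts = [
    client.V1VolumeMount(name="eos-data", mount_path="/fst-data"),
    client.V1VolumeMount(name="quarkdb", mount_path="/var/lib/quarkdb"),
]

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="eos-mgm"),  # hypothetical pod name
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="eos", image="eos-all",  # hypothetical image
                                       volume_mounts=volume_mounts)],
        volumes=volumes,
    ),
)
print(pod.spec.volumes)  # inspect the generated volume definitions
```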