Alma9 runner based stress tests
Description
This ticket covers most of the issues encountered while restoring stress tests on a specialised Alma9 runner, excluding the CI code kept in the CTA repo.
Runner installation
We could never keep Puppet running on the runner, because the Puppet-managed firewall interferes with Kubernetes requirements. Until now it was at least fine to install a box with Puppet and then disable the Puppet agent.
This is no longer enough: Alma9 reconfigures some subsystems more aggressively than CentOS7, and a single Puppet agent run is enough to prevent the runner from running Kubernetes.
Machines now have to be installed from scratch, outside Puppet.
Kernel installation
Because of #743 we have to stick with kernel 5.14.0-362.* and use these workarounds to stay on Alma 9.3 as much as possible.
This is likely to be a hidden time bomb: the base system is frozen while other key components keep evolving and will eventually require updates.
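Since the kernel constraint above is easy to break with an unattended update, a small check run on the runner can flag drift early. The following is a minimal sketch, not part of the existing CI code: the helper name `kernel_is_pinned` is hypothetical, and only the pattern `5.14.0-362.*` comes from this ticket.

```python
import fnmatch
import platform

# Kernel series quoted in this ticket; everything else here is illustrative.
PINNED_KERNEL_PATTERN = "5.14.0-362.*"


def kernel_is_pinned(release: str, pattern: str = PINNED_KERNEL_PATTERN) -> bool:
    """Return True if a kernel release string belongs to the pinned series."""
    # fnmatch treats '.' literally and '*' as "any characters".
    return fnmatch.fnmatchcase(release, pattern)


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    status = "OK" if kernel_is_pinned(release) else "NOT in the pinned series"
    print(f"running kernel {release}: {status}")
```

Such a check only reports drift. It does not replace the workarounds that keep the box on Alma 9.3.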
MHVTL performance
Initial stress tests were affected by MHVTL's very inefficient use of the underlying disk:
- Writing 1 kB tape files at 60 Hz resulted in 200 MB/s of writes to disk (a write amplification of the order of 2k, which cannot come from file headers on tape...)
- 500k 1 kB tape files require over 2 GB of MHVTL disk backend storage
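The ratios behind the two observations above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: decimal units (1 kB = 1000 bytes) and only the 1 kB file bodies counted as payload. What is counted as payload changes the exact ratio, so the result should be read as an order of magnitude.

```python
KB = 1000
MB = 1000 * KB
GB = 1000 * MB


def amplification(payload_bytes: float, observed_bytes: float) -> float:
    """Ratio of bytes seen on the disk backend to bytes of useful payload."""
    return observed_bytes / payload_bytes


# Throughput: 60 files/s of 1 kB each, versus 200 MB/s observed on disk.
payload_rate = 60 * 1 * KB
write_amp = amplification(payload_rate, 200 * MB)

# Space: 500k files of 1 kB each, versus at least 2 GB of backend storage.
payload_size = 500_000 * 1 * KB
space_amp = amplification(payload_size, 2 * GB)

print(f"write amplification ~ {write_amp:.0f}x")
print(f"space amplification >= {space_amp:.0f}x")
```

With these assumptions the write amplification is in the low thousands while the space overhead is only about 4x, so most of the extra writes do not end up as extra stored data.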
Pending further investigation, it looks like MHVTL is writing much more than the files themselves: it is likely rewriting at least part of the MHVTL tape file on each write, until the tape is full. This would explain a big part of the slowdown when writing to tape in the stress test: the fuller the MHVTL tape files are, the bigger the write amplification and the slower files get written to virtual tapes.
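To follow how the amplification evolves as tapes fill up, the bytes actually written to the backend device can be sampled from `/proc/diskstats` during a run. The sketch below is one possible way to do this and is not the method used for the numbers in this ticket. The function names and the device name in the usage note are hypothetical. It relies on the documented layout of `/proc/diskstats`, where the tenth whitespace-separated field is the number of 512-byte sectors written.

```python
import time

# /proc/diskstats reports sectors in 512-byte units, whatever the device uses.
SECTOR_BYTES = 512


def bytes_written(diskstats_text: str, device: str) -> int:
    """Extract the cumulative bytes written for one device from diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, reads, reads merged, sectors read,
        #         ms reading, writes, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise KeyError(f"device {device!r} not found in diskstats")


def read_diskstats() -> str:
    with open("/proc/diskstats") as f:
        return f.read()


def measure_amplification(device: str, payload_bytes_per_s: float,
                          interval_s: float = 10.0) -> float:
    """Sample the device twice and compare its write rate to the payload rate."""
    before = bytes_written(read_diskstats(), device)
    time.sleep(interval_s)
    after = bytes_written(read_diskstats(), device)
    return ((after - before) / interval_s) / payload_bytes_per_s
```

For example, `measure_amplification("sdb", 60 * 1000)` would compare the write rate of a hypothetical backend device `sdb` to a 60 Hz stream of 1 kB files. Repeating the measurement at different tape fill levels would show whether the ratio grows as suspected; it does not by itself identify where the extra writes come from.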
In-memory MHVTL tape files offer flat tape write performance: 0 to 2M files written to MHVTL tapes in memory:
Disk-backed MHVTL tape files cause performance to degrade to a crawl (unusable for stress tests): 0 to 1.2M files written to MHVTL tapes on SSD: