Magnum leaks certificates from deleted clusters
When a cluster is deleted, its certificates, stored in barbican (or in the magnum DB, as we do in TN), must be deleted too. For most deletions, magnum does not delete the cluster's two certificates (one for the etcd CA and one for the API).
I believe this is due to a race condition in the conductor.
In magnum GPN, where we use barbican, the effect shows up in the barbican DB: all the certificates remain there, stale and NOT soft deleted. In magnum TN the certs pollute the magnum DB.
For example in the TN:
select count(id) from cluster;
+-----------+
| count(id) |
+-----------+
| 27 |
+-----------+
select count(x509keypair.uuid) from x509keypair join cluster on x509keypair.uuid=cluster.ca_cert_ref OR x509keypair.uuid=magnum_cert_ref;
+-------------------------+
| count(x509keypair.uuid) |
+-------------------------+
| 54 |
+-------------------------+
But for the leaked certs:
select x509keypair.uuid as x509keypair_uuid, cluster.ca_cert_ref as cluster_ca_cert_ref, cluster.magnum_cert_ref as cluster_magnum_cert_ref from x509keypair left outer join cluster on x509keypair.uuid=cluster.ca_cert_ref OR x509keypair.uuid=magnum_cert_ref WHERE cluster.ca_cert_ref IS NULL OR cluster.magnum_cert_ref IS NULL;
...
| fd22be7b-d4b8-4ab8-a4ad-01ca157da82c | NULL | NULL |
| fd69e44d-b518-4621-a30f-127835b3a4ae | NULL | NULL |
| ffa3f201-2fa2-409b-98db-4427949ff6b8 | NULL | NULL |
+--------------------------------------+---------------------+-------------------------+
330 rows in set (0.00 sec)
In magnum GPN the total number of certs is on the order of 3 million, while we have fewer than 1000 clusters. The barbican DB is 40GB, and 99.99% of it is magnum's leaked certs.
We need to clean them up and put something in place that stops the DB from growing indefinitely.
Solution for TN: clean up the magnum DB regularly with a cronjob in kops-openstack. Solution for GPN: import the certificates into the magnum DB and drop everything in the barbican DB. One less dependency for magnum.
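A minimal sketch of what the TN cleanup job could run, under stated assumptions. It uses an in-memory SQLite database as a stand-in for the real magnum DB, with only the columns that appear in the queries above (cluster.ca_cert_ref, cluster.magnum_cert_ref, x509keypair.uuid). The function names and the demo data are hypothetical, and the DELETE is the same anti-join as the leaked-certs query, not something taken from magnum itself.

```python
import sqlite3


def build_demo_db():
    """Create a toy schema (assumed, reduced to the columns used above) with demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cluster (
            id INTEGER PRIMARY KEY,
            ca_cert_ref TEXT,
            magnum_cert_ref TEXT
        );
        CREATE TABLE x509keypair (uuid TEXT PRIMARY KEY);
        """
    )
    # Two live clusters, each referencing two certificates.
    conn.executemany(
        "INSERT INTO cluster (id, ca_cert_ref, magnum_cert_ref) VALUES (?, ?, ?)",
        [(1, "ca-1", "magnum-1"), (2, "ca-2", "magnum-2")],
    )
    # Four referenced certs plus three leaked ones left by deleted clusters.
    conn.executemany(
        "INSERT INTO x509keypair (uuid) VALUES (?)",
        [("ca-1",), ("magnum-1",), ("ca-2",), ("magnum-2",),
         ("leaked-1",), ("leaked-2",), ("leaked-3",)],
    )
    conn.commit()
    return conn


def delete_leaked_certs(conn):
    """Delete certs that no cluster references; return how many rows were removed."""
    cur = conn.execute(
        """
        DELETE FROM x509keypair
        WHERE NOT EXISTS (
            SELECT 1 FROM cluster
            WHERE cluster.ca_cert_ref = x509keypair.uuid
               OR cluster.magnum_cert_ref = x509keypair.uuid
        )
        """
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    demo = build_demo_db()
    print("leaked certs removed:", delete_leaked_certs(demo))
```

If the leak really is caused by a race in the conductor, a real job would probably also need to avoid deleting certs that belong to a cluster still being created, for example by skipping recently created rows. That guard is left out here because it depends on columns and timing not shown in this report.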