Notes regarding scaling to many webeos sites

Moved from #10 (closed)

We're using the operator-sdk. Behind the scenes, the SDK's manager library maintains an in-memory cache (an Informer) of all the webeosSite CRs. The reconcile logic only takes into account the ones that match certain labels, but all of them are still processed and kept in memory by the SDK.
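
To make the distinction concrete, here is a hypothetical sketch of such a reconciler-level filter; the CR type, import path, label key and field names are made up, and current controller-runtime signatures are assumed:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	webeosv1 "example.com/webeos-operator/api/v1" // hypothetical import path
)

// WebeosSiteReconciler reconciles the webeosSite CRs assigned to one
// operator instance, identified by a label value.
type WebeosSiteReconciler struct {
	client.Client
	PoolLabelValue string // the label value this instance is responsible for
}

func (r *WebeosSiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	site := &webeosv1.WebeosSite{}
	if err := r.Get(ctx, req.NamespacedName, site); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Client-side filter: non-matching CRs are dropped here, but by this
	// point the SDK has already watched, delivered, and cached every one.
	if site.Labels["webeos.cern.ch/pool"] != r.PoolLabelValue { // hypothetical label key
		return ctrl.Result{}, nil
	}
	// ... actual reconciliation of the site ...
	return ctrl.Result{}, nil
}
```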

However, we're still likely to face scaling issues with this solution: each instance of the config-operator watches and receives events for all CRs (regardless of their labels) and keeps a local in-memory cache of all of them. With dozens of webeos pods and thousands of CRs, the load on the Kubernetes API server and the memory usage of the config-operator may become unacceptably high.

So a long-term, scalable solution needs to set up the watch in such a way that:

  1. the Kubernetes API sends events only for the CRs that match our label
  2. we keep an in-memory cache of only the CRs that match our label, or no cache at all (see the sketch after this list)
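
For illustration, recent controller-runtime releases (v0.15 and later) expose a cache-wide label selector that achieves both points, since the selector is passed to the API server. A minimal sketch, where the label key and value are assumptions:

```go
package main

import (
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Only list/watch objects carrying our label: the selector is applied
	// server-side, so non-matching CRs generate no events and are never
	// cached in memory.
	sel := labels.SelectorFromSet(labels.Set{"webeos.cern.ch/pool": "pool-1"}) // hypothetical label

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{DefaultLabelSelector: sel},
	})
	if err != nil {
		panic(err)
	}

	// ... register controllers with mgr, then:
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```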

Operator-sdk might eventually provide a solution, as per the upstream issue. This patch actually solves our problem but was not accepted upstream, because it prevents access to any CR that does not match the label. That is not a problem for us, though - it's exactly what we want!

If no solution comes from the upstream library, then we can use lower-level libraries instead of the operator-sdk's manager library we use now. A first starting point is this patch for the upstream issue, which solves our problem exactly. Another starting point:
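
As a sketch of what the lower-level route could look like, client-go's dynamic informers accept a server-side label selector directly, so only matching CRs are delivered and cached; the label, group/version and resource name below are assumptions:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; use clientcmd instead for local development.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Filtered informer factory: the label selector is sent to the API
	// server, so events arrive (and are cached) only for matching CRs.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		dyn, 10*time.Minute, metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			opts.LabelSelector = "webeos.cern.ch/pool=pool-1" // hypothetical label
		},
	)

	// Hypothetical GVR for the webeosSite CRD.
	gvr := schema.GroupVersionResource{Group: "webeos.cern.ch", Version: "v1", Resource: "webeossites"}
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("site added") },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("site updated") },
		DeleteFunc: func(obj interface{}) { fmt.Println("site deleted") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // block forever; real code would handle termination signals
}
```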