Add saturation metric for cta-frontend

At the moment we have the following metrics for the cta-frontend:

  • cta.frontend.request.duration: measures the duration of any incoming request
    • Covers workflow events and admin commands
    • Also includes error rates: the error_type label is added on errors
  • Any central metrics such as catalogue and scheduler rates are already included.

However, to better understand the health of the frontend, we should also have some information on its saturation, i.e. how busy the frontend is.

An initial idea could be to expose the number of active threads. However, the granularity is determined by the export interval, so we would lose a lot of information. E.g. with an export interval of 15 seconds, a pattern of no activity, then a giant 10-second spike, then no activity again could fall entirely between two exports and never show up.
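To illustrate the sampling problem, here is a minimal sketch that simulates exporting an active-thread gauge at fixed ticks. The function names, the 8 busy threads and the exact spike timing are hypothetical values chosen for illustration; they are not taken from the frontend.

```python
def sampled_gauge(busy_intervals, export_interval, horizon):
    """Return the active-thread count observed at each export tick.

    busy_intervals: list of (start, end) tuples in seconds, one per busy thread.
    """
    samples = []
    t = export_interval
    while t <= horizon:
        # A thread counts as active at time t if t falls inside its busy interval.
        active = sum(1 for start, end in busy_intervals if start <= t < end)
        samples.append(active)
        t += export_interval
    return samples


def busy_seconds(busy_intervals):
    """Total thread-seconds of work, which is what a duration sum would capture."""
    return sum(end - start for start, end in busy_intervals)


# Hypothetical scenario: 8 threads are busy from t=2s to t=12s, exports every 15s.
spike = [(2, 12)] * 8
samples = sampled_gauge(spike, export_interval=15, horizon=30)
total = busy_seconds(spike)

print(samples)  # the gauge values seen at t=15s and t=30s
print(total)    # thread-seconds of work done during the spike
```

In this scenario the gauge reads zero at both export ticks even though 80 thread-seconds of work were done, whereas a cumulative duration sum would still account for that work.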

A better approach is to make use of our existing duration metric. Using cta_frontend_request_duration_sum, we get the sum of all request durations in a given time window. If we know the maximum number of threads, we know how much time we could hypothetically have spent in that window (window length times number of threads). Dividing the first by the second gives us an idea of the saturation.

To do this, we will have to emit one new metric: the thread pool size (which shouldn't change over the lifetime of the process).
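The calculation described above can be sketched as follows. This is only an illustration, not the actual dashboard query: the function name and the sample values are hypothetical, and it assumes the duration sum is recorded in seconds and behaves like a cumulative counter that restarts from zero when the process restarts.

```python
def saturation(sum_start, sum_end, window_seconds, pool_size):
    """Estimate frontend saturation over one time window.

    sum_start, sum_end: values of the cumulative duration sum at the start
        and end of the window (assumed to be in seconds).
    window_seconds: length of the window in seconds.
    pool_size: value of the new thread pool size metric.
    """
    if window_seconds <= 0 or pool_size <= 0:
        raise ValueError("window_seconds and pool_size must be positive")
    if sum_end < sum_start:
        # Assumption: a decrease means the process restarted and the sum
        # started again from zero, so only sum_end is attributable to the window.
        busy = sum_end
    else:
        busy = sum_end - sum_start
    # Time we could hypothetically have spent: every thread busy for the whole window.
    capacity = window_seconds * pool_size
    return busy / capacity


# Hypothetical numbers: the sum grew from 100s to 180s during a 15s window, 8 threads.
print(saturation(100.0, 180.0, window_seconds=15, pool_size=8))
```

One caveat with this approach: a request's duration is presumably only added to the sum when the request completes, so long-running requests are attributed to the window in which they finish, and the ratio for a single short window can exceed 1. Averaging over a longer window should smooth this out.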


@kskovola it would be good if you could think about other potential metrics that would be useful for the frontend. If just adding the saturation is sufficient that is fine as well; we shouldn't be adding metrics just for the sake of it :)