Add telemetry to CTA
Several production and operations use cases drive the need for telemetry: we need concrete data to better understand the problems we are encountering.
Telemetry will be implemented with OpenTelemetry, which is a common industry standard and fits our needs nicely. Each service exports its metrics to a collector, and the collector exposes an endpoint from which Prometheus scrapes the metrics.
As per the documentation, OpenTelemetry itself recommends the use of this collector as it is vendor agnostic. Additionally, this "allows your service to offload data quickly and the collector can take care of additional handling like retries, batching, encryption or even sensitive data filtering."
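To make the scrape step concrete, here is a rough sketch of what a single counter looks like in the Prometheus text exposition format, which is what Prometheus reads from a scrape endpoint. In the setup described above, the collector produces this output, not CTA itself. The helper `renderCounter`, the metric name and the `service` label are purely illustrative assumptions and not part of any existing CTA or OpenTelemetry API.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Illustrative only: renders one counter sample in the Prometheus text
// exposition format. Assumes label values need no escaping (no quotes,
// backslashes or newlines).
std::string renderCounter(const std::string& name,
                          const std::string& help,
                          const std::map<std::string, std::string>& labels,
                          std::uint64_t value) {
  std::ostringstream out;
  out << "# HELP " << name << ' ' << help << '\n';
  out << "# TYPE " << name << " counter\n";
  out << name;
  if (!labels.empty()) {
    out << '{';
    bool first = true;
    for (const auto& label : labels) {
      if (!first) {
        out << ',';
      }
      out << label.first << "=\"" << label.second << '"';
      first = false;
    }
    out << '}';
  }
  out << ' ' << value << '\n';
  return out.str();
}
```

Calling `renderCounter("cta_frontend_requests_total", "Requests handled by the frontend.", {{"service", "cta-frontend"}}, 3)` yields a `# HELP` line, a `# TYPE ... counter` line, and one sample line carrying the label and the value.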
The plan of attack is as follows:
- Ensure we can compile CTA with OpenTelemetry
- Add a minimal metric to the frontend, e.g. a simple counter
- Be able to spawn the collector and Prometheus in CI
- Be able to view the visualization of the metrics
Two very important design principles:
- Telemetry is disabled by default. Upgrading a production deployment from a version without telemetry to a version with telemetry should change absolutely nothing; telemetry is only enabled once the config file is updated to enable it.
  - The only exception is that the opentelemetry RPMs must be available in order to run CTA.
- Telemetry should not throw exceptions that cause CTA to crash. Telemetry is an optional source of extra information, so it should not negatively affect the running service.
  - This seems to be (mostly) guaranteed by opentelemetry-cpp: https://opentelemetry.io/docs/specs/otel/error-handling/#basic-error-handling-principles
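As a rough illustration of the two principles above, the sketch below wraps telemetry calls so that they are skipped unless explicitly enabled and can never propagate an exception into the service. `TelemetryConfig` and `TelemetryGuard` are hypothetical names invented for this example; the actual CTA config keys and the way opentelemetry-cpp is wired in are not decided by this ticket.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical config: telemetry stays off unless the config file turns it on.
struct TelemetryConfig {
  bool enabled = false;
};

// Hypothetical wrapper around any telemetry call.
class TelemetryGuard {
public:
  explicit TelemetryGuard(const TelemetryConfig& config) noexcept
      : m_enabled(config.enabled) {}

  // Runs `action` only when telemetry is enabled.
  // Returns true if the action ran and completed without throwing.
  bool record(const std::function<void()>& action) const noexcept {
    if (!m_enabled) {
      return false;
    }
    try {
      action();
      return true;
    } catch (const std::exception& ex) {
      logFailure(ex.what());
    } catch (...) {
      logFailure("unknown error");
    }
    return false;
  }

private:
  // Logging itself must not throw either.
  static void logFailure(const char* what) noexcept {
    try {
      std::cerr << "telemetry error (ignored): " << what << '\n';
    } catch (...) {
    }
  }

  bool m_enabled;
};
```

With a default-constructed `TelemetryConfig`, `record` is a no-op that returns `false`. With `enabled = true`, a throwing action is logged and swallowed, so the caller continues as if telemetry were absent.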
For reference, a previous ticket on OpenTelemetry can be found in #266 (closed). We will use this ticket for further discussion. Before the corresponding MR is merged, the documentation should also be updated accordingly.