metrics: implement two-layer (disk/mem) storage
This changeset exposes a new ephemeral volume (sketched after the list below) to store Prometheus-scraped metrics when the in-memory storage gets full. The goal is to:
- Prevent the pod from being killed by the controller if it fails to flush metrics to the outputs for too long (memory limit exceeded).
- Allow the ingestion component to recover from outages by keeping some metrics on disk so they can be sent later on.
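As a rough illustration of what the volume could look like in the pod spec. This is a sketch only, assuming an emptyDir volume capped with sizeLimit; the container name, volume name, and mount path are placeholders, not necessarily what this changeset uses:

```yaml
# Sketch only: assumes an emptyDir ephemeral volume capped at 5 GiB,
# mounted where Fluent Bit's filesystem buffer (storage.path) points.
spec:
  containers:
    - name: fluent-bit                       # placeholder container name
      volumeMounts:
        - name: metrics-buffer               # placeholder volume name
          mountPath: /var/fluent-bit/buffer  # placeholder path
  volumes:
    - name: metrics-buffer
      emptyDir:
        sizeLimit: 5Gi                       # matches the 5 GiB default mentioned below
```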
The configuration sets a hard limit for memory storage (max_chunks_up) that is slightly lower than the pod's hard memory limit. A 5 GiB on-disk storage is configured by default as well.
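Roughly, the relevant Fluent Bit settings look like the sketch below (classic config syntax). The values, paths, and the choice of input plugin are illustrative assumptions, not the exact ones from this changeset:

```
[SERVICE]
    # Directory backing the on-disk buffer (the ephemeral volume mount).
    storage.path              /var/fluent-bit/buffer
    # Max number of chunks kept "up" in memory; extra chunks stay on disk.
    # Chosen so the memory footprint stays below the pod's memory limit.
    storage.max_chunks_up     128
    # Cap on memory used when replaying backlog chunks found on disk at startup.
    storage.backlog.mem_limit 64M

[INPUT]
    # Assumed input for the sketch; the actual ingestion component may differ.
    Name            prometheus_scrape
    Host            127.0.0.1
    Port            9100
    # Buffer this input through the filesystem instead of memory only.
    storage.type    filesystem
```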
Once the memory and the disk storage are both full, Fluent Bit behaves as follows:
> If one of the destinations reaches the configured storage.total_limit_size, the oldest Chunk from its queue for that logical output destination will be discarded to make room for new data.
So metrics will be lost at that point, but the pod (and the logs!) should always stay alive.
https://docs.fluentbit.io/manual/administration/buffering-and-storage#buffering-and-memory
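For completeness, the per-destination cap that triggers this discard behavior is set on each output. A hedged sketch only; the output plugin, endpoint, and size are illustrative:

```
[OUTPUT]
    # Assumed output for the sketch; the real destination may differ.
    Name                      prometheus_remote_write
    Match                     *
    Host                      metrics.example.internal
    Port                      443
    # Once this destination's buffered data reaches the limit, Fluent Bit
    # drops the oldest chunk in its queue to make room for new data.
    storage.total_limit_size  5G
```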