Commit 757c3376 authored by Nacho Barrientos, committed by Guillermo Facundo Colunga

[NOTICKET] metrics: enable fluentbit local buffer to avoid metrics loss


When Fluent Bit loses its connection to the metrics endpoint (OpenTelemetry), unsent metrics are dropped and permanently lost. This commit enables Fluent Bit's local buffering feature, so metrics are stored temporarily in a local buffer. When the connection is reestablished, the buffered metrics are retried and sent to the endpoint. This improves reliability by reducing metric loss during intermittent connection issues.

Proposed-by: Nacho Barrientos <nacho.barrientos@cern.ch>
parent e2e91101
2 merge requests: !22 Qa->Master, !18 metrics: implement two layer (disk/mem) storage
Pipeline #8253492 passed
apiVersion: v2
name: cern-it-kubernetes-monitoring
type: application
-version: 0.1.3
+version: 0.2.0
kubeVersion: ">=1.27.0-0"
description: Helm Chart provided by IT Monitoring Service to install and configure required components to gather and send monitoring data from kubernetes clusters to central service.
home: https://cern.ch/monitoring
......@@ -38,6 +38,7 @@ At `docs/installation_guide.md` you will find the initial setup and installation
| logs.fluentbit.service | string | Daemon mode off listening on port 2020. See `values.yaml`. | fluentbit service configuration options in a multiline string |
| metrics.enabled | bool | `true` | indicates if all metrics components should be enabled or not. If set to false no metrics component will be installed nor configured |
| metrics.fluentbit.enable | bool | `true` | if true fluentbit daemon set will be installed |
| metrics.fluentbit.diskMaxCache | string | `5G` | max size for in-disk storage for fluent-bit |
| metrics.fluentbit.nodeSelector | hash | `"nil"` | fluentbit statefulset node selectors |
| metrics.fluentbit.filters | string | `"nil"` | fluentbit filters as a yaml list in a multiline string |
| metrics.fluentbit.inputs | string | Configuration to scrape local prometheus. See `values.yaml`. | fluentbit inputs as a yaml list in a multiline string |
......@@ -45,7 +46,7 @@ At `docs/installation_guide.md` you will find the initial setup and installation
| metrics.fluentbit.prometheusScrapeBufferMaxSize | string | `"100M"` | fluentbit buffer size. The more metrics to send the bigger needs to be |
| metrics.fluentbit.prometheusScrapeInterval | string | `"60s"` | interval used by fluentbit to scrape metrics from prometheus |
| metrics.fluentbit.resources.limits.cpu | string | `"1"` | |
-| metrics.fluentbit.resources.limits.memory | string | `"500Mi"` | |
+| metrics.fluentbit.resources.limits.memory | string | `"1Gi"` | |
| metrics.fluentbit.resources.requests.cpu | string | `"1"` | |
| metrics.fluentbit.resources.requests.memory | string | `"150Mi"` | |
| metrics.fluentbit.service | string | Daemon mode off listening on port 2020. See `values.yaml`. | fluentbit service configuration options in a multiline string |
......
......@@ -33,6 +33,8 @@ spec:
volumeMounts:
- name: config
mountPath: /fluent-bit/etc/conf
- name: fluentbit
mountPath: /flb-storage/
{{- if .Values.metrics.fluentbit.extraVolumeMounts }}
{{- toYaml .Values.metrics.fluentbit.extraVolumeMounts | nindent 6 }}
{{- end }}
......@@ -44,6 +46,9 @@ spec:
- name: config
configMap:
name: it-monit-metrics-collector-fluentbit
- name: fluentbit
emptyDir:
sizeLimit: {{ .Values.metrics.fluentbit.diskMaxCache }}
{{- if .Values.metrics.fluentbit.extraVolumes }}
{{- toYaml .Values.metrics.fluentbit.extraVolumes | nindent 4 }}
{{- end }}
......
......@@ -132,7 +132,10 @@ metrics:
prometheusScrapeInterval: "60s"
# -- fluentbit buffer size. The more metrics to send the bigger needs to be
prometheusScrapeBufferMaxSize: "100M"
# -- max size for in-disk storage for fluent-bit
diskMaxCache: "5G"
# -- fluentbit service configuration options in a multiline string
service: |
daemon: off
......@@ -142,6 +145,13 @@ metrics:
http_listen: 0.0.0.0
http_port: 2020
health_check: on
storage.path: /flb-storage/
storage.sync: normal
storage.checksum: off
storage.backlog.mem_limit: 5M
# sub (div .Values.metrics.fluentbit.resources.limits.memory 2097152) 5
# 2097152 is max chunk size in bytes || -5 to stay away from the pod memory limit
storage.max_chunks_up: 507
# -- fluentbit inputs as a yaml list in a multiline string
inputs: |
......@@ -149,6 +159,7 @@ metrics:
tag: monit.prom.k8s
host: prometheus-operated.{{ .Release.Namespace }}.svc.cluster.local
port: 9090
storage.type: filesystem
scrape_interval: {{ .Values.metrics.fluentbit.prometheusScrapeInterval }}
metrics_path: /federate?{{ .Values.metrics.fluentbit.matchQuery }}
buffer_max_size: {{ .Values.metrics.fluentbit.prometheusScrapeBufferMaxSize }}
......@@ -169,6 +180,7 @@ metrics:
tls.verify: off
http_user: {{ .Values.tenant.name }}
http_passwd: {{ .Values.tenant.password }}
storage.total_limit_size: {{ .Values.metrics.fluentbit.diskMaxCache }}
logs:
# -- indicates if logs metrics components should be enabled or not. If set to false no logs component will be installed nor configured
......
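
The hard-coded `storage.max_chunks_up: 507` in the `service` block can be reproduced from the formula in the accompanying comment. The following is a minimal Python sketch of that arithmetic, not part of the chart. The helper names (`quantity_to_bytes`, `max_chunks_up`) and the small suffix table are hypothetical and only cover the binary suffixes used in this diff.

```python
# Sketch of the formula in the values.yaml comment:
#   sub (div <memory limit in bytes> 2097152) 5
# All names below are illustrative and do not exist in the chart.

MAX_CHUNK_BYTES = 2_097_152   # max chunk size in bytes, as stated in the comment
SAFETY_MARGIN_CHUNKS = 5      # chunks subtracted to stay away from the pod memory limit

# Subset of Kubernetes binary quantity suffixes (assumption: only these are needed).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '1Gi' or '500Mi' to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count without a suffix


def max_chunks_up(memory_limit: str) -> int:
    """Integer-divide the memory limit by the chunk size, then subtract the margin."""
    return quantity_to_bytes(memory_limit) // MAX_CHUNK_BYTES - SAFETY_MARGIN_CHUNKS


if __name__ == "__main__":
    print(max_chunks_up("1Gi"))    # new memory limit in this commit
    print(max_chunks_up("500Mi"))  # previous memory limit, for comparison
```

With the new `1Gi` limit this gives 512 - 5 = 507, matching the value in the diff. Because the value is hard-coded rather than templated, it presumably needs to be recomputed by hand if `metrics.fluentbit.resources.limits.memory` is overridden.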