
cern-it-monitoring-kubernetes

Helm chart provided by the IT Monitoring Service to install and configure the components required to gather monitoring data from Kubernetes clusters and send it to the central monitoring service.

Overview

The CERN IT Monitoring Kubernetes Helm Chart provides a solution for monitoring Kubernetes clusters at CERN. It collects metrics and logs (support for traces is planned) and forwards them to the central CERN monitoring infrastructure. From there, users can consume them with the day-to-day tools they already use, such as Grafana or OpenSearch.

This Helm chart simplifies the deployment and configuration of necessary components for observability, making it easier to manage monitoring across various Kubernetes clusters and their applications.

Quick Start

See getting started.
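
As a rough orientation only, a minimal values file might look like the sketch below. It assumes that the cluster name and the tenant credentials are the only settings without usable defaults, which is an inference from the Values table in this README rather than something the chart documentation states here. All values shown are placeholders.

```yaml
# Hypothetical minimal values file. The keys come from the Values table
# in this README; every value below is a placeholder.
kubernetes:
  # Attached to metrics and logs through the k8s_cluster_name label.
  clusterName: "my-cluster"

tenant:
  # Credentials for the MONIT infrastructure. The table documents the
  # password as plain text, so handle this file accordingly.
  name: "my-tenant"
  password: "change-me"
```

The getting started guide remains the reference for the actual installation procedure.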

Values

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| crds.enabled | bool | true | whether to install Prometheus operator's CRDs |
| fluentbit.image.imagePullPolicy | string | "IfNotPresent" | image pull policy applied to all Fluent Bit instances |
| fluentbit.image.repository | string | "registry.cern.ch/monit/cern-it-monitoring-fluent-bit" | image repository applied to all Fluent Bit instances |
| fluentbit.image.tag | string | "3.2.6" | image tag applied to all Fluent Bit instances |
| kubernetes.clusterName | string | "" | name of the Kubernetes cluster to monitor. This value is appended to every metric and log via the k8s_cluster_name label. It is required if fluentbit is enabled (the default) |
| logs.enabled | bool | false | indicates if the logs components should be enabled or not. If set to false, no logs component will be installed or configured |
| logs.fluentbit.customParsers | string | "" | |
| logs.fluentbit.enabled | bool | false | indicates if the fluentbit logs component should be installed or not |
| logs.fluentbit.extraVolumeMounts | list | [] | |
| logs.fluentbit.extraVolumes | list | [] | |
| logs.fluentbit.filters | string | partly autogenerated -- see values.yaml | fluentbit filters as a yaml list in a multiline string |
| logs.fluentbit.image.imagePullPolicy | string | "" | image pull policy for Fluent Bit (logs) |
| logs.fluentbit.image.repository | string | "" | repository to use for Fluent Bit (logs) |
| logs.fluentbit.image.tag | string | "" | tag to use for Fluent Bit (logs) |
| logs.fluentbit.inputs | string | partly autogenerated -- see values.yaml | fluentbit inputs as a yaml list in a multiline string |
| logs.fluentbit.luaScripts | object | {} | extra Lua scripts for user-provided transformations |
| logs.fluentbit.outputs | string | partly autogenerated -- see values.yaml | fluentbit outputs as a yaml list in a multiline string |
| logs.fluentbit.resources.limits.cpu | string | "20m" | |
| logs.fluentbit.resources.limits.memory | string | "25Mi" | |
| logs.fluentbit.resources.requests.cpu | string | "5m" | |
| logs.fluentbit.resources.requests.memory | string | "15Mi" | |
| logs.fluentbit.service | string | partly autogenerated -- see values.yaml | fluentbit service configuration options in a multiline string |
| metrics.alertmanager.enabled | bool | false | if true, alertmanager will be installed and prometheus reconfigured to use it as the alerting endpoint |
| metrics.alertmanager.image | string | "registry.cern.ch/monit/cern-it-monitoring-alertmanager" | alertmanager image used by the local cluster alertmanager |
| metrics.alertmanager.ingress.className | string | "" | class name to be used by the alertmanager ingress |
| metrics.alertmanager.ingress.enabled | bool | false | if set to true, an ingress will be created for the alertmanager service |
| metrics.alertmanager.ingress.hosts | list | [] | list of hosts for the alertmanager ingress |
| metrics.alertmanager.ingress.path | string | "/" | entry path for the alertmanager ingress |
| metrics.alertmanager.ingress.pathType | string | "ImplementationSpecific" | path type for the alertmanager ingress |
| metrics.alertmanager.ingress.tls | object | {} | tls configuration for the alertmanager ingress |
| metrics.alertmanager.nodeSelector | object | {} | node selector configuration for the alertmanager |
| metrics.alertmanager.pullPolicy | string | "IfNotPresent" | pull policy for the alertmanager image |
| metrics.alertmanager.replicas | int | 3 | number of replicas for the alertmanager deployment |
| metrics.alertmanager.tag | string | "v0.27.0" | alertmanager image tag to be used when pulling it |
| metrics.alertmanager.volumeMounts | list | [] | list of volumes to be mounted |
| metrics.alertmanager.volumes | list | [] | list of volumes to be declared |
| metrics.apiServer.serviceMonitor.relabelings | list | [] | |
| metrics.coredns.serviceMonitor.relabelings | list | [] | |
| metrics.defaultNodeSelector | object | {} | default node selector, applied when possible to the following components: metrics collectors (prometheus and fluentbit) and metrics exporters (kube state) |
| metrics.enabled | bool | true | indicates if all metrics components should be enabled or not. If set to false, no metrics component will be installed or configured |
| metrics.etcd.serviceMonitor.relabelings | list | [] | |
| metrics.fluentbit.diskMaxCache | string | "5G" | max size of the on-disk storage for fluent-bit |
| metrics.fluentbit.enabled | bool | true | if true, the fluentbit metrics forwarder will be installed |
| metrics.fluentbit.filters | string | partly autogenerated -- see values.yaml | fluentbit filters as a yaml list in a multiline string |
| metrics.fluentbit.image.imagePullPolicy | string | "" | image pull policy for Fluent Bit (metrics) |
| metrics.fluentbit.image.repository | string | "" | repository to use for Fluent Bit (metrics) |
| metrics.fluentbit.image.tag | string | "" | tag to use for Fluent Bit (metrics) |
| metrics.fluentbit.inputs | string | partly autogenerated -- see values.yaml | fluentbit inputs as a yaml list in a multiline string |
| metrics.fluentbit.luaScripts | object | {} | extra Lua scripts for user-provided transformations |
| metrics.fluentbit.nodeSelector | object | {} | |
| metrics.fluentbit.outputs | string | partly autogenerated -- see values.yaml | fluentbit outputs as a yaml list in a multiline string |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.bufferChunkSize | string | "128M" | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.bufferMaxSize | string | "2G" | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.listen | string | "0.0.0.0" | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.port | int | 8080 | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.successfulResponseCode | int | 201 | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.tag | string | "monit.prom.k8s" | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.tagFromUri | bool | false | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.threaded | bool | false | |
| metrics.fluentbit.prometheusRemoteWriteInputConfig.uri | string | "/api/prom/push" | |
| metrics.fluentbit.replicas | int | 2 | |
| metrics.fluentbit.resources.limits.cpu | string | "1" | |
| metrics.fluentbit.resources.limits.memory | string | "1Gi" | |
| metrics.fluentbit.resources.requests.cpu | string | "1" | |
| metrics.fluentbit.resources.requests.memory | string | "512Mi" | |
| metrics.fluentbit.service | string | partly autogenerated -- see values.yaml | fluentbit service configuration options in a multiline string |
| metrics.ingress.nginx.serviceMonitor.relabelings | list | [] | |
| metrics.kubeProxy.serviceMonitor.relabelings | list | [] | |
| metrics.kubeState.enabled | bool | true | if true, kube state will be installed together with a service monitor |
| metrics.kubeState.nodeSelector | object | {} | |
| metrics.kubeState.resources.limits.cpu | string | "20m" | |
| metrics.kubeState.resources.limits.memory | string | "25Mi" | |
| metrics.kubeState.resources.requests.cpu | string | "5m" | |
| metrics.kubeState.resources.requests.memory | string | "15Mi" | |
| metrics.kubeState.scrapeInterval | string | "30s" | indicates how often this exporter will be scraped by the local prometheus |
| metrics.kubeState.serviceMonitor.relabelings | list | [] | |
| metrics.kubecontroller.serviceMonitor.relabelings | list | [] | |
| metrics.kubelet.serviceMonitor.relabelings | list | [] | |
| metrics.nodeExporter.enabled | bool | true | if true, node exporter will be installed as a daemon set together with a pod monitor |
| metrics.nodeExporter.resources.limits.cpu | string | "20m" | |
| metrics.nodeExporter.resources.limits.memory | string | "25Mi" | |
| metrics.nodeExporter.resources.requests.cpu | string | "5m" | |
| metrics.nodeExporter.resources.requests.memory | string | "15Mi" | |
| metrics.nodeExporter.scrapeInterval | string | "" | indicates how often this exporter will be scraped by the local prometheus |
| metrics.nodeExporter.serviceMonitor.relabelings | list | [] | |
| metrics.prometheus.enabled | bool | true | if true, the prometheus operator and a prometheus server will be installed |
| metrics.prometheus.operator.nodeSelector | object | {} | |
| metrics.prometheus.operator.resources.limits.cpu | string | "100m" | |
| metrics.prometheus.operator.resources.limits.memory | string | "100Mi" | |
| metrics.prometheus.operator.resources.requests.cpu | string | "5m" | |
| metrics.prometheus.operator.resources.requests.memory | string | "25Mi" | |
| metrics.prometheus.server.extraLabelsForMetrics | object | {} | set of static labels and values to add to all the metrics gathered by the in-cluster prometheus when exported to central monitoring |
| metrics.prometheus.server.image | string | "registry.cern.ch/monit/cern-it-monitoring-prometheus:v2.53.3" | prometheus image used by the local cluster prometheus |
| metrics.prometheus.server.nodeSelector | object | {} | prometheus operator node selectors. If set, they override metrics.defaultNodeSelector |
| metrics.prometheus.server.relabelings | list | [] | allows dropping or relabeling node exporter metrics |
| metrics.prometheus.server.remoteWrite | object | {} | remote write prometheus configuration |
| metrics.prometheus.server.resources.limits.cpu | string | "500m" | |
| metrics.prometheus.server.resources.limits.memory | string | "5Gi" | |
| metrics.prometheus.server.resources.requests.cpu | string | "100m" | |
| metrics.prometheus.server.resources.requests.memory | string | "2Gi" | |
| metrics.prometheus.server.retention | string | "24h" | period during which the local cluster prometheus will store metrics |
| metrics.prometheus.server.scrapeInterval | string | "10s" | interval used to self scrape metrics |
| metrics.prometheus.server.scrapeTimeout | string | "5s" | timeout for self scraped metrics |
| metrics.prometheus.server.serviceMonitors | list | [] | service monitors to be created |
| metrics.prometheus.server.version | string | "v2.53.3" | prometheus version used by the local cluster prometheus |
| metrics.pushgateway.enabled | bool | false | pushgateway allows you to send metrics to the monitoring infrastructure by pushing them to the local cluster service it-monit-metrics-collector-pushgateway |
| metrics.pushgateway.image.pullPolicy | string | "IfNotPresent" | |
| metrics.pushgateway.image.repository | string | "registry.cern.ch/monit/cern-it-monitoring-pushgateway" | |
| metrics.pushgateway.image.tag | string | "v1.10.0" | |
| metrics.pushgateway.ingress.className | string | "" | |
| metrics.pushgateway.ingress.enabled | bool | false | if set to true, a new ingress will be registered with the given configuration |
| metrics.pushgateway.ingress.hosts | list | [] | |
| metrics.pushgateway.ingress.path | string | "/" | |
| metrics.pushgateway.ingress.pathType | string | "ImplementationSpecific" | |
| metrics.pushgateway.ingress.tls | object | {} | |
| metrics.pushgateway.nodeSelector | object | {} | if given, overrides the defaultNodeSelector and installs the component only on the nodes that match the given condition |
| metrics.pushgateway.resources.limits.cpu | float | 0.2 | |
| metrics.pushgateway.resources.limits.memory | string | "100Mi" | |
| metrics.pushgateway.resources.requests.cpu | float | 0.2 | |
| metrics.pushgateway.resources.requests.memory | string | "100Mi" | |
| metrics.scheduler.serviceMonitor.relabelings | list | [] | |
| otlp.endpoint | string | "monit-otlp.cern.ch" | otlp endpoint where the otlp receivers are listening |
| otlp.port | int | 4319 | otlp port where the otlp receivers are listening |
| tenant.name | string | "" | username used for authenticating in the MONIT infrastructure |
| tenant.password | string | "" | password (plain) used for authenticating in the MONIT infrastructure |
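
To illustrate how the keys above nest in a values file, the following sketch turns on some of the optional components that are disabled by default and adjusts a few Prometheus settings. It is an assumption-laden example, not a recommended configuration: the label name and value under extraLabelsForMetrics, the replica count, and the retention period are hypothetical, and it assumes that log collection needs both logs.enabled and logs.fluentbit.enabled set to true, which the table suggests but does not state outright.

```yaml
# Illustrative overrides only. Keys come from the Values table above;
# the chosen values are placeholders, not recommendations.
logs:
  enabled: true            # default: false
  fluentbit:
    enabled: true          # default: false (assumed to be needed as well)

metrics:
  alertmanager:
    enabled: true          # default: false
    replicas: 1            # default: 3; hypothetical value for a small cluster
  pushgateway:
    enabled: true          # default: false
  prometheus:
    server:
      retention: "48h"     # default: "24h"; hypothetical value
      # Static labels added to metrics exported to central monitoring.
      # The label name and value here are hypothetical.
      extraLabelsForMetrics:
        environment: "production"
```

Keys left out of the file keep the defaults listed in the table.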

Contributing

We welcome contributions! If you're interested in helping improve this project, please review our contribution guidelines. In brief:

  1. Fork the repository.
  2. Create a feature branch.
  3. Implement, provide tests and validate your changes.
  4. Submit a Merge Request (MR) to the master branch.

For a full contribution workflow, visit the contribution guide.

Documentation

Complete documentation for this chart, including setup and configuration details, is available:

  • GitLab Repository: link
  • Project Documentation: link

License

This repository is licensed under the Apache License 2.0. See the LICENSE file for more information.