Data Analytics
The project contains a suite of tools to run data analytics pipelines on the monitoring data of the CERN Cloud Infrastructure.
Supported functionality includes:
- Extraction of time series data from CERN databases: InfluxDB, ElasticSearch, HDFS
- Pre-processing of the data with the Spark Cluster (used in client mode)
- Analysis of time series for Anomaly detection
- Automation of the processing pipeline with Airflow
- Grafana extension for Annotation functionalities
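As a rough illustration of the pre-processing step, here is a minimal sketch using pandas; the column names (`ts`, `value`), the 5-minute resampling interval, and the interpolation limit are assumptions for the example, not the project's actual schema or parameters:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Resample a raw metric time series to a regular grid and normalize it.

    Assumes a 'ts' timestamp column and a 'value' metric column; both
    names are hypothetical, not taken from the project's schema.
    """
    df = df.set_index("ts").sort_index()
    # Regularize to 5-minute bins, filling short gaps by interpolation.
    regular = df["value"].resample("5min").mean().interpolate(limit=3)
    # Standardize so metrics on different scales become comparable.
    normalized = (regular - regular.mean()) / regular.std()
    return normalized.to_frame("value")

# Tiny synthetic example: one hour of readings at irregular times.
idx = pd.to_datetime(["2024-01-01 00:01", "2024-01-01 00:07",
                      "2024-01-01 00:16", "2024-01-01 00:31",
                      "2024-01-01 00:44", "2024-01-01 00:59"])
raw = pd.DataFrame({"ts": idx, "value": [1.0, 2.0, 1.5, 3.0, 2.5, 2.0]})
clean = preprocess(raw)
```

In a real run this step would typically be executed on the Spark cluster rather than with pandas; the sketch only shows the shape of the transformation.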
A central part of this project is anomaly detection on time series data. These time series can come from:
- metrics measured for each hypervisor in the Data Centre.
- time series derived from log-file analysis.
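To make the idea concrete, the following sketch flags anomalies in a metric series with a robust z-score; this is a deliberately simple baseline for illustration only, not one of the project's actual detectors (which are based on pyOD and ML/DL models):

```python
import numpy as np

def zscore_anomalies(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag points whose robust z-score exceeds `threshold`.

    Uses median / MAD instead of mean / std so that the anomalies
    themselves do not distort the baseline statistics.
    """
    median = np.median(values)
    # Median absolute deviation, scaled to be comparable to a std dev.
    mad = 1.4826 * np.median(np.abs(values - median))
    z = np.abs(values - median) / mad
    return z > threshold

# A flat series with one injected spike, as a hypervisor metric might look.
series = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 50.0, 10.1, 9.8])
flags = zscore_anomalies(series)  # only the spike at index 5 is flagged
```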
The CI/CD pipeline of this project is used to:
- Run unit tests and quality checks on the implemented code
- Build Docker images with the pre-installed libraries needed for the project's scope
- Run functional tests of the Data Analytics pipeline and its components
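In GitLab CI terms, the three tasks above might be laid out roughly as follows; every stage, job, image, and script name here is a hypothetical illustration, not the project's actual configuration:

```yaml
# Hypothetical .gitlab-ci.yml sketch mirroring the three CI/CD tasks above.
stages:
  - test
  - build
  - functional

unit_tests:
  stage: test
  image: python:3.9        # illustrative image, not the project's
  script:
    - pip install -r requirements.txt
    - pytest tests/        # unit tests
    - flake8 .             # quality checks

build_image:
  stage: build
  script:
    - docker build -t data-analytics:$CI_COMMIT_SHORT_SHA .

pipeline_functional_tests:
  stage: functional
  script:
    - ./run_functional_tests.sh   # hypothetical entry point
```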
Each subfolder of the repository is documented in detail in its own README file. This is a map of the repository:
- ETL libraries (link): extraction of data from the different monitoring databases (InfluxDB, ElasticSearch, HDFS)
- Test suite (link): unit tests of the ETL libraries and of the pipelines' components
- Javascript Grafana extension (link): extension of the Grafana Annotation panel, modifying the Grafana JS code
- Anomaly detection libraries (link): anomaly detection models based on pyOD and on traditional ML and DL methods
- Docker image definition (link): Dockerfiles for the images used in this project
- Airflow-based Anomaly Detection System (link): setup and run of the Anomaly Detection System
Where to start
Detailed procedures for newcomers (work in progress).