Data Analytics
The project contains a suite of tools to run data analytics pipelines on the monitoring data of the CERN Cloud Infrastructure.
Supported functionality includes:
- Extraction of time series data from CERN databases: InfluxDB, ElasticSearch, HDFS
- Pre-processing of the data with the Spark Cluster (used in client mode)
- Analysis of time series for Anomaly detection
- Automation of the processing pipeline with Airflow
- Grafana extension for Annotation functionalities
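As a rough illustration of the pre-processing step, here is a minimal sketch using pandas; the column names (`ts`, `value`), the 5-minute resampling interval, and the interpolation limit are assumptions for the example, not the project's actual schema or parameters:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Resample a raw metric time series to a regular grid and normalize it.

    Assumes a 'ts' timestamp column and a 'value' metric column; both
    names are hypothetical, not taken from the project's schema.
    """
    df = df.set_index("ts").sort_index()
    # Regularize to 5-minute bins, filling short gaps by interpolation.
    regular = df["value"].resample("5min").mean().interpolate(limit=3)
    # Standardize so metrics on different scales become comparable.
    normalized = (regular - regular.mean()) / regular.std()
    return normalized.to_frame("value")

# Tiny synthetic example: one hour of readings at irregular times.
idx = pd.to_datetime(["2024-01-01 00:01", "2024-01-01 00:07",
                      "2024-01-01 00:16", "2024-01-01 00:31",
                      "2024-01-01 00:44", "2024-01-01 00:59"])
raw = pd.DataFrame({"ts": idx, "value": [1.0, 2.0, 1.5, 3.0, 2.5, 2.0]})
clean = preprocess(raw)
```

In a real run this step would typically be executed on the Spark cluster rather than with pandas; the sketch only shows the shape of the transformation.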
A central part of this project is anomaly detection on time series data. These time series can come from:
- metrics measured for each hypervisor in the Data Centre.
- time series derived from log-file analysis.
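To make the idea concrete, the following sketch flags anomalies in a metric series with a robust z-score; this is a deliberately simple baseline for illustration only, not one of the project's actual detectors (which are based on pyOD and ML/DL models):

```python
import numpy as np

def zscore_anomalies(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag points whose robust z-score exceeds `threshold`.

    Uses median / MAD instead of mean / std so that the anomalies
    themselves do not distort the baseline statistics.
    """
    median = np.median(values)
    # Median absolute deviation, scaled to be comparable to a std dev.
    mad = 1.4826 * np.median(np.abs(values - median))
    z = np.abs(values - median) / mad
    return z > threshold

# A flat series with one injected spike, as a hypervisor metric might look.
series = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 50.0, 10.1, 9.8])
flags = zscore_anomalies(series)  # only the spike at index 5 is flagged
```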
The CI/CD pipeline of this project is used to:
- Run unit tests and quality checks on the implemented code
- Build Docker images with the pre-installed libraries needed for the project's scope
- Run functional tests of the Data Analytics pipeline and its components
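In GitLab CI terms, the three tasks above might be laid out roughly as follows; every stage, job, image, and script name here is a hypothetical illustration, not the project's actual configuration:

```yaml
# Hypothetical .gitlab-ci.yml sketch mirroring the three CI/CD tasks above.
stages:
  - test
  - build
  - functional

unit_tests:
  stage: test
  image: python:3.9        # illustrative image, not the project's
  script:
    - pip install -r requirements.txt
    - pytest tests/        # unit tests
    - flake8 .             # quality checks

build_image:
  stage: build
  script:
    - docker build -t data-analytics:$CI_COMMIT_SHORT_SHA .

pipeline_functional_tests:
  stage: functional
  script:
    - ./run_functional_tests.sh   # hypothetical entry point
```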
Each subfolder of the repository is documented in detail in its own README file. This is a map of the repository:
- ETL libraries (link): extraction of data from the different monitoring databases (InfluxDB, ElasticSearch, HDFS)
- Test suite (link): unit tests of the ETL libraries and of the pipelines' components
- Javascript Grafana extension (link): extension of the Grafana Annotation panel, modifying the Grafana JS code
- Anomaly detection libraries (link): anomaly detection models based on pyOD and on traditional ML and DL methods
- Docker image definition (link): Dockerfiles for the images used in this project
- Airflow-based Anomaly Detection System (link): setup and run of the Anomaly Detection System
Where to start
Detailed procedures for newcomers (work in progress).