This folder contains the library needed to download and prepare the data in HDFS:
1. cluster_utils.py + utils.py: methods to access the Spark context and prepare HDFS paths and folders.
1. etl_steps.py: low-level operations that define how the time series data are aggregated and at which granularity; the hard part of the Spark processing lives here, including the normalization strategy.
1. etl_pipeline.py: how to combine the basic steps into the final ETL pipeline, from the definition of what you want all the way to the data stored in the desired format in HDFS under the desired path.
<hr>
Let's go into more detail about the most important functions that you can find in the different files inside this folder (keep in mind that, especially for the atomic operations, you can find some documentation in the code about the functions and their expected parameters).
## etl_steps.py
Create the aggregate every x minutes (e.g. if every 10 minutes, the data between 15:10 and 15:20 are summarized with the mean statistic and get the timestamp 15:20).
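As an illustration, here is a minimal PySpark sketch of this kind of downsampling; the schema (`hostgroup`, `plugin`, `timestamp`, `value`) and the 10-minute granularity are assumptions for the example, not necessarily what etl_steps.py uses:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per (hostgroup, plugin, timestamp, value).
df = spark.createDataFrame(
    [("hg1", "cpu", "2021-03-01 15:12:00", 0.4),
     ("hg1", "cpu", "2021-03-01 15:17:00", 0.6)],
    ["hostgroup", "plugin", "timestamp", "value"],
).withColumn("timestamp", F.to_timestamp("timestamp"))

# Bucket the rows into 10-minute windows, summarize with the mean, and use
# the window *end* (e.g. 15:20) as the new timestamp.
aggregated = (
    df.groupBy("hostgroup", "plugin",
               F.window("timestamp", "10 minutes").alias("w"))
      .agg(F.mean("value").alias("value"))
      .withColumn("timestamp", F.col("w.end"))
      .drop("w")
)
aggregated.show(truncate=False)
```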
- **normalize**(spark, df, df_normalization)
Subtract the mean from the value column and divide it by the standard deviation.
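A minimal sketch of what this step could look like in PySpark, assuming `df_normalization` carries `(hostgroup, plugin, mean, stddev)` columns; the actual implementation in etl_steps.py may differ:

```python
from pyspark.sql import DataFrame, functions as F

def normalize_sketch(df: DataFrame, df_normalization: DataFrame) -> DataFrame:
    """Z-score the `value` column using precomputed coefficients."""
    return (
        df.join(df_normalization, on=["hostgroup", "plugin"], how="left")
          .withColumn("value",
                      (F.col("value") - F.col("mean")) / F.col("stddev"))
          .drop("mean", "stddev")
    )
```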
Create a window with the timesteps for history and future: create the lagged timesteps for each column (aka plugin), and do the same for the future steps. Note that, beforehand, all the missing timesteps have to be replaced with a null value.
Note that an id identifying that group of plugins is also created.
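A hypothetical PySpark sketch of the windowing idea, assuming one column per plugin and a timestamp ordering within each hostgroup (names and step counts are placeholders):

```python
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window

def make_lagged_windows(df: DataFrame, plugin_columns: list,
                        history_steps: int = 3,
                        future_steps: int = 1) -> DataFrame:
    """Add lagged (past) and lead (future) columns for every plugin column.
    Missing timesteps are assumed to be already filled with null rows."""
    w = Window.partitionBy("hostgroup").orderBy("timestamp")
    for col in plugin_columns:
        for i in range(1, history_steps + 1):
            df = df.withColumn(f"{col}_h{i}", F.lag(col, i).over(w))
        for i in range(1, future_steps + 1):
            df = df.withColumn(f"{col}_f{i}", F.lead(col, i).over(w))
    # A unique id to identify each group of lagged plugin values.
    return df.withColumn("window_id", F.monotonically_increasing_id())
```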
## etl_pipeline.py
- **run_pipeline_all_in_one**(spark, config_filepath)
A single function, called in the ETL Airflow pipeline, that produces the windowed datasets in HDFS, divided by day and hostgroup (it uses almost every main function of etl_steps.py).
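A possible invocation from a Spark/Airflow job (the Spark session setup and the config path are placeholders, not the actual deployment code):

```python
from pyspark.sql import SparkSession
from etl_pipeline import run_pipeline_all_in_one

spark = SparkSession.builder.appName("etl_all_in_one").getOrCreate()

# Hypothetical config describing plugins, hostgroups, time range,
# granularity and the HDFS output path.
run_pipeline_all_in_one(spark, "/path/to/etl_config.json")
```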
Run the pipeline to get the coefficients and create a normalization dataframe. It produces normalization datasets in HDFS with the normalization coefficients (e.g. mean and stddev) for every pair of (hostgroup, plugin).
(This, for example, is not used in the "all_in_one" function above, but it is used in a dedicated Airflow task to prepare the normalization coefficients.)
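A hedged sketch of how such coefficients could be computed and persisted, assuming the same `(hostgroup, plugin, value)` schema as in the examples above:

```python
from pyspark.sql import DataFrame, functions as F

def compute_normalization_coefficients(df: DataFrame,
                                       out_path: str) -> DataFrame:
    """Compute mean and stddev of `value` for every (hostgroup, plugin)
    pair and persist them to HDFS as the normalization dataset."""
    df_normalization = (
        df.groupBy("hostgroup", "plugin")
          .agg(F.mean("value").alias("mean"),
               F.stddev("value").alias("stddev"))
    )
    df_normalization.write.mode("overwrite").parquet(out_path)
    return df_normalization
```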