# Anomaly Detection Pipeline based on Airflow

The Anomaly Detection task can also be run automatically.
For this we rely on [Apache Airflow](https://airflow.apache.org/).
To provide an easy-to-use environment, all the required building blocks (Airflow included) are encapsulated in Docker containers that can be run via Docker Compose.
The Airflow Docker Compose setup is heavily based on the examples found at https://github.com/puckel/docker-airflow.


This area is called the `Control room` and contains the procedures to deploy the Airflow setup and automate the Anomaly Detection task.

The folder includes:

1. Installation scripts ([link](install_AD.sh))<br>
   To be run once when a new machine needs to be configured  
1. Docker-compose configuration ([link](airflow-compose))<br>
   To setup the Airflow system
1. Configuration files ([link](config_file))<br>
   Configuration files used for .... #FIXME
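
For orientation, this is the folder layout implied by the links above (a sketch, not an exhaustive listing):

```
control_room/
├── install_AD.sh      # one-time machine setup script
├── airflow-compose/   # Docker Compose configuration for Airflow
└── config_file/       # configuration files
```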

   

## Getting started

We suggest running on a dedicated virtual machine (VM), which can be provisioned on the [OpenStack CERN Platform](https://openstack.cern.ch/). For initial tests we suggest starting with a flavor providing at least 7 GB of RAM.
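
If you need to create such a VM, a minimal sketch using the OpenStack CLI is shown below; the flavor, image, and key names are assumptions and should be adapted to what your OpenStack project offers.

```
# Hypothetical example: the flavor and image names are assumptions --
# check `openstack flavor list` and `openstack image list` for what
# is actually available in your project.
openstack server create \
    --flavor m2.large \
    --image "CC7 - x86_64" \
    --key-name my_cloud_key \
    my-ad-vm
```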


1. Log in to the VM (**tested on CentOS 7**) with the following port forwarding:
```
VM=your_vm_name
ssh -L 5003:localhost:5003 -L 8080:localhost:8080 root@$VM
```

Note that, if running from outside CERN, you may need a double hop to get port forwarding:

```
VM=your_vm_name
ssh -L 9999:$VM:22 lxtunnel.cern.ch
ssh -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa -L 8080:localhost:8080 -L 5003:localhost:5003 localhost -p 9999 -l root
```
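
On recent OpenSSH versions (7.3+) the double hop can also be expressed as a single command with the `-J` (ProxyJump) option; this is a convenience sketch equivalent to the two commands above.

```
# Same double hop in one command via ProxyJump
VM=your_vm_name
ssh -J lxtunnel.cern.ch -L 8080:localhost:8080 -L 5003:localhost:5003 root@$VM
```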

2. When starting from a new VM, a few packages need to be installed, if not already available in the VM:
for instance docker-compose, and the data-analytics package itself.
In addition, to enable the connection to the Spark cluster with `kinit`, the `secret` credentials have to be made available
and firewall rules have to be set up.

Does the VM require initial installation?
   * **No**: go to the next step.
   * **Yes**: run the [install_AD.sh](install_AD.sh) script.

Run the script on your system and it will download all the necessary files into the folder **/opt/ad_system/** of your current machine.
In general the branch should be **master** (default) or a given GitLab **tag**, but any other branch can be configured by changing the environment variable `branch`:
```
export branch=master
curl https://gitlab.cern.ch/cloud-infrastructure/data-analytics/-/raw/$branch/control_room/install_AD.sh -O 
. ./install_AD.sh
install_all
```
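
After the script finishes, a quick sanity check (a sketch; the folder path is taken from the step above) can confirm that the prerequisites are in place:

```
# Verify that docker and docker-compose are available and that the
# installation folder was populated by install_AD.sh
docker --version
docker-compose --version
ls /opt/ad_system/
```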

3. Start the Docker Compose setup of the Airflow-based Anomaly Detection System with the following command:
```
sudo -u airflow /opt/ad_system/control_ad_system/start_ad_system.sh
```

NB: the script `/opt/ad_system/control_ad_system/start_ad_system.sh` can also be sourced, to easily delete the running docker-compose setup.
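
To verify that the stack came up, list the running containers; the exact container names depend on the compose file, so this is only a sketch:

```
# All Airflow-related containers should report an "Up" status
sudo docker ps --format '{{.Names}}\t{{.Status}}'
```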

**Congratulations!** You have just completed the full installation of your Anomaly Detection System.


## Getting started with your first Anomaly Detection DAG

Now that Airflow is up and running we can test the Anomaly Detection System and
its algorithms on a demo scenario.

Follow these steps:
1. Open the File Browser at http://localhost:5003/ and log in (username = admin, password = admin). Navigate to the folder **/airflow-compose/dags** and open the file
    **config_variables.py**. There you have to uncomment the deploy section:
    ```
    # DEPLOY
    SYSTEM_FOLDER = "..."
    DATALAKE_FOLDER = "..."
    TMP_CONFIG = "..."
    IMAGE_NAME = "..."
    ```
    and comment out the development section:
    ```
    # DEVELOPEMENT
    # SYSTEM_FOLDER = "..."
    # DATALAKE_FOLDER = "..."
    # TMP_CONFIG = "..."
    # IMAGE_NAME = "..."
    ```
1. Open the Airflow UI: http://localhost:8080/
1. Search for the DAG named **dag_ad_demo** and click on its name.
1. Click on the *graph view* tab to see the interconnections between the different tasks.
1. Click on the **on/off switch** next to the header *DAG: dag_ad_demo*.

**Congratulations!** You have just started your first Anomaly Detection pipeline. Check its successful termination via the *graph view*: when all the boxes are dark green, the pipeline has completed.
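
The same DAG can also be triggered and inspected from the command line inside the Airflow webserver container. The container name below is hypothetical (check `docker ps`), and the CLI syntax shown is the Airflow 1.x one, on which puckel/docker-airflow is based:

```
# "webserver" is a hypothetical container name -- check `docker ps`.
# Airflow 1.x CLI syntax; Airflow 2.x uses `airflow dags trigger` etc.
docker exec webserver airflow unpause dag_ad_demo
docker exec webserver airflow trigger_dag dag_ad_demo
docker exec webserver airflow list_dag_runs dag_ad_demo
```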


> **_NOTE:_** The file browser is used to create new Airflow DAGs (Directed Acyclic Graphs) and to modify the configuration files. Access it at http://localhost:5003/ with username = admin, password = admin.