# Anomaly Detection Pipeline based on Airflow

The Anomaly Detection task can also be run automatically.
For that we rely on [Apache Airflow](https://airflow.apache.org/).
To provide an easy-to-use environment, we encapsulated all the required building blocks (Airflow included) in Docker containers that can be run with *Docker Compose*.
The Airflow Docker Compose setup is heavily based on the examples found in https://github.com/puckel/docker-airflow.


This area is called the `Control room` and contains the procedures to deploy the Airflow setup and automate the Anomaly Detection task.

The folder includes:

1. Installation scripts ([install_AD.sh](install_AD.sh))<br>
   To be run once when a new machine needs to be configured  
1. Docker-compose configuration ([airflow-compose](airflow-compose))<br>
   To set up the Airflow system
1. Docker Swarm configuration, work in progress ([docker-swarm](docker-swarm))<br>

## Getting started

The set of components that will be deployed by the following procedure is shown in this image:
<br><img src="documentation/images/AD_components_deployed.png" width="70%"><br>

We suggest running on a dedicated virtual machine (VM), which can be provisioned on the [OpenStack CERN Platform](https://openstack.cern.ch/).
<br>For initial tests we suggest starting with a flavor providing at least 7 GB of RAM.


1. Log in to the VM (**tested on CentOS 7**) with the following port forwarding:
```
VM=your_vm_name
ssh -L 5003:localhost:5003 -L 8080:localhost:8080 root@$VM
```

Note that, if running from outside CERN, you may need a double hop to get the port forwarding:

```
VM=your_vm_name
ssh -L 9999:$VM:22 lxtunnel.cern.ch
ssh  -o StrictHostKeyChecking=no  -i ~/.ssh/id_rsa -L 8080:localhost:8080 -L 5003:localhost:5003 localhost -p 9999  -l root
```
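The same double hop can also be expressed once in your ssh client configuration. This is a sketch, assuming OpenSSH 7.3 or newer for `ProxyJump`; the `advm` alias is invented and `your_vm_name` is a placeholder as above:

```
# ~/.ssh/config sketch: double hop via lxtunnel with both port forwardings
Host advm
    HostName your_vm_name
    User root
    ProxyJump lxtunnel.cern.ch
    LocalForward 8080 localhost:8080
    LocalForward 5003 localhost:5003
```

After that, a plain `ssh advm` performs the hop and sets up both forwardings.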

2. When starting from a new VM, a few packages need to be installed, if not already available in the VM:
for instance Docker Compose, and the data-analytics package itself.
In addition, to enable the connection to the Spark cluster with `kinit`, the `secret` credentials have to be made available
and firewall rules have to be set up.
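As a quick sanity check before deciding, you can probe for the main prerequisites like this (a sketch; the command list is illustrative, adapt it to your setup):

```
# Sketch: probe for the prerequisites mentioned above.
for cmd in docker docker-compose kinit curl; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "missing: $cmd"
  fi
done
```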

Does the VM require initial installation?
   * **No**: go to the next step
   * **Yes**: run the [install_AD.sh](install_AD.sh) script.

Running the script downloads all the necessary files into the folder **/opt/ad_system/** of your current machine.
In general the branch should be **master** (default) or a given GitLab **tag**, but any other branch can be configured by changing the environment variable `branch`:
```
export branch=master
curl https://gitlab.cern.ch/cloud-infrastructure/data-analytics/-/raw/$branch/deploy_AD/install_AD.sh -O 
. ./install_AD.sh
install_all
```

Then follow the instructions printed by the install_AD.sh script to finalise the setup.

3. Start the Docker Compose setup of the Airflow-based Anomaly Detection System with the following command:
```
sudo -u airflow /opt/ad_system/control_ad_system/start_ad_system.sh
```

NB: the script `/opt/ad_system/control_ad_system/start_ad_system.sh` can also be sourced, to easily delete the running docker-compose setup.
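To illustrate why sourcing matters (the stub file and function name below are invented, not the script's actual contents): functions defined in a sourced script remain callable in the current shell, so cleanup helpers stay at hand afterwards.

```
# Illustrative stub: sourcing a script keeps its functions defined
# in the current shell, unlike executing it in a subshell.
cat > /tmp/demo_start_stub.sh <<'EOF'
remove_stack() { echo "would run: docker-compose down"; }
EOF
. /tmp/demo_start_stub.sh   # source it ...
remove_stack                # ... and the helper is now callable here
```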

**Congratulations!** You have just completed the full installation of your Anomaly Detection System.


### Getting started with the Anomaly Detection DAG

Now that Airflow is up and running we can test the Anomaly Detection System and
its algorithms on a demo scenario.

1. Open the Airflow UI: http://localhost:8080/
1. Select the *DAGs* tab from the Airflow menu.
1. Go to the *Graph View* tab to see the interconnections between the different tasks.
1. Click on the **on/off switch** next to the header *DAG: <dag_name>* to enable it.
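The same enable-and-trigger actions can also be sketched from the command line. This is a hedged sketch: the container name is a placeholder that depends on your compose setup, and `unpause`/`trigger_dag` are the Airflow 1.10-era CLI subcommands shipped by the puckel-based images this setup builds on.

```
DAG='<dag_name>'       # placeholder: your DAG id, as shown in the UI
WEB='<webserver_container>'   # placeholder: the Airflow webserver container
sudo docker exec "$WEB" airflow unpause "$DAG"
sudo docker exec "$WEB" airflow trigger_dag "$DAG"
```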

**Congratulations!** You have just started your first Anomaly Detection pipeline. Then check its successful termination via the *Tree View*:
when the last box is dark green, the pipeline has completed successfully.

## Additional Documentation

The Anomaly Detection System driven by Airflow can be started not only with Docker Compose (our implemented choice) but also with
Docker Swarm (which requires a Swarm cluster to be already up) or with Kubernetes. These two methods are still work in progress.

- In the case of Docker Swarm, to continue the work look at the scripts in the folder [docker-swarm](./docker-swarm)
- For Kubernetes, documentation can be found at:
   - https://kubernetes.io/blog/2018/06/28/airflow-on-kubernetes-part-1-a-different-kind-of-operator/
   - https://airflow.apache.org/docs/stable/kubernetes.html