# Anomaly Detection Pipeline based on Airflow

The Anomaly Detection task can also be run automatically.
For this we rely on [Apache Airflow](https://airflow.apache.org/).
To provide an easy-to-use environment, all the required building blocks (Airflow included) are encapsulated in Docker containers that can be run via Docker Compose.
The Airflow Docker Compose setup is heavily based on the examples found in https://github.com/puckel/docker-airflow


This area is called the `Control room` and contains the procedures to deploy the Airflow setup and automate the Anomaly Detection task.

The folder includes:

1. Installation scripts ([link](install_AD.sh))<br>
   To be run once when a new machine needs to be configured  
1. Docker-compose configuration ([link](airflow-compose))<br>
   To set up the Airflow system
1. Docker-swarm configuration ([link](docker-swarm))<br>
   To run the system on a Docker Swarm cluster (work in progress)

## Getting started

We suggest running on a dedicated virtual machine (VM), which can be provisioned on the [OpenStack CERN Platform](https://openstack.cern.ch/). For initial tests we suggest starting with a flavor providing at least 7 GB of RAM.
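
For reference, here is a hedged sketch of provisioning such a VM with the OpenStack CLI; the flavor, image and key names below are assumptions, check `openstack flavor list` and `openstack image list` for the values available in your project:

```
# Sketch only: create a CentOS 7 VM with at least 7 GB of RAM
# (flavor, image and key names are assumptions, adapt to your project)
openstack server create \
    --flavor m2.large \
    --image "CC7 - x86_64" \
    --key-name my_cloud_key \
    your_vm_name
```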


1. Log in to the VM (**tested on CentOS 7**) with the following port forwarding:
```
VM=your_vm_name
# Forward port 8080 (Airflow web UI) and port 5003 to localhost
ssh -L 5003:localhost:5003 -L 8080:localhost:8080 root@$VM
```

Note that, if running from outside CERN, you may need to double hop to get the port forwarding:

```
VM=your_vm_name
# First hop: tunnel to the VM's ssh port via lxtunnel.cern.ch
ssh -L 9999:$VM:22 lxtunnel.cern.ch
# Second hop: connect through the tunnel, forwarding the UI ports
ssh -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa -L 8080:localhost:8080 -L 5003:localhost:5003 localhost -p 9999 -l root
```

2. When starting from a new VM, a few packages need to be installed, if not already available: for instance Docker Compose, and the data-analytics package itself.
In addition, to enable the connection to the Spark cluster with `kinit`, the `secret` credentials have to be made available
and firewall rules have to be set up.
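
As an illustration (not performed by the commands shown here), acquiring and checking a Kerberos ticket for the Spark connection typically looks like the following; the username and realm are assumptions:

```
# Hedged example: acquire a Kerberos ticket for the Spark cluster
kinit your_username@CERN.CH
# Verify that a valid ticket is present
klist
```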

Does the VM require initial installation?
   * **No**: go to the next step
   * **Yes**: run the [install_AD.sh](install_AD.sh) script.

Run the script on your system: it downloads all the necessary files into the folder **/opt/ad_system/** of your current machine.
In general the branch should be **master** (default) or a given GitLab **tag**, but any other branch can be configured by changing the env variable `branch`:
```
export branch=master
# Download the installation script from the selected branch and source it
curl https://gitlab.cern.ch/cloud-infrastructure/data-analytics/-/raw/$branch/control_room/install_AD.sh -O
. ./install_AD.sh
install_all
```
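
To pin the installation to a specific GitLab tag instead of master, set the variable before downloading (the tag name below is hypothetical):

```
# Hypothetical tag name: replace with an existing tag of the repository
export branch=v1.0
```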

Then follow the instructions printed by the install_AD.sh script to finalise the setup.

3. Start the Docker Compose setup of the Airflow-based Anomaly Detection System with the following command:
```
sudo -u airflow /opt/ad_system/control_ad_system/start_ad_system.sh
```

NB: the script `/opt/ad_system/control_ad_system/start_ad_system.sh` can also be sourced, to easily delete the running docker-compose setup.
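
A minimal sketch of that usage, assuming the functions defined by the script become available in the current shell once sourced (the exact cleanup helpers depend on the script itself):

```
# Source instead of executing, so the script's functions and variables
# stay available in the current shell for a later teardown
source /opt/ad_system/control_ad_system/start_ad_system.sh
```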

**Congratulations!** You have just completed the full installation of your Anomaly Detection System.


### Getting started with the Anomaly Detection DAG

Now that Airflow is up and running, we can test the Anomaly Detection System and
its algorithms on a demo scenario.

1. Open the Airflow UI: http://localhost:8080/
1. Search for the DAG named **dag_ad_demo** and click on its name.
1. Click on the *graph view* tab to see the interconnections between the different tasks.
1. Click on the **on/off switch** next to the header *DAG: dag_ad_demo* (a command-line alternative is sketched below).
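
If you prefer the command line, a hedged alternative is to unpause and trigger the DAG from inside the Airflow webserver container; the container name below is an assumption (check `docker ps`), and the commands follow the Airflow 1.10 CLI used by puckel/docker-airflow:

```
# Container name is an assumption: look it up with `docker ps`
docker exec -it airflow_webserver airflow unpause dag_ad_demo
docker exec -it airflow_webserver airflow trigger_dag dag_ad_demo
```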

**Congratulations!** You have just started your first Anomaly Detection pipeline. Check its successful termination via the *graph view*:
when all the boxes are dark green, the pipeline has completed.

## Additional Documentation

The Anomaly Detection System driven by Airflow can be started not only with Docker Compose (our implemented choice) but also with
Docker Swarm (which requires a Swarm cluster to be already up) or with Kubernetes. These two methods are still work in progress.
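
For the Swarm route, a hedged sketch of the deployment commands (the compose file and stack names are assumptions; see the scripts in the [docker-swarm](./docker-swarm) folder for the actual procedure):

```
# On the manager node: initialise the Swarm cluster if not already done
docker swarm init
# Deploy the compose file as a stack (file and stack names are assumptions)
docker stack deploy -c docker-compose.yml ad_system
```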

- In the case of Docker Swarm, to continue the work look at the scripts in the folder [docker-swarm](./docker-swarm)
- For Kubernetes, documentation can be found at:
  - https://kubernetes.io/blog/2018/06/28/airflow-on-kubernetes-part-1-a-different-kind-of-operator/
  - https://airflow.apache.org/docs/stable/kubernetes.html