# Anomaly Detection Pipeline based on Airflow
The Anomaly Detection task can also be run in an automatic way.
For that we rely on [Apache Airflow](https://airflow.apache.org/).
To provide an easy-to-use environment, all the required building blocks (Airflow included) are encapsulated in Docker containers that can be run via Docker Compose.
The Airflow Docker Compose setup is heavily based on the examples found at https://github.com/puckel/docker-airflow.


This area is called the `Control room` and contains the procedures to deploy the Airflow setup and automate the Anomaly Detection task.

The folder includes:

1. Installation scripts ([link](install_AD.sh))<br>
   To be run once, when a new machine needs to be configured
1. Docker-compose configuration ([link](airflow-compose))<br>
   To set up the Airflow system
1. Configuration files ([link](config_file))<br>
   Configuration files used for .... #FIXME

   

## Getting started

We suggest running on a dedicated virtual machine (VM), which can be provisioned on the [OpenStack CERN Platform](https://openstack.cern.ch/). For initial tests we suggest starting with a flavor providing at least 7 GB of RAM.
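
For instance, with the OpenStack command-line client a suitable VM could be created as sketched below; the image, flavor, and key names are illustrative and should be replaced with values available in your project:

```
# Hypothetical example: create a CentOS 7 VM with enough RAM for the setup.
# Check `openstack flavor list` and `openstack image list` for real values.
openstack server create \
    --image "CC7 - x86_64" \
    --flavor m2.large \
    --key-name my_cloud_key \
    my-ad-vm
```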


1. Login to the VM (**tested on CentOS 7**) with the following port forwarding (port 8080 will expose the Airflow web UI):
```
VM=your_vm_name
ssh -L 5003:localhost:5003 -L 8080:localhost:8080 root@$VM
```

Note that, if running from outside CERN, you may need a double hop to get the port forwarding:

```
VM=your_vm_name
ssh -L 9999:$VM:22 lxtunnel.cern.ch
ssh  -o StrictHostKeyChecking=no  -i ~/.ssh/id_rsa -L 8080:localhost:8080 -L 5003:localhost:5003 localhost -p 9999  -l root
```
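
With a recent OpenSSH client the same double hop can be expressed in a single command through a jump host (a minimal sketch, equivalent to the two commands above):

```
VM=your_vm_name
ssh -J lxtunnel.cern.ch -L 5003:localhost:5003 -L 8080:localhost:8080 root@$VM
```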

2. When starting from a new VM, a few packages need to be installed, if not already available in the VM: for instance docker-compose, and the data-analytics package itself.
In addition, to enable the connection to the Spark cluster with `kinit`, the `secret` credentials have to be made available
and firewall rules have to be set up.
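
For instance, to check whether the main prerequisites are already in place (a sketch assuming standard CentOS 7 tooling):

```
# Rough check of the main prerequisites on the VM
docker --version              # Docker engine
docker-compose --version      # Docker Compose
command -v kinit              # Kerberos client, needed for the Spark connection
systemctl is-active firewalld # firewall service, in case rules must be added
```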

Does the VM require initial installation?
   * **No**: go to the next step
   * **YES**: run the [install_AD.sh](install_AD.sh) script.

Run the script on your system: it will download all the necessary files into the folder **/opt/ad_system/** of your current machine.
In general the branch should be **master** (default) or a given GitLab **tag**, but any other branch can be configured by changing the `branch` environment variable:
```
export branch=master
curl https://gitlab.cern.ch/cloud-infrastructure/data-analytics/-/raw/$branch/control_room/install_AD.sh -O 
. ./install_AD.sh
install_all
```
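
For example, to pin the installation to a GitLab tag instead (the tag name below is purely illustrative):

```
export branch=v1.0   # hypothetical tag name; any existing branch or tag works
```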

Then follow the instructions printed by the install_AD.sh script to finalise the setup.

3. Start the Docker Compose stack of the Airflow-based Anomaly Detection System with the following command:
```
sudo -u airflow /opt/ad_system/control_ad_system/start_ad_system.sh
```

NB: the script `/opt/ad_system/control_ad_system/start_ad_system.sh` can also be sourced, so that the running docker-compose setup can easily be deleted.
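
To verify that the stack came up, a quick look at the running containers can help (a sketch; the exact container names depend on the docker-compose project):

```
# List the running containers and their status
sudo docker ps --format 'table {{.Names}}\t{{.Status}}'

# Once the webserver is up, the Airflow UI should answer locally
curl -sI http://localhost:8080/ | head -n 1
```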

**Congratulations!** You have just completed the full installation of your Anomaly Detection System.


### Getting started with the Anomaly Detection DAG

Now that Airflow is up and running we can test the Anomaly Detection System and
its algorithms on a demo scenario.

1. Open the Airflow UI: http://localhost:8080/
1. Search for the DAG named **dag_ad_demo** and click on its name.
1. Click on the *graph view* tab to see the interconnections between the different tasks.
1. Click on the **on/off switch** next to the header *DAG: dag_ad_demo* to enable it (the same can also be done from the command line, as sketched below).
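
As an alternative to the UI, the demo DAG can be unpaused and triggered with the Airflow 1.x CLI from inside the container (a sketch; the `webserver` name filter assumes the default docker-compose service naming):

```
# Unpause and trigger the demo DAG via the Airflow 1.x CLI
sudo docker exec $(sudo docker ps -q -f name=webserver) airflow unpause dag_ad_demo
sudo docker exec $(sudo docker ps -q -f name=webserver) airflow trigger_dag dag_ad_demo
```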

**Congratulations!** You have just started your first Anomaly Detection pipeline. Then check its successful termination via the *graph view*:
when all the boxes are dark green, the pipeline has completed.

## Additional Documentation

The Anomaly Detection System driven by Airflow can be started in different ways:
1. using Docker Compose on a given VM (our implemented choice!)
1. using Docker Swarm (requires that a Swarm cluster is already up)
1. using Kubernetes (w.i.p.)

Details are in the following paragraphs.

### Docker Compose

This is the standard method. The instructions are in this same README.md.

### Docker Swarm 

This is still W.I.P.
In case of a Docker Swarm, look into the scripts in the folder [docker-swarm](./docker-swarm) to get started.

### Kubernetes
 
This is still W.I.P. 

Documentation can be found at

- https://kubernetes.io/blog/2018/06/28/airflow-on-kubernetes-part-1-a-different-kind-of-operator/
- https://airflow.apache.org/docs/stable/kubernetes.html