Commit 01a846b4 authored by Konstantinos Samaras-Tsakiris

Show how to fetch logs from long-term HDFS storage

parent 5e0a0521
@@ -3,3 +3,7 @@
This project includes operations on Drupal sites, or across the infrastructure, that are not part of the drupalSite-operator, such as Tekton tasks.
We implement actions that infrastructure users can apply ad hoc to their websites, as well as actions that other infrastructure components can use to perform their tasks.
## Examples
The Examples directory contains one-off operations that show how something could be done, without providing full automation.
### tekton-tasks
Examples of backing up or restoring Drupal sites.
### logs-hdfs
How to fetch site logs from long-term storage on HDFS.
This Jupyter notebook should be run on the SWAN service ([swan.cern.ch](https://swan.cern.ch)) using the Spark plugin.
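
%% Cell type:markdown tags:

On SWAN with the Spark connection enabled, a `spark` session is injected into the notebook automatically. If you run the notebook elsewhere, a minimal sketch for creating the session yourself (the application name is arbitrary, chosen for illustration):

%% Cell type:code tags:

``` python
# Only needed outside SWAN: there the Spark plugin provides `spark`.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("drupal-logs-hdfs") \
    .getOrCreate()
```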
%% Cell type:code id:bb5bd38c tags:
``` python
from datetime import date, timedelta

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import col, countDistinct, date_format, from_unixtime
```
%% Cell type:code id:05d5f738 tags:
``` python
start_date = date(2020, 8, 1)
duration = timedelta(days=31)

def fetchLogs(dates):
    """Read the archived JSON access logs on HDFS for the given dates."""
    paths = ['/project/monitoring/archive/drupal/logs/prod8/drupal_service/drupal8/'
             + day.strftime("%Y/%m/%d") + '/*'
             for day in dates]
    return spark.read.json(paths)

def selectClientip(log, whereFilt):
    """Keep only the client IP and a day-resolution timestamp, filtered by whereFilt."""
    return log.select(
        col("data.clientip"),
        date_format(from_unixtime(col("metadata.timestamp") / 1000), "yyyy-MM-dd").alias("timestamp")
    ).where(whereFilt)

def concatClientipLogs(start_date, duration, whereFilt):
    """Fetch the client IPs for every day in [start_date, start_date + duration]."""
    dates = [start_date + timedelta(days=d) for d in range(duration.days + 1)]
    return selectClientip(fetchLogs(dates), whereFilt)

# Reference: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
def datebinning(col):
    """Bin timestamps by month; use F.dayofyear or F.weekofyear for finer bins."""
    return F.month(col)

def countUniqueIPinDatebin(df):
    """Count distinct client IPs in each date bin."""
    return df.groupBy("datebin").agg(countDistinct("clientip"))
```
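
%% Cell type:markdown tags:

Note that `spark.read.json` fails if any of the daily paths is absent (e.g. a day with no archived logs). A hedged sketch of a tolerant variant, `fetchLogsSafe` (a hypothetical helper, not part of the original notebook), assuming the Hadoop FileSystem API exposed through py4j:

%% Cell type:code tags:

``` python
# Hypothetical helper: skip days whose archive directory is missing on HDFS,
# so spark.read.json does not fail on a nonexistent path.
def fetchLogsSafe(dates):
    base = '/project/monitoring/archive/drupal/logs/prod8/drupal_service/drupal8/'
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    dirs = [base + day.strftime("%Y/%m/%d") for day in dates]
    existing = [d for d in dirs if fs.exists(jvm.org.apache.hadoop.fs.Path(d))]
    return spark.read.json([d + '/*' for d in existing])
```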
%% Cell type:code id:24fd5729 tags:
``` python
homeAccessLogs = concatClientipLogs(start_date, duration, 'data.program == "httpd" AND data.sitename == "home.cern"')
```
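
%% Cell type:markdown tags:

The DataFrame is built lazily, so before the expensive distinct-count aggregation it can help to sanity-check the inferred schema and a few rows (an optional step, not in the original notebook):

%% Cell type:code tags:

``` python
# Optional sanity check: inspect the inferred schema, then display five rows.
homeAccessLogs.printSchema()
homeAccessLogs.show(5, truncate=False)
```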
%% Cell type:code id:63c5d412 tags:
``` python
homeAccessLogsBinned = (
    countUniqueIPinDatebin(
        homeAccessLogs.withColumn('datebin', datebinning('timestamp'))
                      .select(["clientip", "datebin"])
    )
    .toPandas()
    .sort_values(by="datebin")
)
```
%% Cell type:code id:ed7b9057 tags:
``` python
homeAccessLogsBinned.sort_values(by="datebin")
# Earlier results with F.month as the date bin:
#   datebin  count(clientip)
# 0      12           108026
# 1       9           334389
# 2      10           306606
# 3      11           335701
# Earlier results with F.dayofyear:
#   avg(Sept): 13291
#   0.8386309601436553
```
%% Output
   datebin  count(clientip)
1        8           415728
0        9            17528
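
%% Cell type:markdown tags:

The binned counts come back as a small pandas DataFrame, so they can be plotted directly. A sketch with matplotlib; the count column is looked up by position because its exact name can vary across Spark versions:

%% Cell type:code tags:

``` python
import matplotlib.pyplot as plt

# Address the aggregate column by position to avoid assuming its exact name.
count_col = homeAccessLogsBinned.columns[1]
homeAccessLogsBinned.plot.bar(x="datebin", y=count_col, legend=False)
plt.ylabel("unique client IPs")
plt.show()
```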
%% Cell type:code id:2e3536a1 tags:
``` python
homeAccessLogsBinned.sort_values(by="datebin").to_csv("homeAccessLogs.csv", index=False)
```
@@ -8,9 +8,9 @@ spec:
     kind: ClusterTask
   params:
   - name: drupalSite
-    value: drupalsite-sample
+    value: arts
   - name: backupName
-    value: ravineet-1-tekton-test-fbfe0
+    value: arts-901a-20220530000645
   - name: namespace
-    value: ravineet-1
+    value: arts
   serviceAccountName: tektoncd