cloud-infrastructure / data-analytics
Commit 0a62a3dd, authored Apr 30, 2021 by smetaj
added transient hdfs raw
parent ebb64123
Changes: 1
etl/spark_etl/etl_pipeline.py
...
@@ -27,6 +27,7 @@ from adcern.algo_steps import read_window_dataset
from pyspark.sql.utils import AnalysisException
import shutil
import subprocess


def run_pipeline(spark, config_filepath):
...
@@ -432,9 +433,18 @@ def materialize_locally(spark, config_filepath,
        .write.format("parquet") \
        .mode("overwrite") \
        .save(hdfs_outfolder)
    copy_to_local(hdfs_path=hdfs_outfolder, local_path=local_outfolder)
    raw_folder = config_dict["hdfs_out_folder"] + project_code
    print("Deleting the raw data saved in %s ..." % raw_folder)
    try:
        subprocess.call(["hdfs", "dfs", "-rm", "-r", raw_folder])
    except Exception as e_delete:
        print('Error while deleting raw_folder directory: ', e_delete)


def get_normalization_path(spark, config_filepath):
    """Given the config dictionary path get the normalization path."""
...
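
Review note on the deletion step in the hunk above: `subprocess.call` returns the command's exit status instead of raising, so the `except` branch only fires when the `hdfs` executable cannot be launched at all. A failed `hdfs dfs -rm -r` (for example, a missing folder or a permissions problem) would pass silently. The following is a minimal sketch of a stricter variant, not the author's code. The helper name `delete_hdfs_folder` and its `runner` parameter are hypothetical and introduced only for illustration.

```python
import subprocess


def delete_hdfs_folder(raw_folder, runner=subprocess.run):
    """Delete raw_folder from HDFS and report whether it succeeded.

    Hypothetical helper, not part of the commit. `runner` is injectable
    so the function can be exercised without a Hadoop installation.
    """
    cmd = ["hdfs", "dfs", "-rm", "-r", raw_folder]
    try:
        # check=True raises CalledProcessError on a non-zero exit status,
        # which subprocess.call would only return as an integer.
        runner(cmd, check=True)
    except FileNotFoundError as e_missing:
        # The hdfs executable is not on PATH.
        print("hdfs command not found:", e_missing)
        return False
    except subprocess.CalledProcessError as e_delete:
        print("Error while deleting raw_folder directory:", e_delete)
        return False
    return True
```

Under these assumptions, the call site in `materialize_locally` could become `delete_hdfs_folder(raw_folder)`. The boolean return value lets the caller decide whether leftover raw data should abort the pipeline or only be logged.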