# datascout

Simple package to handle data saving and reading with a minimum of required libraries. Mainly used as a dependency of `pyjapcscout`, but it can be used for other purposes as well (maybe).
## Purpose of this project

The idea is to provide a few sweet functions to go from a nested dict of numpy arrays to `parquet` (and to `pickle` and `json`) and back, preserving the data types (except for `json`, for which no way back is implemented here!). Data type preservation is important for the round trip of machine parameter reading, saving and setting.
This package is meant to be simple, with very few dependencies, to allow for home data analysis without the need for the CERN TN network or Java libraries.

The basic data unit (or dataset) is assumed to be a (nested) dictionary of numpy values and numpy arrays. Lists are in principle not allowed (at least not supported) inside a dataset. On the other hand, lists can be used to define a list of datasets (e.g. a list of consecutive acquisitions of accelerator data).
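For example, a dataset could look like the following (a sketch with hypothetical device and field names, just to illustrate the expected structure):

```python
import numpy as np

# A hypothetical dataset: a nested dict of numpy scalars and numpy arrays.
# Note that plain Python lists are not used *inside* the dataset.
dataset = {
    "device1": {
        "current": np.float64(12.5),               # numpy scalar value
        "waveform": np.array([1.0, 2.0, 3.0]),     # 1D numpy array
        "image": np.ones((2, 3), dtype=np.int32),  # 2D numpy array
    },
    "timestamp": np.int64(1650000000),
}

# Lists are instead allowed *between* datasets, e.g. consecutive acquisitions:
acquisitions = [dataset, dataset]
```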
## Getting started

First you need to install this package in your (virtual) environment. Presently, the suggested way is a local folder installation:

```
git clone https://gitlab.cern.ch/abpcomputing/sandbox/datascout.git datascout
cd datascout
python -m pip install -e .
```

This way one can easily follow the package's development by doing a simple `git pull` within the `datascout` folder created above.
This package provides the following (main) functions. Note that many of these functions are simple wrappers of external functions (from `pandas`, `pyarrow`, `awkward`), but sometimes with some tweaks to make sure data type/shape is always preserved.
- `dict_to_pandas(input_dict)`: creates a `pandas` dataframe from a (list of) `dict`.
- `dict_to_awkward(input_dict)`: creates an `awkward` array from a (list of) `dict`.
- `dict_to_parquet(input_dict, filename)`: saves a (list of) `dict` into a `parquet` file. In order to do so, 2D arrays are split into 1D arrays of 1D arrays.
- `dict_to_pickle(input_dict, filename)`: saves a (list of) `dict` into a `pickle` file.
- `dict_to_json(input_dict, filename)`: saves a (list of) `dict` into a `json` file.
- `json_to_pandas(filename)`: loads a `pandas` dataframe from a `json` file. This function is not so interesting (because data types/shapes are not preserved), but it is provided for convenience.
- `pandas_to_dict(input_pandas)`: converts a `pandas` dataframe back into a (list of) `dict`.
- `awkward_to_dict(input_awkward)`: converts an `awkward` array back into a (list of) `dict`. In order to preserve data type/shape, it re-merges 1D arrays of 1D arrays into 2D arrays.
- `parquet_to_dict(filename)`: loads a (list of) `dict` from a `parquet` file. In order to preserve data type/shape, it re-merges 1D arrays of 1D arrays into 2D arrays.
- `pickle_to_dict(filename)`: loads a (list of) `dict` from a `pickle` file.
- `pandas_to_awkward(input_pandas)`: creates an `awkward` array starting from a `pandas` dataframe.
- `awkward_to_pandas(input_awkward)`: creates a `pandas` dataframe starting from an `awkward` array.
- `parquet_to_pandas(filename)`: loads a `parquet` file into a `pandas` dataframe. Instead of using the method provided by `pandas` (which does not preserve single value types and 2D arrays), it first loads the parquet as a `dict`, and then converts it into a `pandas` dataframe.
- `parquet_to_awkward(filename)`: loads a `parquet` file into an `awkward` array.
- `save_dict(dictData, folderPath=None, filename=None, fileFormat='parquet')`: additional wrapper of a few of the functions above to easily save a `dict` to a file using a supported format (`parquet` and `pickle` for the time being).
- `load_dict(filename, fileFormat='parquet')`: reads a file assuming the given format and returns its content as a `dict` (which can then be converted to other formats...).
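The 2D-array handling mentioned above can be illustrated with plain numpy (a conceptual sketch of the idea, not datascout's actual implementation): on writing, a 2D array is stored as a 1D sequence of 1D rows, and on reading the rows are re-stacked so that the original shape and dtype are recovered:

```python
import numpy as np

original = np.arange(6, dtype=np.int32).reshape(2, 3)

# On saving: split the 2D array into a sequence of 1D row arrays,
# which is a layout that parquet can represent natively.
rows = [row for row in original]

# On loading: re-merge the 1D arrays of 1D arrays into a 2D array.
restored = np.stack(rows)

# The round trip preserves shape, dtype and values.
assert restored.shape == original.shape
assert restored.dtype == original.dtype
assert np.array_equal(restored, original)
```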
## How to develop it

I set up this package as follows:
- Created the project on gitlab (https://gitlab.cern.ch/abpcomputing/sandbox/datascout)
- Cloned it on my CERN virtual machine with access to acc-py:

  ```
  git clone https://:@gitlab.cern.ch:8443/abpcomputing/sandbox/datascout.git datascout
  source /acc/local/share/python/acc-py/base/pro/setup.sh
  acc-py init
  acc-py init-ci
  ```
- Filled in / added my functions and tests
- Created a virtual environment in which to eventually install it:

  ```
  python -m venv ./venv --system-site-packages
  source ./venv/bin/activate
  ```

- Installed the package in "editable mode" in the virtual environment:

  ```
  python -m pip install -e .
  ```
- Created the documentation skeleton:

  ```
  acc-py init-docs
  ```

  then started populating the files under the `docs` folder with the desired content...
- Checked code style using `black`:

  ```
  python -m pip install black
  black --diff .  # to see what `black` proposes to do to your source code
  black .         # to let `black` edit the source code
  ```
- Released the product (see the acc-py documentation on the wikis):

  ```
  acc-py build
  acc-py devrelease
  ```