datascout

Simple package to handle data saving and reading with a minimum of required libraries. Mainly used as a dependency of pyjapcscout, but it can be used for other purposes as well (maybe).

Purpose of this project

The idea is to provide a few sweet functions to go from a nested dict of numpy arrays to parquet (and to pickle, and json) and come back preserving the data types (except for json, for which no way back is implemented here!). The data type preservation aspect is important for the round-trip of machine parameter reading, saving and setting. This package is meant to be simple enough and with very few dependencies to allow for data analysis at home without the need of the CERN TN network or Java libraries.

The basic data unit (or dataset) is assumed to be a (nested) dictionary of numpy values and numpy arrays. Lists are in principle not allowed (at least not supported) inside a dataset. On the other hand, lists might be used to define a list of datasets (e.g. a list of consecutive acquisitions of accelerator data).
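
For example, a valid dataset and a list of datasets could look as follows (a minimal sketch; all field names and values are invented for illustration):

import numpy as np

# a single dataset: a (nested) dict of numpy scalars and numpy arrays
dataset = {
    'timestamp': np.int64(1620000000),
    'bpm': {
        'position': np.float64(0.25),
        'orbit': np.array([0.1, 0.2, 0.3]),
        'image': np.ones((4, 4)),  # 2D arrays are fine
    },
}

# lists are instead used to collect several datasets,
# e.g. consecutive acquisitions of accelerator data
acquisitions = [dataset, dataset]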

Getting started

First you need to install this package in your (virtual) environment. Presently, the suggested way is to go for a local folder installation:

git clone https://gitlab.cern.ch/abpcomputing/sandbox/datascout.git datascout
cd datascout
python -m pip install -e .

so that one can easily update the package and follow its development by doing a simple git pull within the datascout folder created above.

This package provides the following (main) functions; a short usage sketch is given after the list. Note that many of these functions are simple wrappers of external functions (from pandas, pyarrow, awkward), but sometimes with some tweaks to make sure that data types/shapes are always preserved.

  • dict_to_pandas(input_dict): Creates a pandas dataframe from a (list of) dict.
  • dict_to_awkward(input_dict): Creates an awkward array from a (list of) dict.
  • dict_to_parquet(input_dict, filename): Saves a (list of) dict into a parquet file. In order to do so, 2D arrays are split into 1D arrays of 1D arrays.
  • dict_to_pickle(input_dict, filename): Saves a (list of) dict into a pickle file.
  • dict_to_json(input_dict, filename): Saves a (list of) dict into a json file.
  • json_to_pandas(filename): Loads a pandas dataframe from a json file. This function is not so interesting (data types/shapes are not preserved), but it is provided for convenience.
  • pandas_to_dict(input_pandas): Converts a pandas dataframe back into a (list of) dict.
  • awkward_to_dict(input_awkward): Converts an awkward array back into a (list of) dict. In order to preserve data types/shapes, it re-merges 1D arrays of 1D arrays into 2D arrays.
  • parquet_to_dict(filename): Loads a (list of) dict from a parquet file. In order to preserve data types/shapes, it re-merges 1D arrays of 1D arrays into 2D arrays.
  • pickle_to_dict(filename): Loads a (list of) dict from a pickle file.
  • pandas_to_awkward(input_pandas): Creates an awkward array starting from a pandas dataframe.
  • awkward_to_pandas(input_awkward): Creates a pandas dataframe starting from an awkward array.
  • parquet_to_pandas(filename): Loads a parquet file into a pandas dataframe. Instead of using the method provided by pandas (which does not preserve single-value types and 2D arrays), it first loads the parquet file as a dict, and then converts it into a pandas dataframe.
  • parquet_to_awkward(filename): Loads a parquet file into an awkward array.
  • save_dict(dictData, folderPath=None, filename=None, fileFormat='parquet'): Additional wrapper of a few of the functions above to easily save a dict to a file using a supported format (parquet and pickle for the time being).
  • load_dict(filename, fileFormat='parquet'): Reads a file assuming the given format and returns its content as a dict (which can then be converted to other formats...)
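
As a minimal round-trip sketch (assuming the package is imported as datascout and that the filename is passed with its extension; field names and values are invented for illustration):

import numpy as np
import datascout as ds

data = {'current': np.float64(120.5), 'profile': np.ones((3, 4))}

# dict -> parquet -> dict: data types and shapes survive the round trip
ds.dict_to_parquet(data, 'data.parquet')
data_back = ds.parquet_to_dict('data.parquet')
assert data_back['profile'].shape == (3, 4)  # 2D array re-merged on load

# dict -> pandas -> dict works in memory, without touching disk
df = ds.dict_to_pandas(data)
data_again = ds.pandas_to_dict(df)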

How to develop it:

I set up this package as follows:

  1. Created the project on GitLab (https://gitlab.cern.ch/abpcomputing/sandbox/datascout)

  2. Cloned it on my CERN virtual machine with access to acc-py:

git clone https://:@gitlab.cern.ch:8443/abpcomputing/sandbox/datascout.git datascout
source /acc/local/share/python/acc-py/base/pro/setup.sh
acc-py init
acc-py init-ci
  3. Filled / added my functions and tests

  4. Created a virtual environment in which to eventually install it:

python -m venv ./venv --system-site-packages
source ./venv/bin/activate
  5. Installed the package in "editable" mode in the virtual environment:
python -m pip install -e .
  6. Created the documentation:
acc-py init-docs

then started populating the files under the docs folder with the desired content...

  7. Checked the code style using black:
python -m pip install black
black --diff .  # To see what `black` is proposing to do to your source code
black .         # To let `black` edit the source code
  8. Released the product: see the acc-py documentation on the wikis:
acc-py build
acc-py devrelease