diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..5ceb3864c2911029f0a6010fadab352e4b8e2d07
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+venv
diff --git a/README.md b/README.md
index 3140e112b75728c95407e3ebc913c84574380a5d..9739f650dc2bdf3ba197d9eaa285ad028f7dfe5e 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,13 @@ acc-py init-ci
 
 3. filled / added my functions and tests
 
-4. used it in some virtual environment as
+4. created a virtual environment in which to install it:
+```
+python -m venv ./venv --system-site-packages
+source ./venv/bin/activate
+```
+
+5. installed the package in "editable mode" in that virtual environment:
 ```
 python -m pip install -e .
 ```
diff --git a/datascout/__init__.py b/datascout/__init__.py
index 5847394c5acf31207d62506fa28084a2a12ddebf..5bc1d08fff8580ada6c059c4d4e9c6ffab4a2d93 100644
--- a/datascout/__init__.py
+++ b/datascout/__init__.py
@@ -6,6 +6,14 @@ list of sweet functions for data conversion and writing to disk
 
 __version__ = "0.0.1.dev0"
 
+# pyarrow-level functions, typically not used directly by a user,
+# but key functions for this package
+from ._datascout import dict_to_pyarrow
+from ._datascout import pyarrow_to_parquet
+from ._datascout import parquet_to_pyarrow
+from ._datascout import pyarrow_to_dict
+from ._datascout import pyarrow_to_pandas
+
 # for the user
 from ._datascout import dict_to_pandas
 from ._datascout import dict_to_awkward
@@ -13,11 +21,14 @@ from ._datascout import dict_to_parquet
 from ._datascout import dict_to_pickle
 from ._datascout import dict_to_json
 
+# not so interesting, but provided for convenience
+from ._datascout import json_to_pandas
+
 # coming back
 from ._datascout import pandas_to_dict
 from ._datascout import awkward_to_dict
-from ._datascout import pickle_to_dict
 from ._datascout import parquet_to_dict
+from ._datascout import pickle_to_dict
 
 # between pandas and awkward
 from ._datascout import pandas_to_awkward
@@ -27,8 +38,8 @@ from ._datascout import awkward_to_pandas
 from ._datascout import parquet_to_pandas
 from ._datascout import parquet_to_awkward
 
-# to look at pyarrow, typically not used by a user
-from ._datascout import dict_to_pyarrow
-from ._datascout import pyarrow_to_parquet
-from ._datascout import parquet_to_pyarrow
-from ._datascout import pyarrow_to_dict
+# other hidden functions that can be useful for debugging
+from ._datascout import _find_lists
+from ._datascout import _compare_data
+
+
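For orientation, the round trip this API is meant to support looks like the following sketch (the device and field names are invented here; the nested dict mirrors the pyjapcscout-like layout used throughout the package):

```python
import numpy as np
import datascout

# a pyjapcscout-like dict: device -> value/header/exception (names are illustrative)
data = {'device1': {'value': {'waveform': np.random.rand(4, 3)},  # 2D arrays are supported
                    'header': {'acqStamp': np.int64(1), 'cycleStamp': np.int64(2)},
                    'exception': ''}}

df = datascout.dict_to_pandas(data)  # one-row pandas DataFrame
back = datascout.pandas_to_dict(df)  # back to the original dict layout
datascout._compare_data(back, data)  # prints only where something does not match
```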
diff --git a/datascout/_datascout.py b/datascout/_datascout.py
index 58c83602acbcf7325a790259709aabd961718ffa..2166aa14cf0314c983f4d0463927825e0a859ee5 100644
--- a/datascout/_datascout.py
+++ b/datascout/_datascout.py
@@ -11,13 +11,14 @@ import pyarrow as pa
 import pickle
 import datetime
 import copy
+import os
 
 ######
 # Functions needed to split 2D arrays
 
-def split_2D_array(val, in_memory=False, split_to_list=False, verbose=False):
+def _split_2D_array(val, in_memory=False, split_to_list=False, verbose=False):
     '''
-    split_2D_array(val, in_memory=False, split_to_list=False, verbose=False)
+    _split_2D_array(val, in_memory=False, split_to_list=False, verbose=False)
 
     It converts numpy 2D arrays into either:
     - 1D "object" arrays containing 1D val.dtype arrays (split_to_list=False)
@@ -52,7 +53,7 @@ def split_2D_array(val, in_memory=False, split_to_list=False, verbose=False):
     else:
         return val
 
-def convert_dict_list(data, in_memory=False, split_to_list=False, verbose=False):
+def _convert_dict_list(data, in_memory=False, split_to_list=False, verbose=False):
     '''
     Parse the input data, which should be a list or a dict, and convert all 2D arrays into either
     - 1D object array of 1D arrays
@@ -71,24 +72,24 @@ def convert_dict_list(data, in_memory=False, split_to_list=False, verbose=False)
     if type(data) == list:
         for entry in data:
             if type(entry) == list or type(entry) == dict:
-                entry = convert_dict_list(entry)
+                entry = _convert_dict_list(entry)
             elif type(entry) == np.ndarray:
-                entry = split_2D_array(entry, in_memory=in_memory, split_to_list=split_to_list, verbose=verbose)
+                entry = _split_2D_array(entry, in_memory=in_memory, split_to_list=split_to_list, verbose=verbose)
     elif type(data) == dict:
         for key in data.keys():
             if type(data[key]) == list or type(data[key]) == dict:
-                data[key] = convert_dict_list(data[key])
+                data[key] = _convert_dict_list(data[key])
             elif type(data[key]) == np.ndarray:
-                data[key] = split_2D_array(data[key], in_memory=in_memory, split_to_list=split_to_list, verbose=verbose)
+                data[key] = _split_2D_array(data[key], in_memory=in_memory, split_to_list=split_to_list, verbose=verbose)
     return data
 
 ######
 # Functions needed to re-merge 1D arrays of 1D arrays into 2D arrays
 
-def merge_to_2D(val, string_as_obj=False, verbose=False):
+def _merge_to_2D(val, string_as_obj=False, verbose=False):
     '''
-    merge_to_2D(val, string_as_obj=False, verbose=False)
+    _merge_to_2D(val, string_as_obj=False, verbose=False)
 
     It converts back numpy arrays of "object" dtype into 2D arrays.
 
     By construction, if conversion actually occurs, this operation makes a copy of
@@ -113,7 +114,7 @@ def merge_to_2D(val, string_as_obj=False, verbose=False):
     else:
         return val
 
-def revert_dict_list(data, in_memory=False, string_as_obj=False, verbose=False):
+def _revert_dict_list(data, in_memory=False, string_as_obj=False, verbose=False):
     '''
     Parse the input data, which should be a list or a dict, and convert all
     1D arrays of "object" type into 2D arrays of the proper data type.
@@ -135,29 +136,29 @@ def revert_dict_list(data, in_memory=False, string_as_obj=False, verbose=False):
     if type(data) == list:
         for entry in data:
             if type(entry) == dict:
-                revert_dict_list(entry)
+                _revert_dict_list(entry)
             elif type(entry) == list or type(entry) == np.ndarray:
-                entry = merge_to_2D(entry, string_as_obj=string_as_obj, verbose=verbose)
+                entry = _merge_to_2D(entry, string_as_obj=string_as_obj, verbose=verbose)
                 if len(entry) > 0 and isinstance(entry.flatten()[0], dict):
                     for nasted_data in entry.flatten():
-                        revert_dict_list(nasted_data)
+                        _revert_dict_list(nasted_data)
     elif type(data) == dict:
         for key in data.keys():
             if type(data[key]) == dict:
-                revert_dict_list(data[key])
+                _revert_dict_list(data[key])
             elif type(data[key]) == list or type(data[key]) == np.ndarray:
-                data[key] = merge_to_2D(data[key], string_as_obj=string_as_obj, verbose=verbose)
+                data[key] = _merge_to_2D(data[key], string_as_obj=string_as_obj, verbose=verbose)
                 if len(data[key]) > 0 and isinstance(data[key].flatten()[0], dict):
                     for nasted_data in data[key].flatten():
-                        revert_dict_list(nasted_data)
+                        _revert_dict_list(nasted_data)
     return data
 
 ######
 # CORE function of this project: it allows to convert a pyarrow object into a dict
 #
-def convert_parrow_data(data, treat_str_arrays_as_str=True, use_list_for_2D_array=False):
+def _convert_parrow_data(data, treat_str_arrays_as_str=True, use_list_for_2D_array=False):
     '''
-    convert_parrow_data(data)
+    _convert_parrow_data(data)
 
     it extract data from a pyarrow object to a "standard" pyjapcscout-like dict dataset, i.e.
     a dictionary with only not null numpy objects/arrays and no lists (but if you enable use_list_for_2D_array)
@@ -174,13 +175,13 @@ def convert_parrow_data(data, treat_str_arrays_as_str=True, use_list_for_2D_arra
             # those should be value, header, exception
             for item in data[column][0].items():
                 # this can be iterated... I think
-                device_dict[item[0]] = convert_parrow_data(item[1])
+                device_dict[item[0]] = _convert_parrow_data(item[1])
             output[column] = device_dict
         return output
     if isinstance(data, pa.StructScalar):
         output_dict = dict()
         for item in data.items():
-            output_dict[item[0]] = convert_parrow_data(item[1])
+            output_dict[item[0]] = _convert_parrow_data(item[1])
         return output_dict
     elif isinstance(data, pa.ListScalar):
         if isinstance(data.type.value_type, pa.lib.ListType):
@@ -194,12 +195,12 @@ def convert_parrow_data(data, treat_str_arrays_as_str=True, use_list_for_2D_arra
             if use_list_for_2D_array:
                 auxOutput = []
                 for auxValue in data.values:
-                    auxOutput.append(convert_parrow_data(auxValue))
+                    auxOutput.append(_convert_parrow_data(auxValue))
                 return auxOutput
             else:
                 auxOutput = np.empty((len(data.values),), dtype=object)
                 for i, auxValue in enumerate(data.values):
-                    auxOutput[i] = convert_parrow_data(auxValue)
+                    auxOutput[i] = _convert_parrow_data(auxValue)
                 return auxOutput
         else:
             # could be a 1D array of some data type
@@ -222,7 +223,7 @@ def convert_parrow_data(data, treat_str_arrays_as_str=True, use_list_for_2D_arra
 ###### Some important functions not so interesting for the standard user, but fundamental
 
 def dict_to_pyarrow(input_dict):
-    my_data_dict_converted = convert_dict_list(input_dict, in_memory=False, split_to_list=False, verbose=False)
+    my_data_dict_converted = _convert_dict_list(input_dict, in_memory=False, split_to_list=False, verbose=False)
     return pa.Table.from_pandas(pd.DataFrame([my_data_dict_converted]))
 
 def pyarrow_to_parquet(input_pa, filename):
@@ -232,11 +233,10 @@ def parquet_to_pyarrow(filename):
     return pq.read_table(filename)
 
 def pyarrow_to_dict(input_pa):
-    return convert_parrow_data(input_pa)
-
-def pyarrow_to_dict(input_pa):
-    return convert_parrow_data(input_pa)
+    return _convert_parrow_data(input_pa)
 
+def pyarrow_to_pandas(input_pa):
+    return dict_to_pandas(pyarrow_to_dict(input_pa))
 
 ####### The functions interesting for the user
 
@@ -251,10 +251,16 @@ def dict_to_awkward(input_dict):
 def dict_to_parquet(input_dict, filename):
     # we could also just go to pandas, and then to parquet.
     # dict_to_pandas(input_dict).to_parquet(filename)
-    pyarrow_to_parquet(dict_to_pyarrow(input_dict), filename+'.parquet')
+    name, ext = os.path.splitext(filename)
+    if len(ext) == 0:
+        filename = filename+'.parquet'
+    pyarrow_to_parquet(dict_to_pyarrow(input_dict), filename)
 
 def dict_to_pickle(input_dict, filename):
-    with open(filename+'.pkl', 'wb') as handle:
+    name, ext = os.path.splitext(filename)
+    if len(ext) == 0:
+        filename = filename+'.pkl'
+    with open(filename, 'wb') as handle:
         pickle.dump(input_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
 
 def dict_to_json(input_dict, filename):
@@ -279,7 +285,7 @@ def awkward_to_dict(input_awkward, row_index=0):
     '''
     it converts the specified row of an awkward array into a pyjapcscout-like dict
     '''
-    return convert_parrow_data(ak.to_arrow(input_awkward)[row_index])
+    return _convert_parrow_data(ak.to_arrow(input_awkward)[row_index])
 
 def pickle_to_dict(filename):
     with open(filename, 'rb') as handle:
@@ -287,18 +293,14 @@
         load_dict = pickle.load(handle)
     return load_dict
 
 def parquet_to_dict(filename):
-    return pyarrow_to_dict(parquet_to_pyarrow)
+    return pyarrow_to_dict(parquet_to_pyarrow(filename))
 
 # between pandas and awkward
 def pandas_to_awkward(input_pandas):
-    print("TODO")
-    return
-    input_pandas = input_pandas.copy()
-    # I need to split it 2D arrays...
-    #return dict_to_awkward(pandas_to_dict(input_pandas))
+    return dict_to_awkward(pandas_to_dict(input_pandas))
 
 def awkward_to_pandas(input_awkward):
-    print("TODO")
+    return dict_to_pandas(awkward_to_dict(input_awkward))
 
 # reading from parquet to pandas without type loss
 def parquet_to_pandas(filename):
@@ -310,3 +312,55 @@ def parquet_to_pandas(filename):
 
 def parquet_to_awkward(filename):
     return ak.from_parquet(filename)
+
+
+####### Some additional functions for debugging purposes
+
+def _find_lists(data, verbose=False):
+    '''
+    Look inside data (assumed to be a dict) and tell if some fields are actually lists.
+    In theory, the `datascout` package is meant to be used only on dicts that do NOT contain any list!
+    '''
+    for key, value in data.items():
+        if verbose: print(key)
+        if isinstance(value, list):
+            print(key+" is a list!")
+        elif isinstance(value, dict):
+            _find_lists(value)
+        else:
+            if verbose: print("   ..is "+str(type(value)))
+
+
+def _compare_data(data1, data2):
+    '''
+    Compares two dictionaries or lists and shows the differences (of type or data type).
+    For a full comparison, it is sometimes best to call this function also with inverted arguments.
+    '''
+    def not_equal(a, b):
+        print(' ------ ')
+        print(str(a) + ' (' + str(type(a)) + ')')
+        print('   NOT EQUAL   ')
+        print(str(b) + ' (' + str(type(b)) + ')')
+        print(' ------ ')
+
+    if (type(data1) != type(data2)) or (hasattr(data1, '__len__') and (len(data1) != len(data2))):
+        not_equal(data1, data2)
+    elif isinstance(data1, list):
+        for i in range(len(data1)):
+            _compare_data(data1[i], data2[i])
+    elif isinstance(data1, dict):
+        _compare_data(data1.keys(), data2.keys())
+        for key in data1.keys():
+            _compare_data(data1[key], data2[key])
+    elif isinstance(data1, np.ndarray):
+        if data1.dtype != object:
+            if not np.array_equal(data1, data2):
+                not_equal(data1, data2)
+        elif data1.shape == data2.shape:
+            for i in range(data1.size):
+                _compare_data(data1.flatten()[i], data2.flatten()[i])
+        else:
+            not_equal(data1, data2)
+
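The suffix handling introduced in dict_to_parquet and dict_to_pickle only appends a default extension when the given filename has none. A minimal sketch of the intended behavior (file names are arbitrary):

```python
import numpy as np
import datascout

data = {'device1': {'value': {'x': np.float64(1.0)},
                    'header': {'acqStamp': np.int64(0), 'cycleStamp': np.int64(0)},
                    'exception': ''}}

datascout.dict_to_parquet(data, 'scan1')          # no extension given -> writes scan1.parquet
datascout.dict_to_parquet(data, 'scan1.parquet')  # explicit extension kept as-is
datascout.dict_to_pickle(data, 'scan1')           # likewise -> writes scan1.pkl

back = datascout.parquet_to_dict('scan1.parquet')
datascout._compare_data(back, data)               # silent if the round trip preserved types
```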
diff --git a/datascout/tests/test_dataconversion.py b/datascout/tests/test_dataconversion.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a97fd4bb4c811b78b59a8b62334063948344200
--- /dev/null
+++ b/datascout/tests/test_dataconversion.py
@@ -0,0 +1,123 @@
+"""
+Try different functions on an example dataset
+
+"""
+import datascout
+import numpy as np
+import pandas as pd
+import awkward as ak
+import pyarrow.parquet as pq
+import pyarrow as pa
+import pickle
+import datetime
+import copy
+import os
+
+def generate_data_dict():
+    '''
+    Simply generate a dictionary with some values that should be compatible with this package.
+    Note: only 'device1' contains random data...
+    '''
+    return {'device1': {'value': { 'property1':np.int8(np.random.rand(1)[0]*10**2),
+                                   'property2':np.int8(np.random.rand(1,43)*10**2),
+                                   'property3':np.int8(np.random.rand(10,3)*10**2),
+                                   'property4':np.int8(np.random.rand(50,1)*10**2)},
+                        'header': {'acqStamp':np.int64(np.random.rand(1)[0]*10**12),'cycleStamp':np.int64(np.random.rand(1)[0]*10**12)},
+                        'exception': ''},
+            'device2': {'value': np.array([[1, 12], [4, 5], [1, 2]], dtype=np.int16),
+                        'header': {'acqStamp':np.int64(44444),'cycleStamp':np.int64(3455445)},
+                        'exception': ''},
+            'device3': {'value': '',
+                        'header': {'acqStamp':np.int64(44444),'cycleStamp':np.int64(0)},
+                        'exception': 'Cipolla'},
+            'device4': {'value': { 'property5':'This is a string',
+                                   'property6':np.array(['my', 'list'], dtype=str), #np.str_ or object? -> np.str_!
+                                   'property7':np.array([['my'], ['list'], ['long']], dtype=str),
+                                   'property8':np.array([['my', 'list'], ['of', 'more'], ['val', 'string']], dtype=str),
+                                   },
+                        'header': {'acqStamp':np.int64(55555),'cycleStamp':np.int64(3455445)},
+                        'exception': ''},
+            'device5': {'value': { 'property9':{'JAPC_FUNCTION': {'X': np.array([1, 2, 3, 4], dtype=np.float64), 'Y':np.array([14, 2, 7, 5], dtype=np.float64)}},
+                                   'property6':np.array([{'JAPC_FUNCTION': {'X': np.array([1, 2], dtype=np.float64), 'Y':np.array([14, 2], dtype=np.float64)}}, {'JAPC_FUNCTION':{'X': np.array([3, 4], dtype=np.float64), 'Y':np.array([7, 5], dtype=np.float64)}}], dtype=object),
+                                   },
+                        'header': {'acqStamp':np.int64(4444444),'cycleStamp':np.int64(0)},
+                        'exception': ''},
+            'device6': {'value': {'JAPC_FUNCTION': {'X': np.array([1, 2, 3, 4], dtype=np.float64), 'Y':np.array([14, 2, 7, 5], dtype=np.float64)}},
+                        'header': {'acqStamp':np.int64(4455444),'cycleStamp':np.int64(0)},
+                        'exception': ''},
+            'device7': {'value': { 'property10':{'JAPC_ENUM':{'code':np.int64(2), 'string':'piero'}},
+                                   'property11':np.array([{'JAPC_ENUM':{'code':np.int64(3), 'string':'carlo'}}, {'JAPC_ENUM':{'code':np.int64(4), 'string':'micio'}}], dtype=object),
+                                   'property12':{'JAPC_ENUM_SET':{'codes':np.array([2, 8], dtype=np.int64), 'aslong':np.int64(123), 'strings':np.array(['nieva','po'], dtype=str)}}, #np.str_
+                                   'property13':np.array([{'JAPC_ENUM_SET':{'codes':np.array([7,44], dtype=np.int64), 'aslong':np.int64(123), 'strings':np.array(['nieva','po'], dtype=str)}},
+                                                          {'JAPC_ENUM_SET':{'codes':np.array([5,6], dtype=np.int64), 'aslong':np.int64(77), 'strings':np.array(['nettuno','plutone'], dtype=str)}}
+                                                          ], dtype=object),
+                                   'property14':np.array([{'JAPC_ENUM_SET':{'codes':np.array([], dtype=np.int64), 'aslong':np.int64(0), 'strings':np.array([], dtype=str)}},
+                                                          {'JAPC_ENUM_SET':{'codes':np.array([5,6], dtype=np.int64), 'aslong':np.int64(77), 'strings':np.array(['nettuno','plutone'], dtype=str)}}
+                                                          ], dtype=object)},
+                        'header': {'acqStamp':np.int64(44333444),'cycleStamp':np.int64(0)},
+                        'exception': ''},
+            'device8': {'value': {'JAPC_ENUM_SET':{'codes':np.array([2, 8], dtype=np.int64), 'aslong':np.int64(123), 'strings':np.array(['nieva','po'], dtype=str)}},
+                        'header': {'acqStamp':np.int64(4),'cycleStamp':np.int64(0)},
+                        'exception': 'no data for xxxx'},
+            'device9': {'value': {'cipolla' : np.array([], dtype=str) },
+                        'header': {'acqStamp':np.int64(4),'cycleStamp':np.int64(0)},
+                        'exception': 'no data for xxxx'}}
+
+
+def test_data_conversion():
+    # generate dataset
+    my_data_dict = generate_data_dict()
+
+    # make a reference copy of the original dict
+    my_data_dict_ref = copy.deepcopy(my_data_dict)
+
+    # go to pandas and back without altering the initial data
+    my_pandas = datascout.dict_to_pandas(my_data_dict)
+    datascout._compare_data(my_data_dict, my_data_dict_ref)
+    my_data_back = datascout.pandas_to_dict(my_pandas)
+    datascout._compare_data(my_data_back, my_data_dict_ref)
+
+    # go to pyarrow and back without altering the initial data
+    my_pyarrow = datascout.dict_to_pyarrow(my_data_dict)
+    datascout._compare_data(my_data_dict, my_data_dict_ref)
+    my_data_back = datascout.pyarrow_to_dict(my_pyarrow)
+    datascout._compare_data(my_data_back, my_data_dict_ref)
+
+    # go to awkward and back without altering the initial data
+    my_ak = datascout.dict_to_awkward(my_data_dict)
+    datascout._compare_data(my_data_dict, my_data_dict_ref)
+    my_data_back = datascout.awkward_to_dict(my_ak)
+    datascout._compare_data(my_data_back, my_data_dict_ref)
+
+    # a long chain
+    my_data_back = datascout.awkward_to_dict(datascout.pandas_to_awkward(datascout.pyarrow_to_pandas(datascout.dict_to_pyarrow(my_data_dict))))
+    datascout._compare_data(my_data_dict, my_data_dict_ref)
+    datascout._compare_data(my_data_back, my_data_dict_ref)
+
+
+def test_save_load(tmpdir):
+    # generate dataset
+    my_data_dict = generate_data_dict()
+    # make a reference copy of the original dict
+    my_data_dict_ref = copy.deepcopy(my_data_dict)
+
+    # define temporary filenames
+    temp_filename_parquet = os.path.join(str(tmpdir), 'test.parquet')
+    temp_filename_pickle = os.path.join(str(tmpdir), 'test.pkl')
+
+    # go to parquet and back
+    datascout.dict_to_parquet(my_data_dict, temp_filename_parquet)
+    datascout._compare_data(my_data_dict, my_data_dict_ref)
+    my_data_back = datascout.parquet_to_dict(temp_filename_parquet)
+    datascout._compare_data(my_data_back, my_data_dict_ref)
+
+    # go to pickle and back
+    datascout.dict_to_pickle(my_data_dict, temp_filename_pickle)
+    datascout._compare_data(my_data_dict, my_data_dict_ref)
+    my_data_back = datascout.pickle_to_dict(temp_filename_pickle)
+    datascout._compare_data(my_data_back, my_data_dict_ref)
+
+# The tests above can be run locally as:
+# from pathlib import Path
+# test_save_load(Path('.'))
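Under pytest the tmpdir fixture is injected automatically; outside of pytest the same tests can be driven by hand, as hinted by the commented lines at the end of the file. A sketch, assuming the package is installed and datascout/tests contains an __init__.py so the module is importable:

```python
from pathlib import Path

from datascout.tests.test_dataconversion import test_data_conversion, test_save_load

test_data_conversion()     # round trips through pandas, pyarrow and awkward
test_save_load(Path('.'))  # writes test.parquet and test.pkl into the current directory
```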
diff --git a/setup.py b/setup.py
index 188a1c6fd255c47043ca57e8673c0000c525512d..a3e9b857c0025affb9fc8d0b82d6f3c649fd034e 100644
--- a/setup.py
+++ b/setup.py
@@ -16,8 +16,11 @@ with (HERE / 'README.md').open('rt') as fh:
 
 REQUIREMENTS: dict = {
     'core': [
-        # 'mandatory-requirement1',
-        # 'mandatory-requirement2',
+        'numpy',
+        'pandas',
+        'pyarrow',
+        'awkward',
+        # datetime, pathlib and pickle come with the standard library and need no entry here
     ],
     'test': [
         'pytest',