Multiple Tracks datasets in preprocessing stage (!285) · Merge requests · atlas-flavor-tagging-tools / algorithms / Umami

Stefano Franchellucci requested to merge sfranche/umami:track-selection-prep into master Dec 01, 2021

This MR implements the option to store more than one tracks dataset in the preprocessed samples, which could save processing time and disk space. It required reworking parts of the preprocessing, training and evaluation stages.

Preprocessing

The config option tracks_name is renamed: tracks_name => tracks_names. It can now be either a string or a list, but it is treated as a list throughout the preprocessing chain. Every step that uses tracks now loops over all the tracks collections listed in tracks_names.
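The string-or-list handling described above can be sketched as follows. This is a minimal illustration, not the code from this MR: the helper name normalise_tracks_names and the collection names are hypothetical.

```python
from typing import List, Union


def normalise_tracks_names(tracks_names: Union[str, List[str]]) -> List[str]:
    """Return tracks_names as a list, whether it was given as a string or a list.

    Hypothetical helper, not part of the Umami code base.
    """
    if isinstance(tracks_names, str):
        return [tracks_names]
    return list(tracks_names)


# Hypothetical collection names, for illustration only.
for tracks_name in normalise_tracks_names(["tracks", "tracks_loose"]):
    print(f"processing tracks collection: {tracks_name}")
```

Normalising once at the point where the config is read means every later step can loop over the list without checking the type again.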

At the scaling step, the scale_dict now has one key for each tracks collection. The tracks input variable lists in the .yaml file are now read under a per-collection name: track_train_variables => {tracks_name}_train_variables
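As a rough sketch of the per-collection lookup and the resulting scale_dict layout: the dictionary contents, the flat variable lists, the "shift"/"scale" fields and both helper names below are assumptions made for illustration, and the real structures in the code may differ.

```python
# Hypothetical content of a variable config after loading the .yaml file.
# Flat lists of variable names are an assumption made for this sketch.
var_config = {
    "tracks_train_variables": ["d0", "z0SinTheta"],
    "tracks_loose_train_variables": ["d0", "z0SinTheta", "dr"],
}


def get_track_variables(config, tracks_name):
    """Look up the variable list using the {tracks_name}_train_variables key."""
    return config[f"{tracks_name}_train_variables"]


def placeholder_scale_dict(config, tracks_names):
    """Build a dict with one top-level key per tracks collection.

    The shift/scale values are placeholders, not computed scaling parameters.
    """
    return {
        name: {
            var: {"shift": 0.0, "scale": 1.0}
            for var in get_track_variables(config, name)
        }
        for name in tracks_names
    }


scale_dict = placeholder_scale_dict(var_config, ["tracks", "tracks_loose"])
```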

When tracks are used, the final .h5 file now has one additional dataset per tracks collection, and the naming changes: X_trk_train => X_{tracks_name}_train
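The naming pattern can be illustrated with a small sketch. The function names and the collection names are hypothetical; only the X_{tracks_name}_train pattern comes from the description above.

```python
def track_dataset_name(tracks_name):
    """Return the dataset name for one tracks collection: X_{tracks_name}_train."""
    return f"X_{tracks_name}_train"


def track_dataset_names(tracks_names):
    """Return one dataset name per tracks collection, in the given order."""
    return [track_dataset_name(name) for name in tracks_names]


# Hypothetical collection names, for illustration only.
print(track_dataset_names(["tracks", "tracks_loose"]))
```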

Training and Evaluation

Most of the changes follow from the naming updates:
X_trk_train => X_{tracks_name}_train and track_train_variables => {tracks_name}_train_variables.

An additional option, tracks_name, is added to the training config to select the tracks dataset to use for training/evaluation.
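A minimal sketch of how such an option could pick the matching dataset and variable list. It assumes tracks_name is a single string in the training config; the config dictionary, the list of available datasets and the function name are all hypothetical.

```python
def select_track_inputs(train_config, available_datasets):
    """Return the dataset name and variable-list key for the configured tracks_name.

    Raises KeyError if the preprocessed file does not contain the dataset.
    """
    tracks_name = train_config["tracks_name"]
    dataset = f"X_{tracks_name}_train"
    if dataset not in available_datasets:
        raise KeyError(f"Dataset {dataset} not found in the preprocessed file")
    return dataset, f"{tracks_name}_train_variables"


# Hypothetical training config and file content, for illustration only.
train_config = {"tracks_name": "tracks_loose"}
available = ["X_train", "Y_train", "X_tracks_train", "X_tracks_loose_train"]
dataset, variables_key = select_track_inputs(train_config, available)
```

Checking for the dataset up front gives a clear error when the training config names a collection that was not stored during preprocessing.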

Edited Dec 17, 2021 by Stefano Franchellucci
