Preprocessing improvements (summary of several issues)

The preprocessing chain was rewritten last year but still needs some improvements. This issue summarises what needs to be adapted.

Redo the Configuration.py functions with proper data classes etc. #181 (closed)
Indexing jets rather than copying tracks #93 (closed)
Remove intermediate preprocessing stages for speedup #102 (closed)
- for now, this boils down to merging the apply_scales and write steps
Making scaling calculation standalone function #104 (closed)
Unify preprocessing and reading in test files #203 (closed)
Making creation of a hybrid sample simpler #140 (closed)
Adding Preprocessing base class #152 (closed)
PDF sampling with custom function #190 (closed)
Save the valid track flag in a separate group in the preprocessed files #191 (closed)
Option to specify number of jets used to calculate PDFs #204 (closed)
Adding option for preprocessing specifying resampled file #205 (closed)
Dump at half precision but ensure preprocessing is at full precision #212
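
For the last item (#212), a minimal sketch of the idea, assuming scaling is a simple shift-and-divide; the helper name `scale_then_cast` and its parameters are hypothetical and not taken from the existing code. The arithmetic runs in float64 and only the array that gets written is cast to float16:

```python
import numpy as np


def scale_then_cast(values, shift, scale, out_dtype=np.float16):
    # Hypothetical helper: do all arithmetic at full precision (float64) ...
    x = np.asarray(values, dtype=np.float64)
    scaled = (x - shift) / scale
    # ... and reduce precision only for the array that is dumped to file.
    return scaled.astype(out_dtype)


# Raw values beyond the float16 maximum (65504) would overflow if cast first.
raw = np.array([100000.0, 100001.0])
stored = scale_then_cast(raw, shift=100000.5, scale=0.5)
```

Casting before scaling would lose these inputs entirely, which is why the order of operations matters here.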

Additional preprocessing improvements (no separate issues open for them):

better handling of labels in the final train file (such that multiple labels can be accessed by name in their own subgroup)
deprecate one hot label encodings (prefer sparse loss computation)
jet flavour labels should be ready for training, i.e. remapped 0, 4, 5 -> 0, 1, 2
option to concatenate jet and track inputs for training GN1 etc., to avoid doing this during data loading
use lzf compression by default for final X_tracks_train (speeds things up)
harmonise scale dict format for jet and track scales
store track label definitions as attribute in the tracks/labels dataset
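
As an illustration of the flavour-label item above, a possible sketch of the 0, 4, 5 -> 0, 1, 2 remapping with NumPy. The names `FLAVOUR_TO_CLASS` and `remap_flavour_labels` are assumptions made for this example; the resulting integer indices are the form a sparse loss would consume, so no one-hot encoding is needed.

```python
import numpy as np

# Hypothetical mapping from stored flavour labels to contiguous class indices.
FLAVOUR_TO_CLASS = {0: 0, 4: 1, 5: 2}


def remap_flavour_labels(labels):
    """Map raw flavour labels (0, 4, 5) to sparse class indices (0, 1, 2)."""
    labels = np.asarray(labels)
    raw = np.array(sorted(FLAVOUR_TO_CLASS), dtype=labels.dtype)
    classes = np.array([FLAVOUR_TO_CLASS[k] for k in raw.tolist()], dtype=np.int64)
    # Position of each label in the sorted list of known raw labels.
    idx = np.clip(np.searchsorted(raw, labels), 0, len(raw) - 1)
    # Reject labels that are not part of the mapping instead of guessing.
    if not np.array_equal(raw[idx], labels):
        raise ValueError("unexpected flavour label")
    return classes[idx]


sparse_labels = remap_flavour_labels(np.array([0, 4, 5, 5, 0]))
```

Raising on unknown labels is a design choice for this sketch; a real implementation might instead drop or separately flag such jets.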

Edited Jan 17, 2024 by Ivan Oleksiyuk
