Preprocessing improvements (summary of several issues)
The preprocessing chain was rewritten last year, but still needs some improvements. This issue summarises what needs to be adapted.
- Redo the Configuration.py functions with proper data classes etc. #181 (closed)
- Indexing jets rather than copying tracks #93 (closed)
- Remove intermediate preprocessing stages for speedup #102 (closed) - for now this boils down to merging the `apply_scales` and `write` steps
- Making scaling calculation a standalone function #104 (closed)
- Unify preprocessing and reading in test files #203 (closed)
- Making creation of a hybrid sample simpler #140 (closed)
- Adding Preprocessing base class #152 (closed)
- PDF sampling with custom function #190 (closed)
- Save the valid track flag in a separate group in the preprocessed files #191 (closed)
- Option to specify number of jets used to calculate PDFs #204 (closed)
- Adding option for preprocessing specifying resampled file #205 (closed)
- Dump at half precision but ensure preprocessing is at full precision #212
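The last point (#212) can be illustrated with a small numpy sketch (toy data, not the actual preprocessing code): the scaling statistics and the scaling itself are computed at full precision, and only the final training array is cast down to half precision.

```python
import numpy as np

rng = np.random.default_rng(0)
jets = rng.normal(50.0, 10.0, size=(100_000, 2)).astype(np.float32)

# scaling is computed at full precision ...
mean = jets.mean(axis=0, dtype=np.float64)
std = jets.std(axis=0, dtype=np.float64)
scaled = (jets - mean) / std

# ... and only the final array is dumped at half precision
dumped = scaled.astype(np.float16)

# float16 keeps roughly 3 decimal digits, which is fine for
# normalised inputs, while halving the size on disk
assert np.allclose(dumped, scaled, atol=1e-2)
```

Doing it the other way round (casting to float16 before computing mean/std) would accumulate rounding error in the statistics themselves.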
Additional preprocessing improvements (no separate issues open for them):
- better handling of labels in the final train file (such that multiple labels can be accessed by name in their own subgroup)
- deprecate one-hot label encodings (prefer sparse loss computation)
- jet flavour label should be ready for training: 0,4,5 -> 0,1,2
- option to concatenate jet and track inputs for training GN1 etc. to avoid doing this during data loading
- use lzf compression by default for the final X_tracks_train (speeds things up)
- harmonise the scale dict format for jet and track scales
- store track label definitions as an attribute of the tracks/labels dataset
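The flavour-label and one-hot points above can be sketched in numpy (the 0 = light, 4 = c, 5 = b convention and the dummy predictions are assumptions for illustration):

```python
import numpy as np

# assumed convention: raw flavour labels 0 (light), 4 (c), 5 (b)
raw_labels = np.array([0, 4, 5, 5, 0, 4])

# map 0,4,5 -> 0,1,2 so the stored labels are training-ready class indices
labels = np.searchsorted(np.array([0, 4, 5]), raw_labels)

# sparse loss: index the predicted probabilities with the integer labels
# directly instead of materialising a one-hot matrix first
probs = np.full((len(labels), 3), 1.0 / 3.0)  # dummy uniform predictions
loss = -np.log(probs[np.arange(len(labels)), labels]).mean()  # = log(3)
```

With contiguous integer labels in the file, the one-hot encoding (and the memory it costs) never needs to exist at all.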
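The jet/track concatenation point could look roughly like this (all shapes and feature counts are made up; the real arrays would come from the preprocessed file):

```python
import numpy as np

n_jets, n_trk, n_jet_feat, n_trk_feat = 4, 10, 2, 3
jets = np.zeros((n_jets, n_jet_feat), dtype=np.float32)
tracks = np.zeros((n_jets, n_trk, n_trk_feat), dtype=np.float32)

# broadcast each jet's features onto its tracks and concatenate along
# the feature axis, so the data loader reads one ready-made array
jet_per_track = np.broadcast_to(jets[:, None, :], (n_jets, n_trk, n_jet_feat))
combined = np.concatenate([tracks, jet_per_track], axis=-1)

assert combined.shape == (n_jets, n_trk, n_trk_feat + n_jet_feat)
```

Doing this once at preprocessing time trades a little disk space for not repeating the broadcast in every training epoch.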
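The lzf-compression and label-definition points, sketched with h5py (the dataset names follow the issue text, but the label names and shapes are placeholders):

```python
import json
import os
import tempfile

import h5py
import numpy as np

tracks = np.zeros((100, 40, 21), dtype=np.float16)       # toy track inputs
track_labels = np.zeros((100, 40, 2), dtype=np.int8)     # toy track labels

path = os.path.join(tempfile.mkdtemp(), "train.h5")
with h5py.File(path, "w") as f:
    # lzf: fast, moderate-ratio compression that ships with h5py
    f.create_dataset("X_tracks_train", data=tracks, compression="lzf")
    ds = f.create_dataset("tracks/labels", data=track_labels)
    # self-describing file: record which label lives in which column
    # (names here are placeholders, not the real label variables)
    ds.attrs["label_names"] = json.dumps(["label_a", "label_b"])

with h5py.File(path, "r") as f:
    compression = f["X_tracks_train"].compression
    names = json.loads(f["tracks/labels"].attrs["label_names"])
```

Storing the column meaning as an attribute on the `tracks/labels` dataset means downstream code no longer has to hard-code the label ordering.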