
Preprocessing improvements (summary of several issues)

The preprocessing chain was rewritten last year but still needs some improvements. This issue summarises what remains to be changed.

  • Redo the Configuration.py functions with proper data classes etc. #181 (closed)
  • Indexing jets rather than copying tracks #93 (closed)
  • Remove intermediate preprocessing stages for speedup #102 (closed)
    • for now, this boils down to merging the apply_scales and write steps
  • Making scaling calculation standalone function #104 (closed)
  • Unify preprocessing and reading in test files #203 (closed)
  • Making creation of a hybrid sample simpler #140 (closed)
  • Adding Preprocessing base class #152 (closed)
  • PDF sampling with custom function #190 (closed)
  • Save the valid track flag in a separate group in the preprocessed files #191 (closed)
  • Option to specify number of jets used to calculate PDFs #204 (closed)
  • Adding option for preprocessing specifying resampled file #205 (closed)
  • Dump at half precision but ensure preprocessing is at full precision #212
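
The last item (#212) can be illustrated with a minimal sketch. It assumes a simple standardisation as the scaling step; the function name `scale_and_dump` and the toy inputs are hypothetical and not taken from the preprocessing code. The point is only that all arithmetic runs in float64 and the cast to float16 happens once, right before writing.

```python
import numpy as np


def scale_and_dump(values: np.ndarray) -> np.ndarray:
    """Standardise in float64, then cast to float16 only for storage.

    Hypothetical helper: the real preprocessing may scale differently.
    """
    full = values.astype(np.float64)
    shift = full.mean(axis=0)
    scale = full.std(axis=0)
    # Guard against constant columns to avoid dividing by zero.
    scale = np.where(scale == 0, 1.0, scale)
    scaled = (full - shift) / scale
    # Precision is reduced only at the very end, for the dump.
    return scaled.astype(np.float16)


# Toy jet variables (assumed values, for illustration only).
rng = np.random.default_rng(seed=0)
jets = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
stored = scale_and_dump(jets)
print(stored.dtype, stored.shape)
```

Casting before computing the shift and scale would instead fold the float16 rounding error into the scale parameters themselves.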

Additional preprocessing improvements (no separate issues opened for them):

  • better handling of labels in the final train file, such that multiple labels can be accessed by name in their own subgroup
  • deprecate one-hot label encodings (prefer sparse loss computation)
  • jet flavour labels should be remapped so they are ready for training: 0, 4, 5 -> 0, 1, 2
  • option to concatenate jet and track inputs for training GN1 etc., to avoid doing this during data loading
  • use lzf compression by default for final X_tracks_train (speeds things up)
  • harmonise scale dict format for jet and track scales
  • store track label definitions as attribute in the tracks/labels dataset
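
The label remapping and the move to sparse labels listed above could look roughly like the following sketch. The names `FLAVOUR_TO_CLASS` and `remap_flavour_labels` are hypothetical; only the mapping 0, 4, 5 -> 0, 1, 2 comes from this issue.

```python
import numpy as np

# Mapping from the flavour labels to contiguous class indices, as listed
# in this issue (0, 4, 5 -> 0, 1, 2). The constant name is hypothetical.
FLAVOUR_TO_CLASS = {0: 0, 4: 1, 5: 2}


def remap_flavour_labels(labels, mapping=FLAVOUR_TO_CLASS):
    """Return sparse integer class labels that are ready for training."""
    labels = np.asarray(labels)
    unknown = np.setdiff1d(labels, list(mapping))
    if unknown.size:
        raise ValueError(f"unmapped flavour labels: {unknown.tolist()}")
    # Lookup table indexed by the raw label value; -1 marks unused slots.
    lookup = np.full(max(mapping) + 1, -1, dtype=np.int64)
    for raw, cls in mapping.items():
        lookup[raw] = cls
    return lookup[labels]


raw_labels = np.array([5, 0, 4, 4, 0, 5])
sparse_labels = remap_flavour_labels(raw_labels)
print(sparse_labels)
```

Sparse integer labels of this form can be passed directly to a sparse categorical loss, so no one-hot array has to be stored in the train file.
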
Edited by Ivan Oleksiyuk