
Improve preprocessing

Anthony Correia requested to merge anthonyc/preprocessing into main

Fix bugs:

Update:

Improvements

  • Preprocessing:
    • Split the pre-processing and the processing (7f548663)
    • The preprocessing now outputs only 2 files per event: hits_particles and particles (they can be filtered and combined as before) (43bf4f8e, fba1d670)
      • These two samples can be used by the evaluation
      • These two samples are used in the processing step to build the files used in the metric-learning
    • Only open and save columns that are used in the next steps (24bebbca, dfdd643b, ...)
    • Don't compute or save unnecessary columns: px, py and pz (c89b8d6f)
    • Add plane-wise edges (already implemented in another branch) (6644c96c, 63c15258)
    • Add Mean Squared Error (MSE) between a particle trajectory and a line (already implemented in another branch) (0ff4bc26)
    • Add utilities to compute weights to balance dataset (53ea0264)
    • Split datasets into train and val from the processing step, to guarantee that the exact same split is used all the way to the final evaluation (c9940046, a454c268, ..., b0982458)
  • Processing
    • Remove selections from the processing step: all selections are now done in the pre-processing step, so that it is clear which selections have been applied (6b55fe7e, 1ca487b9)
    • Rename pid into particle_id in the processing step (4f3785d2)
  • Configuration
  • Evaluation
    • Montetracko will use the files produced by the pre-processing (b0982458)
    • Separate the evaluation into train / dev / test samples (b0982458)
    • A fixed test sample will be used: the same for every sample, benchmarked using Allen
    • Find a way of splitting the train and dev samples or know the splitting before processing (9769accf)
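
To illustrate what "knowing the splitting before processing" could look like, here is a minimal sketch of a deterministic split keyed on the event ID. It is not the implementation in this merge request: the function name `assign_split`, the `val_fraction` parameter and the `event_...` ID format are all hypothetical. The idea is only that hashing the event ID gives the same train / val assignment at every stage (pre-processing, processing, evaluation) without having to store a separate list.

```python
import hashlib


def assign_split(event_id: str, val_fraction: float = 0.2) -> str:
    """Return "train" or "val" for an event, depending only on its ID.

    Hypothetical helper: the name, signature and default fraction are
    assumptions made for this example.
    """
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return "val" if bucket < val_fraction * 10_000 else "train"


# Hypothetical event IDs; the real naming scheme may differ.
event_ids = [f"event_{i:06d}" for i in range(1000)]
splits = {event_id: assign_split(event_id) for event_id in event_ids}
n_val = sum(1 for split in splits.values() if split == "val")
```

Because the assignment depends only on the event ID, any step can recompute it independently and obtain the same answer. Changing `val_fraction` only moves events across the boundary, it does not reshuffle the whole sample.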

Move to a later merge request:

  • In preprocessing, if and only if this turns out to be useful
    • Add partial multi-processing to speed up the pre-processing
    • Add an example of pre-processing that can be used to augment data with electrons in a consistent and efficient way
    • Be able to loop through events located in different parquet files (already partially implemented in another branch)
  • Evaluate, after the metric learning, the best result that can be obtained (already partially implemented in another branch)
Edited by Anthony Correia
