Improve preprocessing

Anthony Correia requested to merge anthonyc/preprocessing into main May 04, 2023

Fix bugs:

Don't suppress all warnings in full_pipeline.ipynb (cc9e4a71, b0aa27fa)

Update:

Update libraries to their latest versions (and PyTorch to >=2) (06f9a14d, c6726100, 1e70774b, f6ae9298, e880aeca, e061054d, 5839a6db)
Update Docker image

Improvements:

Preprocessing
- Split the pre-processing and the processing (7f548663)
- The preprocessing outputs only 2 files per event: hits_particles and particles (they can be filtered and combined as before) (43bf4f8e, fba1d670)
  - These two samples can be used by the evaluation
  - These two samples are used in the processing step to build the files used in the metric-learning
- Only open and save columns that are used in the next steps (24bebbca, dfdd643b, ...)
- Don't compute or save unnecessary columns: px, py and pz (c89b8d6f)
- Add plane-wise edges (already implemented in another branch) (6644c96c, 63c15258)
- Add Mean Squared Error (MSE) between a particle trajectory and a line (already implemented in another branch) (0ff4bc26)
- Add utilities to compute weights to balance dataset (53ea0264)
- Split datasets into train and val from the processing step onwards, so that the exact same split is used all the way to the final evaluation (c9940046, a454c268, ..., b0982458)
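
As an illustration of the MSE item above, here is a minimal sketch of one way to measure how far a particle's hits deviate from a straight line. It assumes the line is the total-least-squares fit through the hits (first principal direction) and that the distance is perpendicular to that line; the actual definition used in 0ff4bc26 may differ. The function name `line_fit_mse` and the `(n_hits, 3)` input layout are hypothetical.

```python
import numpy as np


def line_fit_mse(hits: np.ndarray) -> float:
    """Mean squared perpendicular distance between hits and their best-fit line.

    `hits` is an (n_hits, 3) array of x, y, z positions for a single particle.
    Assumption: the reference line is the total-least-squares fit, which may
    not be the definition used in the merge request.
    """
    hits = np.asarray(hits, dtype=float)
    if hits.shape[0] < 2:
        # A single hit (or none) always lies on a line.
        return 0.0
    centered = hits - hits.mean(axis=0)
    # The first right-singular vector is the direction of the best-fit line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Remove the component along the line; what remains is the residual.
    along = centered @ direction
    residuals = centered - np.outer(along, direction)
    return float(np.mean(np.sum(residuals**2, axis=1)))
```

A value close to zero indicates a nearly straight trajectory, so a threshold on this quantity could be used as a selection in the pre-processing.
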
Processing
- Remove selection in processing: all selections will be done in the pre-processing step so that we know what we're doing (6b55fe7e, 1ca487b9)
- Rename pid to particle_id in the processing step (4f3785d2)
Configuration
- Allow the features to be controlled from the configuration file (dfdd643b)
- Remove distinction between cell features and space features for now (45e3cc92, 4fd1a3c9, 1d667389)
- Save every output in the same output directory: avoid saving pictures or repository in a common folder (759f4e3d, 4471297c, bedc028d, 74f5824d)
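
To make the "features controlled from the configuration file" item concrete, here is a minimal sketch of how a feature list read from a configuration could drive the construction of the input matrix. The `features` key, the function name `build_feature_matrix` and the dictionary-of-columns input are all assumptions for illustration; the real configuration layout in this repository is not shown in this description.

```python
from typing import Mapping, Sequence

import numpy as np


def build_feature_matrix(
    hits: Mapping[str, Sequence[float]], config: Mapping[str, Sequence[str]]
) -> np.ndarray:
    """Stack the columns named in config["features"] into an (n_hits, n_features) array.

    Both the "features" key and the column names are hypothetical.
    """
    names = list(config["features"])
    missing = [name for name in names if name not in hits]
    if missing:
        raise KeyError(f"Features missing from the input columns: {missing}")
    # Column order follows the configuration, not the input mapping.
    return np.column_stack([np.asarray(hits[name], dtype=float) for name in names])
```

With a single feature list like this, there is no need to distinguish cell features from space features in the code: changing the inputs of the model only requires editing the configuration.
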
Evaluation
- Montetracko will use the files produced by the pre-processing (b0982458)
- Separate the evaluation into train / dev / test samples (b0982458)
- A fixed test sample will be used: the same for every sample, benchmarked using Allen
- Find a way of splitting the train and dev samples or know the splitting before processing (9769accf)
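
One possible way to "know the splitting before processing" is to derive it deterministically from the event identifier, so that every step (processing, training, evaluation) recovers the same assignment without sharing state. The sketch below is an assumption, not the approach taken in 9769accf; `assign_split`, `dev_fraction` and `salt` are hypothetical names.

```python
import hashlib


def assign_split(event_id: int, dev_fraction: float = 0.2, salt: str = "split-v1") -> str:
    """Return "dev" or "train" for an event, based only on its identifier.

    The hash maps each event to a reproducible number in [0, 1); events below
    `dev_fraction` go to the dev sample. Changing `salt` yields a new split.
    """
    digest = hashlib.sha256(f"{salt}:{event_id}".encode()).digest()
    # Use the first 8 bytes of the digest as a uniform number in [0, 1).
    uniform = int.from_bytes(digest[:8], "big") / 2**64
    return "dev" if uniform < dev_fraction else "train"
```

Because the assignment depends only on the event identifier and the salt, adding or removing events does not move the remaining events between samples. The fixed test sample would be kept separate from this scheme.
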

Move to a later merge request:

In preprocessing, if and only if this can be useful
- Add partial multi-processing to speed up the pre-processing
- Add an example of pre-processing that can be used to augment data with electrons in a consistent and efficient way
- Be able to loop through events located in different parquet files (already partially implemented in another branch)
Evaluate, after the metric learning, the best result that can be obtained (already partially implemented in another branch)

Edited May 22, 2023 by Anthony Correia
