Improve preprocessing
Fix bugs:
Update:
- Update libraries to the latest version (and PyTorch to >=2) (06f9a14d, c6726100, 1e70774b, f6ae9298, e880aeca, e061054d, 5839a6db)
- Update the Docker image
Improvements
- Preprocessing:
  - Split the pre-processing and the processing (7f548663)
  - Only 2 files per event come out of the preprocessing: `hits_particles` and `particles` (can be filtered and combined as before) (43bf4f8e, fba1d670)
    - These two samples can be used by the evaluation
    - These two samples are used in the processing step to build the files used in the metric learning
  - Only open and save columns that are used in the next steps (24bebbca, dfdd643b, ...)
  - Don't compute or save unnecessary columns: `px`, `py` and `pz` (c89b8d6f)
  - Add plane-wise edges (already implemented in another branch) (6644c96c, 63c15258)
  - Add Mean Squared Error (MSE) between a particle trajectory and a line (already implemented in another branch) (0ff4bc26)
  - Add utilities to compute weights to balance the dataset (53ea0264)
  - Split datasets into train and val at the processing step, to make sure that exactly the same split is used all the way to the final evaluation (c9940046, a454c268, ..., b0982458)
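As an illustration of the MSE between a particle trajectory and a line, here is a minimal sketch. It is not the code from 0ff4bc26. It assumes the line is parametrised by `z` (two independent least-squares fits, `x(z)` and `y(z)`) and that the error is the squared residual in `x` and `y` at each hit's `z`, not the perpendicular distance. The names `fit_line` and `line_mse` are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(z: Sequence[float], u: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of u = slope * z + intercept (hypothetical helper)."""
    n = len(z)
    if n < 2:
        raise ValueError("need at least two hits to fit a line")
    z_mean = sum(z) / n
    u_mean = sum(u) / n
    var_z = sum((zi - z_mean) ** 2 for zi in z)
    if var_z == 0.0:
        raise ValueError("all hits share the same z; the slope is undefined")
    cov_zu = sum((zi - z_mean) * (ui - u_mean) for zi, ui in zip(z, u))
    slope = cov_zu / var_z
    return slope, u_mean - slope * z_mean


def line_mse(
    x: Sequence[float], y: Sequence[float], z: Sequence[float]
) -> float:
    """Mean squared (x, y) residual between the hits and the fitted line.

    Assumption: the trajectory is compared to a line parametrised by z.
    """
    slope_x, intercept_x = fit_line(z, x)
    slope_y, intercept_y = fit_line(z, y)
    squared_residuals = [
        (xi - (slope_x * zi + intercept_x)) ** 2
        + (yi - (slope_y * zi + intercept_y)) ** 2
        for xi, yi, zi in zip(x, y, z)
    ]
    return sum(squared_residuals) / len(squared_residuals)
```

A perfectly straight trajectory gives an MSE of (numerically) zero, and the value grows as the hits bend away from the fitted line.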
- Processing
- Configuration
  - Be able to control the features in the configuration file (dfdd643b)
  - Remove the distinction between cell features and space features for now (45e3cc92, 4fd1a3c9, 1d667389)
  - Write every output to the same output directory: avoid saving pictures or repositories in a common folder (759f4e3d, 4471297c, bedc028d, 74f5824d)
- Evaluation
  - Montetracko will use the files produced by the pre-processing (b0982458)
  - Separate the evaluation into train / dev / test samples (b0982458)
  - A fixed test sample will be used: the same for every sample, benchmarked using Allen
  - Find a way to split the train and dev samples, or to know the split before processing (9769accf)
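One possible way to know the train / val split before processing, and to keep it identical up to the final evaluation, is to derive it deterministically from the event identifier. The sketch below is only an illustration, not the mechanism used in this merge request. It assumes each event carries an integer ID, and the names `assign_split`, `val_fraction` and `salt` are hypothetical.

```python
import hashlib


def assign_split(
    event_id: int, val_fraction: float = 0.1, salt: str = "split-v1"
) -> str:
    """Return "train" or "val" for an event, reproducibly.

    Hypothetical helper: the assignment depends only on the event ID and
    the salt, so every step (processing, training, evaluation) can recompute
    the same split without sharing a list of events.
    """
    digest = hashlib.sha256(f"{salt}:{event_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    uniform = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if uniform < val_fraction else "train"
```

Because the assignment does not depend on the order or number of events read, adding or filtering events does not move the remaining ones between samples. Changing `salt` produces a different but equally reproducible split.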
Move to a later merge request:
- In preprocessing, if and only if this proves useful
- Add partial multi-processing to speed up the pre-processing
- Add an example of pre-processing that can be used to augment data with electrons in a consistent and efficient way
- Be able to loop through events located in different parquet files (already partially implemented in another branch)
- Evaluation after the metric learning of the best result that can be obtained (already partially implemented in another branch)
Edited by Anthony Correia