New features for the training and evaluation pipeline
Here is what is introduced by this MR:
General
- Fix a few typos and type hints
- Define a few notebooks in `analyses` to analyse the data
Processing
- Be able to load a variable from the dataframe while changing its name to avoid clashes in the PyTorch Dataset object (for `x`)
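The renaming above can be pictured with a minimal sketch. The `load_variable` helper and the `x_df` key are illustrative, not the pipeline's actual API; the point is only that a dataframe column can be stored in the Dataset under a different key than its dataframe name:

```python
from typing import Optional, Tuple

import numpy as np
import pandas as pd


def load_variable(
    df: pd.DataFrame, column: str, rename: Optional[str] = None
) -> Tuple[str, np.ndarray]:
    """Load one column from the dataframe, optionally under a new name,
    so it does not clash with keys already used by the Dataset (e.g. `x`,
    which conventionally holds the node features in PyTorch Geometric)."""
    key = rename if rename is not None else column
    return key, df[column].to_numpy()


df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
# Load the dataframe column `x` under the hypothetical key `x_df`,
# leaving the Dataset attribute `x` free for the feature tensor.
key, values = load_variable(df, "x", rename="x_df")
```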
Models
- Introduce unidirectional graph: if `bidir` is set to `False` in the configuration file, a unidirectional graph is used (see #6 (closed))
- Be able to filter and alter events at the inference stage of embedding (and GNN). Introduce the `building` and `filtering` arguments in the pipeline configuration, which refer to the name of a function defined in `Embedding/building_custom.py`. These arguments can be `None`, one string or a list of strings. `filtering` is only applied to the train and val samples.
- Refactor how the dataframes are loaded in a model to improve consistency
- Be able to build triplets from doublet graph (see #5 (closed))
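One way to picture how a `building`/`filtering` value that may be `None`, one string, or a list of strings is turned into functions applied in order. The registry and function names below are hypothetical stand-ins for the functions defined in `Embedding/building_custom.py`:

```python
from typing import Callable, List, Optional, Union

# Illustrative registry standing in for Embedding/building_custom.py;
# the step names here are made up for the sketch.
CUSTOM_STEPS = {
    "drop_low_pt": lambda event: event,
    "merge_duplicates": lambda event: event,
}


def resolve_steps(spec: Optional[Union[str, List[str]]]) -> List[Callable]:
    """Normalise a config value that may be None, one function name,
    or a list of function names into the corresponding list of callables,
    to be applied to each event in order."""
    if spec is None:
        return []
    if isinstance(spec, str):
        spec = [spec]
    return [CUSTOM_STEPS[name] for name in spec]
```

With this convention, `filtering: null`, `filtering: drop_low_pt`, and a YAML list of names all resolve uniformly, and the caller can simply loop over the returned list.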
Evaluation
- Replace some Bokeh functions with matplotlib functions. Save the figures as PNG (for presentations) and PDF (for reports and papers)
- Upgrade montetracko: mainly fix type hints
- Vary the `radius` (for embedding) and `score_cut` (for GNN) and compare the results in terms of efficiency, clone rate and hit efficiency after matching for the GNN (see #4 (closed))
- Better control of the number of layers in the Interaction GNN: we can now control the number of layers of every MLP used in the GNN
- Introduce a loss for triplets with a penalty term
- Refactor `GNNBase`: don't repeat the code used in training and validation
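The `score_cut` scan described above can be sketched as follows. The `efficiency_at_cut` helper, the toy scores, and the cut values are all illustrative and not the pipeline's actual evaluation code; the real comparison also covers clone rate and hit efficiency after matching:

```python
import numpy as np


def efficiency_at_cut(scores: np.ndarray, truth: np.ndarray, score_cut: float) -> float:
    """Fraction of true edges kept when edges scoring below score_cut are removed."""
    kept = scores >= score_cut
    return float((kept & truth).sum() / truth.sum())


# Toy data: true edges tend to score high, fake edges low (seeded for reproducibility).
rng = np.random.default_rng(0)
truth = rng.random(1000) < 0.5
scores = np.where(truth, rng.beta(5, 2, 1000), rng.beta(2, 5, 1000))

# Vary the GNN score cut and record the efficiency at each value,
# mirroring the radius scan done for the embedding stage.
for score_cut in [0.3, 0.5, 0.7]:
    eff = efficiency_at_cut(scores, truth, score_cut)
    print(f"score_cut={score_cut:.1f}  efficiency={eff:.3f}")
```

Tightening the cut can only shrink the set of kept edges, so efficiency is non-increasing in `score_cut`; the interesting trade-off is against the fake rate, which moves the other way.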
Now, the new pipeline configurations are defined in the `pipeline_configs` folder.
I always leave the `full_pipeline.ipynb` notebook empty (all the outputs cleared). Instead, I copy it (e.g., `full_pipeline-focal-loss-pid-fixed.ipynb`) and only change `full_pipeline.ipynb` if I need to introduce changes for everyone.
Edited by Fotis Giasemis