New features for the training and evaluation pipeline

Anthony Correia requested to merge anthonyc/training_update into main

This MR introduces the following changes:

General

  • Fix a few typos and type hints
  • Add a few notebooks in analyses to analyse the data

Processing

  • Allow a variable to be loaded from the dataframe under a different name, to avoid name clashes in the PyTorch Dataset object (as happens for x)
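
The renaming described above can be sketched as follows. This is a minimal illustration, not the code from this MR: a plain dict of lists stands in for the dataframe, and `load_columns`, the `(source_name, new_name)` pair convention and the name `hit_x` are all hypothetical.

```python
def load_columns(table, columns):
    """Pick columns from `table`, optionally renaming them.

    Each entry of `columns` is either a plain column name or a
    (source_name, new_name) pair. Hypothetical convention, for illustration.
    """
    loaded = {}
    for entry in columns:
        # A plain string keeps its name; a pair renames the column.
        source, target = (entry, entry) if isinstance(entry, str) else entry
        if target in loaded:
            raise ValueError(f"duplicate target name: {target!r}")
        loaded[target] = list(table[source])
    return loaded


# A dict of lists stands in for the hits dataframe.
hits = {"x": [1.0, 2.0], "y": [0.5, 0.7], "plane": [0, 1]}

# Load the `x` column as `hit_x`, so that it cannot clash with an
# attribute called `x` on the dataset object.
features = load_columns(hits, [("x", "hit_x"), "y", "plane"])
```

After this call, `features` holds the keys `hit_x`, `y` and `plane`, and no key named `x`.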

Models

  • Introduce unidirectional graphs: if bidir is set to False in the configuration file, a unidirectional graph is used (see #6 (closed))
  • Allow events to be filtered and altered at the inference stage of the embedding (and GNN). This introduces the building and filtering arguments in the pipeline configuration; each refers to the name of a function defined in Embedding/building_custom.py. Each argument can be None, a single string or a list of strings. filtering is only applied to the train and val samples.
  • Refactor how the dataframes are loaded in a model to improve consistency
  • Allow triplets to be built from the doublet graph (see #5 (closed))
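
To illustrate how the building and filtering arguments could be interpreted, here is a minimal sketch. It assumes that building functions run before filtering functions and that each function takes an event and returns an event. The registry, the two example functions and `prepare_event` are hypothetical stand-ins for whatever is defined in Embedding/building_custom.py.

```python
def as_name_list(value):
    """Normalise None, a single string or a list of strings into a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Hypothetical custom functions; an event is a list of hit dicts here.
def add_hit_index(event):
    return [dict(hit, index=i) for i, hit in enumerate(event)]


def drop_noise(event):
    # Assumption for this sketch: particle_id 0 marks a noise hit.
    return [hit for hit in event if hit["particle_id"] != 0]


# Hypothetical lookup table from configuration names to functions.
CUSTOM_FUNCTIONS = {"add_hit_index": add_hit_index, "drop_noise": drop_noise}


def apply_named(event, names, registry):
    """Apply the functions named in `names`, in order."""
    for name in as_name_list(names):
        event = registry[name](event)
    return event


def prepare_event(event, split, building=None, filtering=None):
    """Apply building to every split, filtering to train and val only."""
    event = apply_named(event, building, CUSTOM_FUNCTIONS)
    if split in ("train", "val"):
        event = apply_named(event, filtering, CUSTOM_FUNCTIONS)
    return event


event = [{"particle_id": 0, "plane": 0}, {"particle_id": 7, "plane": 1}]
train_event = prepare_event(event, "train", building="add_hit_index", filtering=["drop_noise"])
test_event = prepare_event(event, "test", building="add_hit_index", filtering=["drop_noise"])
```

With this input, the noise hit is removed from `train_event` but kept in `test_event`, while both events receive the `index` field added by the building step.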

Evaluation

  • Replace some Bokeh functions with matplotlib functions. Save the figures as PNG (for presentations) and PDF (for reports and papers)
  • Upgrade montetracko: mainly type hint fixes
  • Vary the radius (for the embedding) and score_cut (for the GNN), and compare the results in terms of efficiency, clone rate and hit efficiency after matching
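
The idea of scanning a cut and comparing metrics can be sketched as below. This is only an illustration of the scan itself: it uses edge-level efficiency and purity as simpler stand-ins for the track-level metrics listed above (efficiency, clone rate and hit efficiency after matching), and it does not use montetracko. The function name and the toy scores are hypothetical.

```python
def scan_score_cut(scores, is_true, cuts):
    """For each cut, keep edges with score >= cut and compute simple metrics.

    efficiency = kept true edges / all true edges
    purity     = kept true edges / all kept edges
    """
    n_true = sum(is_true)
    rows = []
    for cut in cuts:
        kept = [truth for score, truth in zip(scores, is_true) if score >= cut]
        n_kept_true = sum(kept)
        rows.append(
            {
                "score_cut": cut,
                "efficiency": n_kept_true / n_true if n_true else 0.0,
                "purity": n_kept_true / len(kept) if kept else 0.0,
            }
        )
    return rows


# Toy edge scores and truth labels, for illustration only.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_true = [True, True, False, True, False]
table = scan_score_cut(scores, is_true, cuts=[0.2, 0.5])
```

On this toy input, raising the cut from 0.2 to 0.5 lowers the efficiency and raises the purity, which is the kind of trade-off such a scan is meant to expose.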

GNN (see #4 (closed))

  • Better control of the number of layers in the Interaction GNN: we can now control the number of layers of every MLP used in the GNN
  • Introduce a loss for triplets with a penalty term
  • Refactor GNNBase: don't repeat the code shared by training and validation
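
As an illustration of per-MLP control over the number of layers, the sketch below derives the linear-layer sizes of each MLP from its own layer count. The configuration keys, the choice of MLPs and the dimensions are hypothetical and are not the names used in this MR; a real implementation would build the PyTorch modules from these sizes.

```python
def mlp_layer_dims(n_in, n_hidden, n_out, n_layers):
    """Return the (input, output) size of each linear layer of an MLP."""
    if n_layers < 1:
        raise ValueError("an MLP needs at least one layer")
    dims = [n_in] + [n_hidden] * (n_layers - 1) + [n_out]
    return list(zip(dims[:-1], dims[1:]))


# Hypothetical configuration: one layer count per MLP of the GNN.
config = {
    "hidden": 64,
    "n_node_encoder_layers": 2,
    "n_edge_network_layers": 4,
}

hidden = config["hidden"]

# Node encoder: 3 input features per hit (assumed), 2 layers.
node_encoder_dims = mlp_layer_dims(3, hidden, hidden, config["n_node_encoder_layers"])

# Edge network: concatenation of 3 hidden vectors as input (assumed), 4 layers.
edge_network_dims = mlp_layer_dims(3 * hidden, hidden, hidden, config["n_edge_network_layers"])
```

Here `node_encoder_dims` describes two linear layers and `edge_network_dims` describes four, so each MLP depth can be tuned independently.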

The new pipeline configurations are now defined in the pipeline_configs folder. I always leave the full_pipeline.ipynb notebook empty (all outputs cleared). Instead, I work in a copy (e.g., full_pipeline-focal-loss-pid-fixed.ipynb) and only change full_pipeline.ipynb when I need to introduce changes for everyone.

Edited by Fotis Giasemis
