
Differentiate variable types in pyg data by adding prefix to each variable name

Jay Chan requested to merge jay_differentiate_variable_type into dev

This MR introduces a new naming scheme that adds a prefix of hit_, edge_ or track_ to each variable name so that the name reflects the variable type. This is particularly important for running the pipeline on single-particle events (see #35).
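A minimal sketch of why a name prefix is more robust than inferring the type from array size. This is one reading of the motivation, not the pipeline's actual code: the helper names `kind_from_size` and `kind_from_name` and the event sizes are hypothetical.

```python
def kind_from_size(length, n_hits, n_edges, n_tracks):
    """Size-based guess: return every kind whose element count matches the array length."""
    counts = {"hit": n_hits, "edge": n_edges, "track": n_tracks}
    return [kind for kind, n in counts.items() if n == length]


def kind_from_name(name):
    """Prefix-based lookup: the variable type is read directly from the name."""
    for kind in ("hit", "edge", "track"):
        if name.startswith(kind + "_"):
            return kind
    return None  # no recognised prefix


# Illustrative small event where two counts happen to coincide (assumed numbers).
ambiguous = kind_from_size(3, n_hits=4, n_edges=3, n_tracks=3)
unambiguous = kind_from_name("track_particle_pt")
```

With coinciding counts the size-based guess returns more than one candidate, while the prefix-based lookup depends only on the variable name.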

To adopt the new variable naming scheme, users need to update their configuration yaml files. (For backward compatibility, old configuration yaml files and old pyg objects that use the old naming scheme can still be run without issue.) The required changes are:

  1. Add hit_ to all node-like variables, edge_ to all edge-like variables, and track_ to all track-like variables. Track-like variables that correspond to particle truth take the prefix track_particle_ instead of track_ (e.g. pt -> track_particle_pt).
  2. Set the flag variable_with_prefix to true. If variable_with_prefix is false (the current default), the code runs in backward-compatible mode: it automatically converts all variable names in the input pyg objects and in the config yaml files to the new naming scheme, and converts them back to the old naming scheme in the output pyg objects. If variable_with_prefix is true, no conversion is made, so users must make sure that both the configuration yaml files and the input pyg objects already use the new naming scheme.
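The renaming rule in step 1 can be sketched as a small function. This is an illustration only: `add_prefix` is a hypothetical helper, not the function used in this MR, and apart from `pt` the variable names in the usage lines are made up.

```python
def add_prefix(name, kind, particle_truth=False):
    """Return the prefixed name for a variable of kind 'hit', 'edge' or 'track'.

    Hypothetical helper: track-like particle-truth variables get
    'track_particle_' instead of 'track_'.
    """
    if kind not in ("hit", "edge", "track"):
        raise ValueError(f"unknown variable kind: {kind!r}")
    prefix = f"{kind}_"
    if kind == "track" and particle_truth:
        prefix = "track_particle_"
    # Leave names that already carry the prefix untouched, so the call is idempotent.
    if name.startswith(prefix):
        return name
    return prefix + name


new_pt = add_prefix("pt", "track", particle_truth=True)  # example from the description
new_hit = add_prefix("r", "hit")                         # illustrative variable name
new_edge = add_prefix("y", "edge")                       # illustrative variable name
```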

Some additional features make it easier to transition from the old naming scheme to the new one:

  1. The flag add_variable_name_prefix can be set to true together with variable_with_prefix set to true. In this case, the code converts the variable names in the input pyg objects. This is useful when a new configuration yaml file (with the new naming scheme) has been prepared but the input pyg files were produced with the old naming scheme. Note that with this setting the output pyg objects use the new naming scheme (variable names are not converted back).
  2. If users need to rerun the data reading stage to produce input pyg objects with the new naming scheme, only the csv to pyg step needs to be rerun, not the csv conversion step. To do this, first remove the existing pyg files, then set the flag skip_csv_conversion to true in the data reader yaml and rerun the data reading stage.
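The flag combinations described above can be summarised as a small decision table. This is a sketch of the documented behaviour, not the pipeline's code: `plan_conversion` and the dictionary keys are hypothetical, and the case of add_variable_name_prefix set to true with variable_with_prefix set to false is not covered by the description, so treating it as backward-compatible mode is an assumption.

```python
def plan_conversion(variable_with_prefix=False, add_variable_name_prefix=False):
    """Summarise which renaming steps run for a given flag combination."""
    if not variable_with_prefix:
        # Backward-compatible mode (current default): inputs and configs are
        # converted to the new names, outputs are converted back to the old names.
        # Assumption: add_variable_name_prefix is ignored in this mode.
        return {"convert_input": True, "convert_config": True, "revert_output": True}
    if add_variable_name_prefix:
        # New-style config with old-style input pyg files: only the inputs are
        # converted, and the outputs keep the new names.
        return {"convert_input": True, "convert_config": False, "revert_output": False}
    # Everything already uses the new naming scheme: no conversion at all.
    return {"convert_input": False, "convert_config": False, "revert_output": False}


default_plan = plan_conversion()
transition_plan = plan_conversion(variable_with_prefix=True, add_variable_name_prefix=True)
```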

A set of example config files is provided in examples/Variable_Name_Prefix/README.md. They are a copy of the CTD2023 example, modified to use the new variable naming scheme.

The pipeline has been fully tested with:

  • CTD2023 and metric-learning pipeline with the new naming scheme
  • CTD2023 and metric-learning pipeline with the old naming scheme (backward compatibility)
  • Example 1, 2 and 3
  • Single-muon sample

TODO LIST

  • Add hit_ prefix to all hit-like variables
  • Add track_ prefix to all track-like variables
  • Add edge_ prefix to all edge-like variables
  • Change accordingly in module map
  • Change accordingly in GNN stage
  • Change accordingly in track building stage
  • Change accordingly in metric learning stage
  • Change accordingly in filter stage
  • Modify all functions that currently use array size to determine variable types
  • Update all config files
  • Test with ttbar events
  • Test with single-particle events
  • Test with examples
  • Add example configs
  • Test backward compatibility
  • Pass pipeline
Edited by Jay Chan
