
Draft: Add Support for the GNN

Samuel Van Stroud requested to merge svanstro/umami:svanstro/gnn-support into master

This is a bit of a work in progress. So far I have done the following:

  • Add a gnn flag to the preprocessing config yaml (see the config sketch after this list). If enabled, track labels are written and gzip compression is turned off for the final h5 output files.
  • Extend the concept of "spectator variables" to cover both jet spectators and track spectators. The specified variables are included in the output files.
  • Add a new resampling method, "None", which performs no resampling. This can be used to create test samples that have the same format as the training files. I think it is good to avoid having two different formats for train and test samples (as we do currently), since otherwise you need two different functions for feeding data to the models at train vs test time, which introduces potential for bugs.
  • Add support for the creation of ttbar-only or Z'-only datasets. This is useful for creating the separate ttbar/Z' test datasets, but also for e.g. high-pT trainings.
  • Add some high-level shell scripts to run all the relevant preprocessing commands (these could be reimplemented as Python scripts if there is interest).
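
To make the first few points concrete, here is a minimal sketch of what the new config options might look like. The key and variable names are illustrative assumptions, not the final schema:

```yaml
# Hypothetical sketch only: key names and variables are illustrative,
# not the final schema.
preparation:
  gnn: true                 # write track labels, disable gzip on final h5 outputs
sampling:
  method: None              # new no-op resampling method for test samples
spectators:
  jets: [pt, eta]           # jet spectator variables copied to the outputs
  tracks: [d0, z0SinTheta]  # track spectator variables copied to the outputs
```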

I would also suggest the following change to the way the train/val/test split is handled (not currently implemented). At the moment we use mod 2 to separate train events from val/test events, which means we lose half of the training statistics. The val/test samples are also extracted halfway through the preprocessing pipeline, after the merging stage, so they end up in a different format to the training samples. As mentioned above, this is not ideal because we then need two sets of functions to load data into the model.

Instead, we could split the input files randomly into three groups (train/val/test) based on user-configured proportions. There would be one config file for each group, all with very similar structure but different values for njets, and with the new "None" sampling method for the test samples (see the sketch below). The umami preprocessing pipeline would then be run independently on each of the three groups of files, and we would get output files in the same format for all three.

The user would also have more control over the details of the val and test samples. For example, they could create a validation sample that more closely resembles the training sample, which is useful for monitoring trainings. For detailed overtraining studies, it is likewise useful to have test samples with the same kinematic distributions as the training files.
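As a sketch of how the three per-split configs could differ, assuming hypothetical method names and njets values, with everything else identical between the files:

```yaml
# preprocessing_train.yaml (hypothetical values)
sampling:
  method: count       # usual resampling for training
  options:
    njets: 25e6

# preprocessing_val.yaml (hypothetical values)
sampling:
  method: count       # resampled like training, useful to monitor trainings
  options:
    njets: 4e6

# preprocessing_test.yaml (hypothetical values)
sampling:
  method: None        # proposed no-op method: same format, no resampling
  options:
    njets: 4e6
```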

@mguth @alfroch
