HbbHtautauMLToolkit
ML Toolkit for the HZ(ll)a(yy) analysis
This package provides functionality to train ML models as analysis discriminants, study and optimise their performance, and export them into the CxAODFramework.
The steps below serve to reproduce the model used in the analysis baseline.
Structure of a full workflow
The full workflow implemented by this package is described in detail below. It encompasses the preprocessing of the training data, the actual training of the model, and the evaluation of its performance. Additional tools are provided to optimise various aspects of the training, e.g. the choice of hyperparameters or input variables.
Cloning the repository
git clone --recursive git@gitlab.cern.ch:shahzad/hzamltoolkit.git
Cloning the repository with https
git clone --recursive https://gitlab.cern.ch/shahzad/hzamltoolkit.git
Preprocessing the training inputs
The training uses MVATrainingTrees produced with the CxAODReader.
These raw inputs must be preprocessed so that they can be used to train the discriminant. The type of preprocessing required depends on the model class considered.
For a BDT implemented using TMVA, the preprocessing adjusts the range of each input variable so that it contains 99% of all signal events.
(This discourages the BDT from placing cuts in signal-depleted regions.)
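As a rough illustration, the clipping range for one variable could be derived from signal quantiles as sketched below. This is a minimal numpy sketch, assuming the central 99% of the signal distribution is kept; it is not the toolkit's actual implementation:

import numpy as np

# Minimal sketch of quantile-based clipping (not the toolkit's code).
# Assumes the central 99% of the signal distribution is kept, i.e. 0.5%
# is trimmed on each side.
def signal_range_99(signal_values):
    lo, hi = np.percentile(signal_values, [0.5, 99.5])
    return lo, hi

signal_mbb = np.random.normal(125.0, 20.0, size=10000)  # toy signal variable
lo, hi = signal_range_99(signal_mbb)
clipped = np.clip(signal_mbb, lo, hi)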
A json configuration file serves to steer the framework. Among other parameters, it defines the path to the training inputs used for a particular campaign.
(This configuration file also steers the training; see below for more information.)
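The exact schema is defined by the package; the sketch below only illustrates the kind of content such a file might hold (all keys and values are hypothetical):

import json

# Hypothetical sketch of a training configuration file; the real schema
# is defined by the toolkit and these keys are illustrative only.
config = {
    "input_dir": "/path/to/MVATrainingTrees/",
    "region": "ggF_low_mHH",
    "model_type": "TMVA_BDT",
    "input_variables": ["mBB", "mTauTau", "mHH", "dRBB"],
    "hyperparameters": {"NTrees": 200, "MaxDepth": 4, "Shrinkage": 0.1},
}

with open("training_config.json", "w") as f:
    json.dump(config, f, indent=2)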
source setup_worker.sh
python preprocess_inputs.py --config /path/to/json/file --outdir /path/to/preprocessed/inputs --plots
The --plots flag switches on additional diagnostic plots that validate the training inputs.
Training the baseline discriminant
The type of ML model used, its input variables, and the hyperparameters are all centrally configured in the json file.
It also specifies the analysis region in which the training should be performed. A separate file is used for each region.
Configuration files for the ggF-low-mHH, ggF-high-mHH, and VBF analysis categories are available in configs/.
To run the training (automatically performed with 3-fold mixed cross-validation, "mixedCV"), it is sufficient to run
source setup_worker.sh
python RunTrainingCampaign.py --config /path/to/json/file --outdir /path/to/training/results
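For orientation, the idea behind the 3-fold cross-training is sketched below. This is a generic sketch, not this package's implementation; in particular, the fold assignment via the event number is an assumption:

# Generic sketch of 3-fold cross-training (not the toolkit's code).
# Each event is assigned to a fold, e.g. from its event number; model i is
# trained on the other two folds and applied to fold i, so every event is
# scored by a model that never saw it during training.
K = 3

def fold_of(event_number):
    return event_number % K

def training_folds(i):
    return [f for f in range(K) if f != i]

def model_for_evaluation(event_number, models):
    return models[fold_of(event_number)]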
Once the training is complete, diagnostic plots (discriminant distributions, ROC curves, ...) are automatically generated and placed in the output directory.
Weight files describing the trained model parameters are also available there.
For a TMVA BDT, these are xml files that may be evaluated in the CxAODReader.
To get started with the training itself, it may be worthwhile to experiment locally with the train.py script.
Performance evaluation
To evaluate the performance of trained models, the script evaluate.py can be used.
source setup_worker.sh
python evaluate.py --trainingdir /path/to/trained/model/ --config /path/to/trainings/config.json
Combined training + evaluation on a batch system
If you have a larger number of config files, it is useful to offload the training and evaluation steps to a batch system. In this case, run
sh install_venv.sh
source setup_local.sh
python RunTrainingCampaign.py --outdir /path/to/output/directory/ --config /path/to/trainings/config.json
(Note: setting up the virtual environment in the first command only needs to happen once after the installation of the package.)
Exporting the trained model
To evaluate the TMVA BDT, the xml files with the weights need to be placed in CxAODReader_bbtautau/source/CxAODReader_HH_bbtautau/data/.
For example, for the ggF BDT in the low-mHH region, the files can be extracted and renamed using
python TMVAModel2Reader.py --model_dir /path/to/training/dir/ --model_name ggF_0_350mHH --outdir /path/to/CxAODReader_bbtautau/source/CxAODReader_HH_bbtautau/data/
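Independently of the reader, the exported xml weight files can also be evaluated standalone with TMVA, e.g. from PyROOT. This is a minimal sketch; the variable names and the weight-file path are placeholders and must match those used in the training:

import ROOT
from array import array

# Minimal PyROOT sketch of evaluating an exported TMVA BDT weight file.
# Variable names and order must match the training configuration;
# the names used here are placeholders.
reader = ROOT.TMVA.Reader("!Color:Silent")
mBB = array("f", [0.0])
mTauTau = array("f", [0.0])
reader.AddVariable("mBB", mBB)
reader.AddVariable("mTauTau", mTauTau)
reader.BookMVA("BDT", "ggF_0_350mHH.weights.xml")

mBB[0], mTauTau[0] = 110.0, 95.0  # fill with one event's values
score = reader.EvaluateMVA("BDT")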
Hyperparameter optimisation
sh install_venv.sh
source setup_local.sh
python RunHyperparameterOptimisationCampaign.py --outdir /path/to/output/directory/ --config_template /path/to/config/template.json --hpar_opt_config /path/to/optimisation/config.json
The template configuration file contains the definitions of all training-related variables. The values of the hyperparameters that are being optimised are automatically replaced (in this sense the file acts as a "template").
The last argument contains the path to a configuration file steering the hyperparameter optimisation. Some examples are available in configs/template/hyper_parameter_opt.
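As a rough illustration, such a file might define the scan points for each hyperparameter. All keys below are hypothetical; consult the shipped examples for the actual schema:

import json

# Hypothetical sketch of a hyperparameter-optimisation configuration;
# the actual schema is defined by the examples in
# configs/template/hyper_parameter_opt.
hpar_opt = {
    "NTrees": [100, 200, 400, 800],
    "MaxDepth": [3, 4, 5],
    "Shrinkage": [0.05, 0.1, 0.2],
}

with open("hpar_opt_config.json", "w") as f:
    json.dump(hpar_opt, f, indent=2)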
The argument --driver slurm can be used to submit jobs to a Slurm batch system (the default driver is condor).
The hyperparameter optimisation can be visualised with MakeHyperparameterOptimisationPlots.py. A ranked list (including symlinks) of the best trainings can be created with ExtractRankedRuns.
Input parameter optimisation
sh install_venv.sh
source setup_local.sh
python RunInputParameterOptimisationCampaign.py --outdir /path/to/output/directory/ --config_template /path/to/config/template.json --input_opt_config /path/to/optimisation/config.json
The last argument contains the path to a configuration file steering the input parameter optimisation. Some examples are available in configs/template/input_var_opt.
The argument --driver slurm can be used to submit jobs to a Slurm batch system (the default driver is condor).
Visualisations can be generated with MakeInputVariableOptimisationPlots.py, which is run as
python MakeInputVariableOptimisationPlots.py --outdir /path/to/plot/output/directory --input_opt_config /path/to/optimisation/config.json --rundir /path/to/optimisation/campaign/
The figure of merit shown on the plots is the same as that used for the optimisation. It can be set in the configuration file; see input_opt_configs/ggF_inclusive.json for an example.
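The package does not fix a single figure of merit here; one common choice for such optimisations is a binned Asimov significance, sketched below purely as an illustration:

import numpy as np

# Sketch of a binned Asimov significance, Z = sqrt(sum_i z_i^2) with
# z_i^2 = 2*((s_i + b_i)*ln(1 + s_i/b_i) - s_i). This is a common choice,
# not necessarily the figure of merit used by this package.
def binned_significance(s, b):
    s, b = np.asarray(s, float), np.asarray(b, float)
    mask = b > 0  # skip empty background bins
    z2 = 2.0 * ((s[mask] + b[mask]) * np.log1p(s[mask] / b[mask]) - s[mask])
    return float(np.sqrt(z2.sum()))

print(binned_significance([5.0, 3.0], [20.0, 4.0]))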
BDT diagnostics plots from XML files
Several plots can be generated from the XML files to better understand the structure of the trained BDTs. To generate them, run
cd diagnostics
python doBDTDiagnostics.py --traindir /path/to/trainings/directory --outdir /output/directory
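For a quick look without the full script, the TMVA weight files can also be inspected directly. The sketch below assumes the usual TMVA xml layout (BinaryTree elements with Node splits, leaves marked by IVar="-1") and is not part of this package:

import xml.etree.ElementTree as ET
from collections import Counter

# Sketch of inspecting a TMVA BDT weight file directly (assumes the usual
# TMVA xml layout; not part of this package). Counts the trees and how
# often each input-variable index appears as a split.
root = ET.parse("ggF_0_350mHH.weights.xml").getroot()
trees = root.findall(".//BinaryTree")
print("number of trees:", len(trees))

split_counts = Counter(
    node.get("IVar")
    for node in root.iter("Node")
    if node.get("IVar") not in (None, "-1")  # IVar="-1" marks leaf nodes
)
print("splits per variable index:", dict(split_counts))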