
HbbHtautauMLToolkit

ML Toolkit for the HZ(ll)a(yy) analysis

This package provides functionality to train ML models as analysis discriminants, study and optimise their performance, and export them into the CxAODFramework. The steps below serve to reproduce the model used in the analysis baseline.

Structure of a full workflow

The full workflow implemented by this package is described in detail below. It encompasses the preprocessing of the training data, the actual training of the model, and the evaluation of its performance. Additional tools are provided to optimise various aspects of the training, e.g. the choice of hyperparameters or input variables.

Cloning the repository

git clone --recursive git@gitlab.cern.ch:shahzad/hzamltoolkit.git

Cloning the repository with https

git clone --recursive https://gitlab.cern.ch/shahzad/hzamltoolkit.git

Preprocessing the training inputs

The training uses MVATrainingTrees produced with the CxAODReader. These raw inputs must be preprocessed before they can be used to train the discriminant. The type of preprocessing required depends on the model class considered. For a BDT implemented using TMVA, the preprocessing adjusts the range of each input variable so that it contains 99% of all signal events. (This discourages the BDT from placing cuts in signal-depleted regions.)
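The range adjustment described above can be sketched as a simple quantile clipping. This is an illustrative reconstruction, not the toolkit's actual preprocessing code; the exact quantile convention used by the package is an assumption here.

```python
def clipped_range(values, coverage=0.99):
    """Return (lo, hi) spanning the central `coverage` fraction of `values`.

    Sketch of the range adjustment described above: the range of each input
    variable is shrunk so that it contains 99% of all signal events,
    discouraging the BDT from cutting in signal-depleted tails.
    """
    tail = (1.0 - coverage) / 2.0
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[max(0, int(tail * n))]
    hi = ordered[min(n - 1, int((1.0 - tail) * n))]
    return lo, hi


def clip(value, lo, hi):
    # Values outside the adjusted range are mapped onto its boundary.
    return min(max(value, lo), hi)
```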

A JSON configuration file steers the framework. Among other parameters, it defines the path to the training inputs used for a particular campaign. (This configuration file also steers the training; see below for more information.)
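As an illustration only, such a configuration file could be generated from Python as sketched below. All field names and values here are assumptions, not the toolkit's actual schema; the files in configs/ define the real layout.

```python
import json

# Hypothetical sketch of a campaign configuration; the real field names
# are defined by the examples in configs/, not by this snippet.
config = {
    "input_dir": "/path/to/MVATrainingTrees",   # training inputs for this campaign
    "region": "ggF_low_mHH",                    # analysis region for the training
    "model": "TMVA_BDT",                        # model class to train
    "input_variables": ["var1", "var2", "var3"],
    "hyperparameters": {"NTrees": 400, "MaxDepth": 3},
}

with open("training_config.json", "w") as f:
    json.dump(config, f, indent=2)
```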

source setup_worker.sh
python preprocess_inputs.py --config /path/to/json/file --outdir /path/to/preprocessed/inputs --plots

The --plots flag switches on additional diagnostic plots used to validate the training inputs.

Training the baseline discriminant

The type of ML model used, its input variables, and the hyperparameters are all centrally configured in the json file. It also specifies the analysis region in which the training should be performed. A separate file is used for each region. Configuration files for the ggF-low-mHH, ggF-high-mHH, and VBF analysis categories are available in configs/.

To run the training (automatically using 3-fold mixed cross-validation, "mixedCV"), it is sufficient to run

source setup_worker.sh
python RunTrainingCampaign.py --config /path/to/json/file --outdir /path/to/training/results
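The fold assignment in a 3-fold cross-validation can be sketched as below. A common scheme in HEP analyses is to assign folds by event number modulo the number of folds; whether the toolkit's "mixedCV" follows exactly this recipe is an assumption.

```python
def fold_assignments(event_numbers, n_folds=3):
    """Assign each event to a test fold by event number modulo n_folds.

    The model trained on folds != k is later evaluated on fold k, so every
    event receives a score from a model that never saw it in training.
    """
    return [n % n_folds for n in event_numbers]


folds = fold_assignments(range(10))
# events 0, 3, 6, 9 -> fold 0; events 1, 4, 7 -> fold 1; events 2, 5, 8 -> fold 2
```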

Once the training is complete, diagnostic plots (distributions of discriminant, ROC, ...) are automatically generated and also placed in the output directory. Weight files describing the trained model parameters are also available there. For a TMVA BDT, these are xml files that may be evaluated in the CxAODReader.
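One of the diagnostic quantities mentioned above, the area under the ROC curve, can be computed by hand as in the following sketch. This is an illustration of the metric, not the toolkit's plotting code.

```python
def roc_auc(scores_sig, scores_bkg):
    """Probability that a random signal event scores above a random
    background event (ties counted as one half) -- equivalent to the
    area under the ROC curve that the diagnostic plots summarise.
    """
    wins = 0.0
    for s in scores_sig:
        for b in scores_bkg:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(scores_sig) * len(scores_bkg))
```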

To get started with the training itself, it may be worth experimenting locally with the train.py script.

Performance evaluation

To evaluate the performance of trained models, the script evaluate.py can be used.

source setup_worker.sh
python evaluate.py --trainingdir /path/to/trained/model/ --config /path/to/trainings/config.json

Combined training + evaluation on a batch system

If you have a large number of config files, it is useful to offload the training and evaluation steps to a batch system. In this case, run

sh install_venv.sh
source setup_local.sh
python RunTrainingCampaign.py --outdir /path/to/output/directory/ --config /path/to/trainings/config.json

(Note: setting up the virtual environment in the first command only needs to happen once after the installation of the package.)

Exporting the trained model

To evaluate the TMVA BDT, the xml files with the weights need to be placed in CxAODReader_bbtautau/source/CxAODReader_HH_bbtautau/data/. For example, for the ggF BDT in the low-mHH region, the files can be extracted and renamed using

python TMVAModel2Reader.py --model_dir /path/to/training/dir/ --model_name ggF_0_350mHH --outdir /path/to/CxAODReader_bbtautau/source/CxAODReader_HH_bbtautau/data/

Hyperparameter optimisation

sh install_venv.sh
source setup_local.sh
python RunHyperparameterOptimisationCampaign.py --outdir /path/to/output/directory/ --config_template /path/to/config/template.json --hpar_opt_config /path/to/optimisation/config.json

The template configuration file contains the definitions of all training-related variables. The values of the hyperparameters that are being optimised are automatically replaced (in this sense the file acts as a "template").
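The templating idea can be sketched as follows: each optimisation point overwrites the hyperparameter values in a copy of the template. Treating the hyperparameters as a dedicated block that gets updated is an assumed convention; the toolkit's actual template mechanism may differ.

```python
import json


def instantiate_template(template, point):
    """Return a copy of the template config with the hyperparameters of
    one optimisation point substituted in. The 'hyperparameters' block
    is an assumed convention, not the toolkit's documented schema.
    """
    config = json.loads(json.dumps(template))  # deep copy via JSON round-trip
    config.setdefault("hyperparameters", {}).update(point)
    return config


template = {"model": "TMVA_BDT", "hyperparameters": {"NTrees": 400, "MaxDepth": 3}}
trial = instantiate_template(template, {"MaxDepth": 5})
```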

The last argument contains the path to a configuration file steering the hyperparameter optimisation. Some examples are available in configs/template/hyper_parameter_opt. One can use the argument "--driver slurm" to submit jobs to a Slurm batch system (the default driver is "condor"). The hyperparameter optimisation can be visualised with MakeHyperparameterOptimisationPlots.py. A ranked list (incl. symlinks) of the best trainings can be created with ExtractRankedRuns.

Input parameter optimisation

sh install_venv.sh
source setup_local.sh
python RunInputParameterOptimisationCampaign.py --outdir /path/to/output/directory/ --config_template /path/to/config/template.json --input_opt_config /path/to/optimisation/config.json

The last argument contains the path to a configuration file steering the input parameter optimisation. Some examples are available in configs/template/input_var_opt. One can use the argument "--driver slurm" to submit jobs to a Slurm batch system (the default driver is "condor"). Visualisations can be generated with MakeInputVariableOptimisationPlots.py, which is run as

python MakeInputVariableOptimisationPlots.py --outdir /path/to/plot/output/directory --input_opt_config /path/to/optimisation/config.json --rundir /path/to/optimisation/campaign/

The figure of merit shown on the plots is the same as that used for the optimisation. It can be set in the configuration file; see input_opt_configs/ggF_inclusive.json for an example.
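A common choice of figure of merit in such optimisations is the asymptotic median discovery significance; whether this toolkit uses it is an assumption, since the actual figure of merit is set in the configuration file. A minimal sketch:

```python
import math


def asimov_significance(s, b):
    """Median discovery significance for s expected signal events over
    b expected background events (asymptotic formula). Offered only as
    an example of a possible figure of merit; the one actually used by
    the optimisation is defined in the configuration file.
    """
    if s == 0.0:
        return 0.0
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))
```

For s much smaller than b, this reduces to the familiar s / sqrt(b) approximation.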

BDT diagnostics plots from XML files

Several plots can be generated to better understand the structure of the trained BDTs from the XML files. To generate them, run

cd diagnostics
python doBDTDiagnostics.py --traindir /path/to/trainings/directory --outdir /output/directory