Commit 483f7e80 authored by Alexander Froch, committed by Joschka Birk

Merge master in protected branch

parent ac3b748b
......@@ -27,5 +27,4 @@ python_install/
Preprocessing-parameters-*.yaml
# ignoring preprocessing integration test folders
preprocessing_*/
test_train_*/
test_*_model*/
\ No newline at end of file
test_*_model*/
......@@ -189,6 +189,7 @@ The different options are briefly explained here:
| `test_files` | Dict | Optional | Here you can define different test samples that are used in the [`evaluate_model.py`](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/umami/evaluate_model.py). Those test samples need to be defined in a dict structure shown in the example. The name of the dict entry is relevant and is the unique identifier in the results file which is produced by the [`evaluate_model.py`](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/umami/evaluate_model.py). `Path` gives the path to the file. For test samples, all samples from the training-dataset-dumper can be used without preprocessing although the preprocessing of Umami produces test samples to ensure orthogonality of the jets with respect to the train sample. |
| `var_dict` | String | Necessary | Path to the variable dict used in the `preprocess_config` to produce the train sample. |
| `exclude` | List | Necessary | List of variables that are excluded from training. Only compatible with DL1r training. To include all, just give an empty list. |
|`tracks_name`| String| Necessary* | Name of the tracks dataset to use for training and evaluation, default is "tracks". <br />* ***This option is necessary when using tracks, but when working with old preprocessed files (from before January 2022) this option has to be removed from the config file to ensure compatibility.*** |
| `NN_structure` | None | Necessary | A dict in which all important information for the training is defined. |
| `tagger` | String | Necessary | Name of the tagger that is used/to be trained. |
| `lr` | Float | Necessary | Learning rate which is used for training. |
......
......@@ -241,3 +241,151 @@ There is a [global configuration](https://gitlab.cern.ch/atlas-flavor-tagging-to
| `TFDebugLevel` | Defines the debug level of TensorFlow. It takes integer values [0,1,2,3], where 0 prints all messages. |
## Using Visual Studio Code
The editor Visual Studio Code (VSCode) provides very helpful options for developing Umami. VSCode can also run a Singularity image
with Umami and therefore has all needed dependencies (Python interpreter, packages, etc.) at hand. A short explanation of how to set this up
is given here.
### Using a Singularity Image on a Remote Machine with Visual Studio Code
To use a Singularity image on a remote machine in VSCode (for the Python interpreter etc.), we need to set up some configuration files and install some
VSCode extensions. The needed extensions are:
| Extension | Mandatory | Explanation |
|-----------|-----------|-------------|
| [Remote - SSH](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) | Yes | The Remote - SSH extension lets you use any remote machine with a SSH server as your development environment. |
| [Remote - Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) | Yes | The Remote - Containers extension lets you use a singularity container as a full-featured development environment. |
| [Remote - WSL](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-wsl) | Yes (On Windows) | The Remote - WSL extension lets you use VS Code on Windows to build Linux applications that run on the Windows Subsystem for Linux (WSL). |
| [Remote Development](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack) | Yes | The Remote Development extension pack allows you to open any folder in a container, on a remote machine, or in the Windows Subsystem for Linux (WSL) and take advantage of VS Code's full feature set. |
Now, to make everything work, you need to prepare two files. The first is your SSH config (found in `~/.ssh/config`). This file
must have permissions such that only you can read/write it (`chmod 600`). It can look like this, for example:
```bash
Host login_node
HostName <Login_Hostname>
User <Login_Username>
IdentityFile <path>/<to>/<private>/<key>
Host working_node tf2~working_node
HostName <working_node_hostname>
User <Username>
ProxyJump login_node
```
The first entry is, for example, the login node of your cluster. The second is the working node; the login node is used as a jump host (bridge). The
second entry also has two names, one of which is prefixed with `tf2~`. This prefix is *important* for the following part, so please add it here.
After adapting the config file, you need to tell VSCode where to find it. This is set in the `settings.json` of VSCode. You can find/open it in
VSCode by pressing `Ctrl + Shift + P` and typing `settings`; you will find the option `Preferences: Open Settings (JSON)`. Selecting this opens
the JSON config file of VSCode. There you need to add the following lines, with the path to your SSH config file filled in (if the config is in the default path `~/.ssh/config`, you don't need to add the `remote.SSH.configFile` entry).
```json
"remote.SSH.configFile": "<path>/<to>/<ssh_config>",
"remote.SSH.remoteServerListenOnSocket": false,
"remote.SSH.enableRemoteCommand": true,
```
The second option disables the `ListenOnSocket` function, which blocks running the Singularity images in some cases. The third option
enables the remote command needed for Singularity, which is blocked when `ListenOnSocket` is `True`. Note: If this gives you errors, you need to switch
to the pre-release version of [Remote - SSH](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh). Just click on the extension in the extensions tab and click `Switch to Pre-Release` at the top.
Next, you need to create an executable script, let's call it `singularity-ssh` here, which tells VSCode what to do when connecting. This file is the same
for Linux/Mac but looks a bit different for Windows. After creating the file, you need to make it executable (`chmod +x <file>`) and also add it
to the VSCode settings with:
```json
"remote.SSH.path": "<path>/<to>/<executable>",
```
Now restart VSCode and open the Remote Explorer tab. At the top, switch to `SSH Targets`, right-click on the `tf2~` connection and click
`Connect to Host in Current Window`. VSCode will now install a VSCode server on your SSH target and will ask you to install your
extensions there, which improves the performance of VSCode. It will also ask you which path to open. After that, you can open
a Python file; the Python extension will start and should show the currently used Python interpreter at the bottom of VSCode.
If you click on the errors and warnings next to it, the console opens, where you can switch between Problems, Output, Debug Console, Terminal
and Ports. The Terminal tab should show a fresh terminal with the Singularity image running. If not, check the Output tab and switch on the right from Tasks to
Remote - SSH to see the output of the SSH connection.
#### Singularity-SSH Linux/Mac
```bash
#!/bin/bash
# Get last command line argument, should be hostname/alias
for trghost; do true; done
# Parse host-aliases of form "venvname~hostname"
imagename=`echo "${trghost}" | sed 's/^\(\(.*\)~\)\?.*$/\2/'`
# Note: VS Code will override "-t" option with "-T".
if [[ "${imagename}" =~ tf2 ]]; then
    exec ssh -t "$@" "source /etc/profile && module load tools/singularity/3.8 && singularity shell --nv --contain -B /work -B /home -B /tmp docker://gitlab-registry.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/umamibase-plus:latest-gpu"
else
    exec ssh "$@"
fi
```
If somehow this is not working, you can try to extract the hostname directly with this:
```bash
#!/bin/sh
# Get last command line argument, should be hostname/alias
for trghost
do
    if [ "${trghost}" = "tf2~working_node" ]; then
        image="${trghost}"
    fi
done
# Parse host-aliases of form "venvname~hostname"
imagename=`echo "${image}" | sed 's/^\(\(.*\)~\)\?.*$/\2/'`
# Note: VS Code will override "-t" option with "-T".
if [ "${imagename}" = "tf2" ]; then
    exec ssh -t "$@" "source /etc/profile && module load tools/singularity/3.8 && singularity shell --nv --contain -B /work -B /home -B /tmp docker://gitlab-registry.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/umamibase-plus:latest-gpu"
else
    exec ssh "$@"
fi
```
#### Singularity-SSH Windows
This file needs to have the file ending `.cmd`!
```bat
@echo off
if NOT %1==-V (
    for /F "tokens=1,3 delims=~" %%a in ("%~4") do (
        if %%a==tf2 (
            ssh.exe -t %2 %3 %4 "source /etc/profile && module load tools/singularity/3.8 && singularity shell --nv --contain -B /work -B /home -B /tmp docker://gitlab-registry.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/umamibase-plus:latest-gpu"
        ) else if %%a==tf1 (
            echo "connect with another image"
        ) else (
            ssh.exe %*
        )
    )
) else (
    ssh.exe -V
)
```
### Useful Extensions
| Extension | Mandatory | Explanation |
|-----------|-----------|-------------|
| [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) | Yes | A Visual Studio Code extension with rich support for the Python language (for all actively supported versions of the language: >=3.6), including features such as IntelliSense (Pylance), linting, debugging, code navigation, code formatting, refactoring, variable explorer, test explorer, and more! |
| [Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) | Yes (Will be installed with Python extension) | Pylance is an extension that works alongside Python in Visual Studio Code to provide performant language support. Under the hood, Pylance is powered by Pyright, Microsoft's static type checking tool. Using Pyright, Pylance has the ability to supercharge your Python IntelliSense experience with rich type information, helping you write better code faster. |
| [Python Docstring Generator](https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring) | No | Automatically creates a new docstring with all arguments, their types and their default values (if defined in the function head). You just need to fill the descriptions. |
To make full use of VSCode, you can add the following lines to your `settings.json` of VSCode:
```json
"python.formatting.provider": "black",
"editor.formatOnSave": true,
"autoDocstring.docstringFormat": "numpy",
```
The first entry sets the automated Python formatter to use. As in Umami, you can set this to `black` to have your code auto-formatted. The second
entry enables auto-format on save, so every time you save, `black` will format your code (style-wise). The third entry sets the docstring style used by the
[Python Docstring Generator](https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring). Just press `Ctrl + Shift + 2` (on Linux) below
a function header and the generator will create a fresh docstring with all arguments, their types and their default values (if defined in the function head) in the `numpy` docstring style (which is used in Umami).
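For illustration, this is roughly the kind of `numpy`-style docstring the generator produces once the descriptions are filled in (the function and its arguments are made up for this example):
```python
def scale_jet_pt(jet_pt, scale_factor=1.0):
    """
    Scale the transverse momentum of a jet.

    Parameters
    ----------
    jet_pt : float
        Transverse momentum of the jet in MeV.
    scale_factor : float, optional
        Multiplicative scale factor, by default 1.0

    Returns
    -------
    float
        Scaled transverse momentum.
    """
    return jet_pt * scale_factor
```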
......@@ -10,15 +10,34 @@ A detailed list with the available derivations can be found in the [FTAG algorith
## Release 22
### Release 22 Samples with Muons and high Statistics
### Release 22 Samples with Lepton, Hadron and Soft Muon Info (p4931)
All information for the GNN is added. Both track selections (default, loose) are added under the names `tracks` and `tracks_loose`.
| Sample | h5 ntuples | DAOD_PHYSVAL derivations| AOD | TDD hash |
| ------------- | ---------------- | ----------------------- | ---------------- | -------- |
| MC20d - ttbar | user.alfroch.410470.btagTraining.e6337_s3681_r13144_p4856.EMPFlowAll.2021-11-29-T131449-R27984_output.h5 | mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYSVAL.e6337_s3681_r13144_p4856 | mc20_13TeV:mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.recon.AOD.e6337_s3681_r13144 | 95fba671 |
| MC20d - Z' Extended (With QSP, Yes shower weights) | user.alfroch.800030.btagTraining.e7954_s3681_r13144_p4856.EMPFlowAll.2021-12-08-T175903-R25911_output.h5 | mc20_13TeV.800030.Py8EG_A14NNPDF23LO_flatpT_Zprime_Extended.deriv.DAOD_PHYSVAL.e7954_s3681_r13144_p4856 | mc20_13TeV.800030.Py8EG_A14NNPDF23LO_flatpT_Zprime_Extended.recon.AOD.e7954_s3681_r13144 | 95fba671 |
| MC20a - Z' | user.alfroch.427080.btagTraining.e5362_s3681_r13167_p4856.EMPFlowAll.2021-12-08-T175903-R25911_output.h5 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.deriv.DAOD_PHYSVAL.e5362_s3681_r13167_p4856 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.recon.AOD.e5362_s3681_r13167 | 95fba671 |
| MC20d - Z' | user.alfroch.427080.btagTraining.e5362_s3681_r13144_p4856.EMPFlowAll.2021-12-08-T175903-R25911_output.h5 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.deriv.DAOD_PHYSVAL.e5362_s3681_r13144_p4856 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.recon.AOD.e5362_s3681_r13144 | 95fba671 |
| MC20d - Z' (Herwig 7) | | | mc20_13TeV.500567.MGH7EG_NNPDF23ME_Zprime.recon.AOD.e7954_s3681_r13144 | |
| MC20a - ttbar | user.alfroch.410470.btagTraining.e6337_s3681_r13167_p4931.EMPFlowAll.2022-01-20-T175255_output.h5 | mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYSVAL.e6337_s3681_r13167_p4931 | mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.recon.AOD.e6337_s3681_r13167 | 6268adab |
| MC20d - ttbar | user.alfroch.410470.btagTraining.e6337_s3681_r13144_p4931.EMPFlowAll.2022-01-20-T175255_output.h5 | mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYSVAL.e6337_s3681_r13144_r13146_p4931 | mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.recon.AOD.e6337_s3681_r13144 | 6268adab |
| MC20d - Z' Extended (With QSP, Yes shower weights) | user.alfroch.800030.btagTraining.e7954_s3681_r13144_p4931.EMPFlowAll.2022-01-20-T175255_output.h5 | mc20_13TeV.800030.Py8EG_A14NNPDF23LO_flatpT_Zprime_Extended.deriv.DAOD_PHYSVAL.e7954_s3681_r13144_p4931 | mc20_13TeV.800030.Py8EG_A14NNPDF23LO_flatpT_Zprime_Extended.recon.AOD.e7954_s3681_r13144 | 6268adab |
| MC20a - Z' | user.alfroch.427080.btagTraining.e5362_s3681_r13167_p4931.EMPFlowAll.2022-01-20-T175255_output.h5 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.deriv.DAOD_PHYSVAL.e5362_s3681_r13167_p4931 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.recon.AOD.e5362_s3681_r13167 | 6268adab |
| MC20d - Z' | user.alfroch.427080.btagTraining.e5362_s3681_r13144_p4931.EMPFlowAll.2022-01-20-T175255_output.h5 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.deriv.DAOD_PHYSVAL.e5362_s3681_r13144_p4931 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.recon.AOD.e5362_s3681_r13144 | 6268adab |
| MC20d - Z' (Herwig 7) | user.alfroch.500567.btagTraining.e7954_s3681_r13144_p4931.EMPFlowAll.2022-01-20-T175255_output.h5 | mc20_13TeV.500567.MGH7EG_NNPDF23ME_Zprime.deriv.DAOD_PHYSVAL.e7954_s3681_r13144_p4931 | mc20_13TeV.500567.MGH7EG_NNPDF23ME_Zprime.recon.AOD.e7954_s3681_r13144 | 6268adab |
??? info "Release 22 Samples with Muons and high Statistics (p4856)"
The round 2 release 22 samples with RNNIP, DL1* and DIPS. Muon information is added (softMuon), as well as information for GNN training. The default and loose track selections are added: default tracks are called `tracks` and loose tracks are called `tracks_loose`.
| Sample | h5 ntuples | DAOD_PHYSVAL derivations| AOD | TDD hash |
| ------------- | ---------------- | ----------------------- | ---------------- | -------- |
| MC20d - ttbar | user.alfroch.410470.btagTraining.e6337_s3681_r13144_p4856.EMPFlowAll.2021-11-29-T131449-R27984_output.h5 | mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYSVAL.e6337_s3681_r13144_p4856 | mc20_13TeV:mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.recon.AOD.e6337_s3681_r13144 | 95fba671 |
| MC20d - Z' Extended (With QSP, Yes shower weights) | user.alfroch.800030.btagTraining.e7954_s3681_r13144_p4856.EMPFlowAll.2021-12-08-T175903-R25911_output.h5 | mc20_13TeV.800030.Py8EG_A14NNPDF23LO_flatpT_Zprime_Extended.deriv.DAOD_PHYSVAL.e7954_s3681_r13144_p4856 | mc20_13TeV.800030.Py8EG_A14NNPDF23LO_flatpT_Zprime_Extended.recon.AOD.e7954_s3681_r13144 | 95fba671 |
| MC20a - Z' | user.alfroch.427080.btagTraining.e5362_s3681_r13167_p4856.EMPFlowAll.2021-12-08-T175903-R25911_output.h5 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.deriv.DAOD_PHYSVAL.e5362_s3681_r13167_p4856 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.recon.AOD.e5362_s3681_r13167 | 95fba671 |
| MC20d - Z' | user.alfroch.427080.btagTraining.e5362_s3681_r13144_p4856.EMPFlowAll.2021-12-08-T175903-R25911_output.h5 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.deriv.DAOD_PHYSVAL.e5362_s3681_r13144_p4856 | mc20_13TeV.427080.Pythia8EvtGen_A14NNPDF23LO_flatpT_Zprime.recon.AOD.e5362_s3681_r13144 | 95fba671 |
| MC20d - Z' (Herwig 7) | | | mc20_13TeV.500567.MGH7EG_NNPDF23ME_Zprime.recon.AOD.e7954_s3681_r13144 | |
???+ warning "Wrong scores stored for VR track jet taggers"
......
......@@ -6,7 +6,7 @@ Training ntuples are produced using the [training-dataset-dumper](https://gitlab
### Preprocessing
The motivation for preprocessing the training samples results from the fact that the input datasets are highly imbalanced in their flavour composition. While there are large quantities of light jets, the fraction of b-jets is small and the fraction of other flavours is even smaller.
A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).
In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
Another approach is to tell the network how important samples from each class are. For a majority class, for example, you can reduce the impact of its samples on the training. You can do this by assigning a weight to each sample and using it to weight the loss function during training (a minimal sketch of this idea is shown below).
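A minimal sketch of this class-weighting idea (illustration only, not the Umami implementation; the labels, counts and dummy predictions are made up):
```python
import numpy as np
import tensorflow as tf

# Made-up integer labels: 0 = light jets, 1 = c-jets, 2 = b-jets
labels = np.array([0] * 8000 + [1] * 1500 + [2] * 500)

# Weight each class inversely to its abundance, so rare classes
# contribute as much to the total loss as the majority class.
class_counts = np.bincount(labels)
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
sample_weights = class_weights[labels]

# Dummy network predictions, just to show how the weights enter the loss.
predictions = np.full((len(labels), 3), 1.0 / 3.0)
loss = tf.keras.losses.CategoricalCrossentropy()(
    tf.keras.utils.to_categorical(labels, num_classes=3),
    predictions,
    sample_weight=sample_weights,
)
print(float(loss))
```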
### Hybrid samples
......@@ -54,7 +54,7 @@ For the `HadronConeExclTruthLabelID` labelling, the categories `4` and `44` as w
## Ntuple preparation
The jets used for the training and validation of the taggers are taken from ttbar and Z' events. Different flavours can be used and combined to prepare different datasets for training/evaluation. The standard classes used are `bjets`, `cjets` and `ujets` (light jets).
After the ntuple production (training-dataset-dumper) the samples have to be further processed using the Umami [`preprocessing.py`](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/umami/preprocessing.py) script. The preprocessing script is configured using a dedicated configuration file.
See [`examples/PFlow-Preprocessing.yaml`](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/examples/PFlow-Preprocessing.yaml) for an example of a preprocessing config file.
......@@ -280,7 +280,7 @@ preparation:
path: *sample_path
file: MC16d-inclusive_testing_zprime_PFlow.h5
```
In the `preparation` section, the size of the batches which are loaded from the ntuples is defined in `batchsize`. The exact paths of the ntuples are defined in `ntuples`. You define there where the ttbar and zprime ntuples are saved and which files to use (you can use wildcards here!). The `file_pattern` defines the files while `path` defines the absolute path to the folder where they are saved. `*ntuple_path` is the path to the ntuples defined in the `parameters` file.
The last part is the exact splitting of the flavours. In `samples`, you define for each of ttbar/zprime and training/validation/testing the flavours you want to use. You need to give a type (ttbar/zprime), a category (flavour or `inclusive`) and the number of jets you want for this specific flavour. You also need to apply the template cuts we defined already. The `f_output` defines where the output file is saved. `path` defines the folder, `file` defines the name.
In the example above, we specify the paths for `ttbar` and `zprime` ntuples. Since we define them there, we can then use these ntuples in the `samples` section. So if you want to use e.g. Z+jets ntuples for bb-jets, define the corresponding `zjets` entry in the ntuples section before using it in the `samples` section.
......@@ -347,8 +347,8 @@ sampling:
# Bool, if track information (for DIPS etc.) are saved.
save_tracks: True
# Name of the track collection to use.
tracks_name: "tracks"
# Name(s) of the track collection(s) to use.
tracks_names: "tracks"
# this stores the indices per sample into an intermediate file
intermediate_index_file: *intermediate_index_file
......@@ -371,17 +371,29 @@ In `sampling`, we can define the method which is used in the preprocessing for r
Another important part is the `class_labels` list which is defined here. You can define which flavours are used in the preprocessing. The names of the available flavours can be found [here](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/umami/configs/global_config.yaml). Add their names to the list here to include them in the preprocessing. **PLEASE KEEP THE ORDERING CONSTANT! THIS IS VERY IMPORTANT**. This list must be the same as the one in the train config (a small illustration of why the ordering matters is shown below)!
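To see why the ordering matters, a small illustration (the orderings shown here are hypothetical):
```python
# Illustration only: the position in class_labels defines the label index,
# so the same ordering must be used in preprocessing and training.
class_labels = ["bjets", "cjets", "ujets"]
label_map = {flavour: index for index, flavour in enumerate(class_labels)}
print(label_map)  # {'bjets': 0, 'cjets': 1, 'ujets': 2}
# If the training config used ["ujets", "cjets", "bjets"] instead,
# index 0 would suddenly mean light jets and the targets would be scrambled.
```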
The `options` are some options for the different resampling methods. You need to define the sampling variables which are used for resampling. For example, if you want to resample in `pt_btagJes` and `absEta_btagJes` bins, you just define them with their respective bins.
Another thing you need to define is the list of `samples` which are to be resampled. You need to define them for `ttbar` and `zprime`. The samples defined here are the ones we prepared in the step above. To ensure a smooth hybrid sample of ttbar and zprime, we need to define some empirically derived values for the ttbar samples in `custom_njets_initial`.
`fractions` gives us the fractions of ttbar and zprime in the final training sample. These values need to add up to 1! The `save_tracks` and the `tracks_name` options define the using of tracks. `save_tracks` is bool while `tracks_name` is a string. The latter is the name of the tracks how they are called in the .h5 files coming from the dumper. After the preparation stage, they will have the name `tracks`. The rest of the variables are pretty self-explanatory.
`fractions` gives us the fractions of ttbar and zprime in the final training sample. These values need to add up to 1! The `save_tracks` and `tracks_names` options control the use of tracks. `save_tracks` is a bool while `tracks_names` is a string or a list of strings; the latter gives the name(s) of the track datasets as they are called in the .h5 files coming from the dumper, and multiple track datasets can be preprocessed simultaneously when a list is given. After the preparation stage, they will have the name `tracks`. The rest of the variables are pretty self-explanatory.
If you want to use the PDF sampling, have a look at the example config [PFlow-Preprocessing-taus.yaml](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/examples/PFlow-Preprocessing-taus.yaml).
For the resampling, the indices of the jets to use are saved in an intermediate indices `.h5` file. You can define its name and path in the [Preprocessing-parameters.yaml](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/examples/Preprocessing-parameters.yaml).
For the weighting method, the last two options are important (otherwise they are not used). The `weighting_target_flavour` defines the distribution relative to which the weights are calculated. If you want to attach these weights in the final training config, you need to set `bool_attach_sample_weights` to `True`. For all other resampling methods, this should be `False`. A minimal sketch of this weight calculation is shown after the table below.
| Setting | used in method | Explanation |
| ------ | ------ | ------ |
| `sampling_variables` | `all` | Needs exactly 2 variables. Sampling variables which are used for resampling. For example, if you want to resample in `pt_btagJes` and `absEta_btagJes` bins, you just define them with their respective bins. They are defined as a list of dicts of the form `[{var_name1:{ bins: <bins>}}, {var_name2:{ bins: <bins>}}]` |
| `samples` | `all` | Samples which are to be resampled. The samples defined in here are the ones we prepared in the step above. |
| `custom_njets_initial` | `count` | Number of jets used per sample. To ensure a smooth hybrid sample of ttbar and zprime, some empirically derived values need to be defined for the ttbar samples. |
| `fractions` | `all` | Fractions of used samples in the final training sample. |
| `njets` | | Number of target jets to be taken (across all categories). If set to -1: use the maximum available numbers (limited by the fractions ratio). |
| `save_tracks` | `all` | Flag if storing tracks. |
| `tracks_names` | `all` | Name(s) of the track collections as they are called in the .h5 files coming from the dumper. |
| `intermediate_index_file` | `all` | Stores the indices per sample into an intermediate file. |
| `weighting_target_flavour` | `weighting` | Defines the flavour distribution relative to which the weights are calculated. |
| `bool_attach_sample_weights` | `weighting` | Whether to attach these weights in the final training config. For all other resampling methods, this should be `False`. |
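As an illustration of how such weights could be derived (a minimal sketch, not the Umami implementation; the binning and the made-up pt/|eta| values are only for this example), the weights for one flavour are the ratio of the target-flavour histogram to that flavour's histogram in the two resampling variables:
```python
import numpy as np

# Made-up pt / |eta| values for two flavours and a common binning
rng = np.random.default_rng(42)
pt_bins = np.linspace(20_000, 250_000, 20)
eta_bins = np.linspace(0, 2.5, 10)

pt_b, eta_b = rng.exponential(60_000, 10_000) + 20_000, rng.uniform(0, 2.5, 10_000)
pt_u, eta_u = rng.exponential(40_000, 50_000) + 20_000, rng.uniform(0, 2.5, 50_000)

# 2D histograms of the target flavour (here b-jets) and the flavour to weight
hist_target, _, _ = np.histogram2d(pt_b, eta_b, bins=[pt_bins, eta_bins])
hist_ujets, _, _ = np.histogram2d(pt_u, eta_u, bins=[pt_bins, eta_bins])

# Per-bin weight: how much each light jet must count to match the b-jet shape
ratio = np.divide(
    hist_target, hist_ujets, out=np.zeros_like(hist_target), where=hist_ujets > 0
)

# Look up the weight for every light jet from its (pt, |eta|) bin
i_pt = np.clip(np.digitize(pt_u, pt_bins) - 1, 0, len(pt_bins) - 2)
i_eta = np.clip(np.digitize(eta_u, eta_bins) - 1, 0, len(eta_bins) - 2)
ujet_weights = ratio[i_pt, i_eta]
print(ujet_weights.mean())
```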
### General settings
| Setting | Explanation |
| ------ | ---------------- |
......@@ -453,7 +465,7 @@ The steps defined in the following segment are only performed on the training sa
preprocessing.py --config <path to config file> --resampling
```
If you want to also use the tracks of the jets, you need to set the option `save_tracks` in the preprocessing config to `True`. If the tracks have a different name than `"tracks"` in the .h5 files coming from the dumper, you can also set change `tracks_name` to your needs. Track information are not needed for the DL1r but for DIPS and Umami.
If you also want to use the tracks of the jets, you need to set the option `save_tracks` in the preprocessing config to `True`. If the tracks have a different name than `"tracks"` in the .h5 files coming from the dumper, you can also change `tracks_names` to your needs. Track information is not needed for DL1r but is needed for DIPS and Umami.
2\. Retrieving scaling and shifting factors:
......@@ -487,4 +499,4 @@ There are several training and validation/test samples to produce. See the follo
## Ntuple Preparation for bb-jets
TODO: Rewrite this!
The double b-jets will be taken from Znunu and Zmumu samples. The framework still requires some updates in order to process those during the hybrid sample creation stage.
......@@ -61,6 +61,9 @@ var_dict: <path_palce_holder>/umami/umami/configs/Dips_Variables.yaml
exclude: null
# Tracks dataset name
tracks_name: "tracks"
# Values for the neural network
NN_structure:
# Decide, which tagger is used
......
......@@ -61,6 +61,9 @@ var_dict: <path_palce_holder>/umami/umami/configs/Dips_Variables.yaml
exclude: null
# Tracks dataset name
tracks_name: "tracks"
# Values for the neural network
NN_structure:
# Decide, which tagger is used
......
......@@ -242,8 +242,8 @@ sampling:
# Bool, if track information (for DIPS etc.) are saved.
save_tracks: True
# Name of the track collection to use.
tracks_name: "tracks"
# Name(s) of the track collection(s) to use.
tracks_names: ["tracks"]
# this stores the indices per sample into an intermediate file
intermediate_index_file: *intermediate_index_file
......
......@@ -61,6 +61,9 @@ var_dict: <path_palce_holder>/umami/umami-git/umami/configs/Umami_Variables.yaml
exclude: null
# Tracks dataset name
tracks_name: "tracks"
# number of files to be loaded in parallel when using TF Records as input files
nfiles: 5
......
......@@ -56,3 +56,13 @@ doc_string_check:
pylint:
  <<: *pylint_template
  <<: *linting_rules_template

update_todos:
  stage: publish
  image: gitlab-registry.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/umamibase-plus:latest
  script:
    - python pipelines/gitlab-update-todo.py
  only:
    - master@atlas-flavor-tagging-tools/algorithms/umami
  dependencies:
    - linter
"""Checks repository in master and updates ToDo issue."""
import fnmatch
import os
from sys import stdout
import gitlab
import yaml
from pylint import epylint as lint
def pylint_fixmes():
"""
Updates issue dedicated to ToDos with all todos found in the code.
Returns
-------
list
pylint_files with Todos
list
pylint_msgs with Todos
"""
(pylint_stdout, _) = lint.py_run("umami/ --disable=all --enable=fixme ", True)
pylint_stdout = pylint_stdout.read()
pylint_files, pylint_msgs = [], []
for line in pylint_stdout.splitlines():
if "umami/" not in line:
continue
file_name, todo_message = line.split(" warning (W0511, fixme, ) ")
if "TODO: " in todo_message:
todo_message = todo_message.replace("TODO: ", "")
elif "TODO " in todo_message:
todo_message = todo_message.replace("TODO ", "")
pylint_files.append(file_name[:-1])
pylint_msgs.append(todo_message)
return pylint_files, pylint_msgs
if __name__ == "__main__":
todo_files, todo_msgs = pylint_fixmes()
issue_description = (
"This issue shows the TODOs specified in the code. "
"It is updated each time the CI in the master branch is running."
"(*Please do not modify the issue description - it will be overwritten*)\n\n"
)
for files, msgs in zip(todo_files, todo_msgs):
issue_description += f"- [ ] {files} - *{msgs}*\n"
print(issue_description)
# connecting to the CERN gitlab API
gl = gitlab.Gitlab(
"https://gitlab.cern.ch",
private_token=os.environ["API_UMAMIBOT_TOKEN"],
)
# specifying the project, in this case umami
project = gl.projects.get("79534")
# issue used to track changes
issue = project.issues.get(120)
# post new issue description
issue.description = issue_description
issue.save()
......@@ -42,7 +42,7 @@ def GetParser():
"--var_dict",
required=True,
type=str,
help="""Dictionary (json) with training variables.""",
help="""Dictionary (yaml) with training variables.""",
)
parser.add_argument(
"-o",
......@@ -100,6 +100,7 @@ class config:
    def __init__(self, preprocess_config):
        self.dict_file = preprocess_config
        self.preparation = {"class_labels": ["ujets", "cjets", "bjets"]}
        self.tracks_name = "tracks"
def __run():
......@@ -117,13 +118,13 @@ def __run():
"only one of them needs to be used"
)
training_config = utt.Configuration(args.config)
preprocess_config = upt.Configuration(
training_config.preprocess_config
)
preprocess_config = upt.Configuration(training_config.preprocess_config)
class_labels = training_config.NN_structure["class_labels"]
tracks_name = training_config.tracks_name
elif args.scale_dict is not None:
preprocess_config = config(args.scale_dict)
class_labels = preprocess_config.preparation["class_labels"]
tracks_name = preprocess_config.tracks_name
else:
raise ValueError(
"Missing option, either --config or --scale_dict "
......@@ -139,13 +140,12 @@ def __run():
args.var_dict,
preprocess_config,
class_labels,
tracks_name=tracks_name,
nJets=int(10e6),
exclude=None,
)
logger.info(f"Evaluated jets: {len(Y_test)}")
pred_dips, pred_umami = load_model_umami(
args.model, X_test_trk, X_test_jet
)
pred_dips, pred_umami = load_model_umami(args.model, X_test_trk, X_test_jet)
pred_model = pred_dips if "dips" in args.tagger.lower() else pred_umami
elif "dips" in args.tagger.lower():
......@@ -154,6 +154,7 @@ def __run():
args.var_dict,
preprocess_config,
class_labels,
tracks_name=tracks_name,
nJets=int(10e6),
)
logger.info(f"Evaluated jets: {len(Y_test)}")
......@@ -206,9 +207,7 @@ def __run():
).flatten()
for sampleDiff in sampleDiffs:
df_select = df.query(f"diff>{sampleDiff} and ntrks<{args.ntracks_max}")
diff = round(
len(df_select) / len(df[df["ntrks"] < args.ntracks_max]) * 100, 2
)
diff = round(len(df_select) / len(df[df["ntrks"] < args.ntracks_max]) * 100, 2)
print(f"Differences off {sampleDiff:.1e} {diff}%")
if diff == 0:
break
......
"""Script to determine efficiency working point cut values from tagger scores in input samples."""
from umami.configuration import logger, global_config # isort:skip
from argparse import ArgumentParser
import numpy as np
import umami.train_tools as utt
......
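For context on what such a working-point determination does (a minimal sketch with made-up numbers, not the script's actual implementation): for a desired b-jet efficiency, the cut value on the tagger discriminant is the corresponding quantile of the b-jet score distribution.
```python
import numpy as np

# Made-up b-tagging discriminant scores for b-jets
rng = np.random.default_rng(0)
disc_bjets = rng.normal(loc=3.0, scale=2.0, size=100_000)

# For a 77% b-efficiency working point, 77% of b-jets must pass the cut,
# so the cut sits at the (100 - 77)th percentile of the b-jet scores.
for eff in (60, 70, 77, 85):
    cut_value = np.percentile(disc_bjets, 100 - eff)
    print(f"WP {eff}%: cut at discriminant > {cut_value:.3f}")
```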
......@@ -3,10 +3,8 @@
import argparse
import json
import yaml
from umami.configuration import logger
from umami.tools import yaml_loader
from umami.preprocessing_tools import GetVariableDict
def GetParser():
......@@ -52,16 +50,24 @@ def GetParser():
default="tracks_ip3d_sd0sort",
help="Track selection name.",
)
parser.add_argument(
"--tracks_name",
type=str,
default="tracks",
help="Tracks dataset name in .h5 training/testing files.",
)
return parser.parse_args()
def GetTrackVariables(scale_dict, variable_config):
noNormVars = variable_config["track_train_variables"]["noNormVars"]
logNormVars = variable_config["track_train_variables"]["logNormVars"]
jointNormVars = variable_config["track_train_variables"]["jointNormVars"]
def GetTrackVariables(scale_dict, variable_config, tracks_name):
noNormVars = variable_config["track_train_variables"][tracks_name]["noNormVars"]
logNormVars = variable_config["track_train_variables"][tracks_name]["logNormVars"]
jointNormVars = variable_config["track_train_variables"][tracks_name][
"jointNormVars"
]
track_dict = scale_dict["tracks"]
track_dict = scale_dict[tracks_name]
track_variables = []
for elem in noNormVars:
v_dict = {}
......@@ -76,6 +82,8 @@ def GetTrackVariables(scale_dict, variable_config):
v_dict["name"] = "log_ptfrac"
elif elem == "dr":
v_dict["name"] = "log_dr_nansafe"
elif elem == "z0RelativeToBeamspotUncertainty":
v_dict["name"] = "log_z0RelativeToBeamspotUncertainty"
else:
raise ValueError(f"{elem} not known in logNormVars. Please check.")
v_dict["offset"] = -1.0 * track_dict[elem]["shift"]
......@@ -122,15 +130,16 @@ def GetJetVariables(scale_dict, variable_config):
def __run():
"""main part of script generating json file"""
args = GetParser()
with open(args.var_dict, "r") as conf:
variable_config = yaml.load(conf, Loader=yaml_loader)
variable_config = GetVariableDict(args.var_dict)
if "dips" in args.tagger.lower():
logger.info("Starting processing DIPS variables.")
with open(args.scale_dict, "r") as f:
scale_dict = json.load(f)
track_variables = GetTrackVariables(scale_dict, variable_config)
track_variables = GetTrackVariables(
scale_dict, variable_config, args.tracks_name
)
logger.info("Found %i variables" % len(track_variables))
inputs = {}
......@@ -174,9 +183,7 @@ def __run():
logger.info("Detected tau output in tagger.")
labels_tau = ["pu", "pc", "pb", "ptau"]
logger.info(f"Using labels {labels_tau}")
lwtnn_var_dict["outputs"] = [
{"labels": labels_tau, "name": args.tagger}
]
lwtnn_var_dict["outputs"] = [{"labels": labels_tau, "name": args.tagger}]
else:
lwtnn_var_dict["outputs"] = [
{"labels": ["pu", "pc", "pb"], "name": args.tagger}
......
......@@ -66,7 +66,8 @@ custom_defaults_vars:
JetFitterSecondaryVertex_nTracks: 0
JetFitterSecondaryVertex_energyFraction: 0
track_train_variables:
# Standard tracks training variables
.tracks_variables: &tracks_variables
noNormVars:
- IP3D_signed_d0_significance
- IP3D_signed_z0_significance
......@@ -85,3 +86,9 @@ track_train_variables:
- numberOfSCTHits
- btagIp_d0
- btagIp_z0SinTheta
track_train_variables:
  tracks:
    <<: *tracks_variables
  tracks_loose:
    <<: *tracks_variables