Commit 5194d101 authored by Manuel Guth

Merge branch 'track-selection-prep' into 'master'

Multiple Tracks datasets in preprocessing stage

See merge request atlas-flavor-tagging-tools/algorithms/umami!285
parents eea69d23 7c708eb9
......@@ -29,4 +29,4 @@ Preprocessing-parameters-*.yaml
preprocessing_*/
test_train_*/
# ignoring any test directory
test-*/
\ No newline at end of file
test-*/
......@@ -196,6 +196,7 @@ The different options are briefly explained here:
| `zpext_test_files` | Dict | Optional | Here you can define different zpext test samples that are used in [`evaluate_model.py`](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/umami/evaluate_model.py). These test samples need to be defined in the dict structure shown in the example. The name of the dict entry is irrelevant, while the `Path` and `data_set_name` are important. The `data_set_name` needs to be unique: it is the identifier/name of the dataset in the evaluation file which is used for plotting. For test samples, all samples from the training-dataset-dumper can be used without preprocessing, although the preprocessing of Umami produces test samples to ensure orthogonality of the jets with respect to the train sample. |
| `var_dict` | String | Necessary | Path to the variable dict used in the `preprocess_config` to produce the train sample. |
| `exclude` | List | Necessary | List of variables that are excluded from training. Only compatible with DL1r training. To include all, just give an empty list. |
| `tracks_name` | String | Necessary* | Name of the tracks dataset to use for training and evaluation; the default is "tracks". See the config sketch below the table. <br />* ***This option is necessary when using tracks. When working with old preprocessed files (before January 2022), this option has to be removed from the config file to ensure compatibility.*** |
| `NN_structure` | None | Necessary | A dict where all important information for the training is defined. |
| `tagger` | String | Necessary | Name of the tagger that is used/to be trained. |
| `lr` | Float | Necessary | Learning rate which is used for training. |
......
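To make these options concrete, here is a minimal sketch of the corresponding train-config fragment. The path is a placeholder and the `NN_structure` values are illustrative examples, not recommended settings:

```yaml
# Variable dict used by the preprocess_config that produced the train sample
var_dict: <path_place_holder>/umami/umami/configs/Dips_Variables.yaml
# Variables excluded from training (DL1r only); null/empty list includes all
exclude: null
# Tracks dataset name; remove this key when working with files
# preprocessed before January 2022
tracks_name: "tracks"
NN_structure:
  tagger: "dips"  # example tagger
  lr: 0.001       # example learning rate
```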
......@@ -347,8 +347,8 @@ sampling:
    # Bool, if track information (for DIPS etc.) are saved.
    save_tracks: True
    # Name of the track collection to use.
    tracks_name: "tracks"
    # Name(s) of the track collection(s) to use.
    tracks_names: "tracks"
    # this stores the indices per sample into an intermediate file
    intermediate_index_file: *intermediate_index_file
......@@ -373,7 +373,7 @@ Another important part are the `class_labels` which are defined here. You can de
The `options` are some options for the different resampling methods. You need to define the sampling variables which are used for resampling. For example, if you want to resample in `pt_btagJes` and `absEta_btagJes` bins, you just define them with their respective bins.
Another thing you need to define is the `samples` section, i.e. the samples that are to be resampled. You need to define them for `ttbar` and `zprime`. The samples defined in here are the ones we prepared in the step above. To ensure a smooth hybrid sample of ttbar and zprime, we need to define some empirically derived values for the ttbar samples in `custom_njets_initial`.
`fractions` gives us the fractions of ttbar and zprime in the final training sample. These values need to add up to 1! The `save_tracks` and the `tracks_name` options define the use of tracks. `save_tracks` is a bool, while `tracks_name` is a string. The latter is the name of the track collection as it is called in the .h5 files coming from the dumper. After the preparation stage, they will have the name `tracks`. The rest of the variables are pretty self-explanatory.
`fractions` gives us the fractions of ttbar and zprime in the final training sample. These values need to add up to 1! The `save_tracks` and the `tracks_names` options define the use of tracks. `save_tracks` is a bool, while `tracks_names` is a string or a list of strings. The latter gives the name(s) of the track collections as they are called in the .h5 files coming from the dumper; when a list is given, multiple track datasets are preprocessed simultaneously. After the preparation stage, they will have the name `tracks`. The rest of the variables are pretty self-explanatory; a sketch of these options follows below.
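As an illustration, a resampling block that bins in `pt_btagJes` and `absEta_btagJes` and preprocesses two track collections at once could look like this sketch; the bin definitions, fractions, and the second collection name `tracks_loose` are example values only:

```yaml
sampling:
  options:
    sampling_variables:
      # Example bins: [start, stop, number of bins]
      - pt_btagJes:
          bins: [[0, 600000, 351], [650000, 6000000, 84]]
      - absEta_btagJes:
          bins: [0, 2.5, 10]
    # Fractions of ttbar and zprime in the final training sample (add up to 1)
    fractions:
      ttbar: 0.65
      zprime: 0.35
    # Save track information and name the collection(s) to preprocess;
    # a single string is also accepted
    save_tracks: True
    tracks_names: ["tracks", "tracks_loose"]
```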
If you want to use the PDF sampling, have a look at the example config [PFlow-Preprocessing-taus.yaml](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/examples/PFlow-Preprocessing-taus.yaml).
For the resampling, the indices of the jets to use are saved in an intermediate indices `.h5` file. You can define a name and path in the [Preprocessing-parameters.yaml](https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/examples/Preprocessing-parameters.yaml).
......@@ -387,7 +387,7 @@ For the resampling, the indicies of the jets to use are saved in an intermediate
| `fractions` | `all` | Fractions of used samples in the final training sample. |
| `njets` | | Number of target jets to be taken (through all categories). If set to -1: max out to target numbers (limited by fractions ratio) |
| `save_tracks` | `all` | Flag whether tracks are stored. |
| `tracks_name` | `all` | Name of the track collection as it is called in the .h5 files coming from the dumper. |
| `tracks_names` | `all` | Name(s) of the track collection(s) as they are called in the .h5 files coming from the dumper. |
| `intermediate_index_file` | `all` | Stores the indices per sample into an intermediate file. |
| `weighting_target_flavour` | `weighting` | Defines the distribution relative to which the weights are calculated. |
| `bool_attach_sample_weights` | `weighting` | Whether to attach these weights in the final training config. For all other resampling methods, this should be `False`. |
......@@ -465,7 +465,7 @@ The steps defined in the following segment are only performed on the training sa
preprocessing.py --config <path to config file> --resampling
```
If you also want to use the tracks of the jets, you need to set the option `save_tracks` in the preprocessing config to `True`. If the tracks have a different name than `"tracks"` in the .h5 files coming from the dumper, you can also change `tracks_name` to your needs. Track information is not needed for DL1r, but it is for DIPS and Umami.
If you also want to use the tracks of the jets, you need to set the option `save_tracks` in the preprocessing config to `True`. If the tracks have a different name than `"tracks"` in the .h5 files coming from the dumper, you can also change `tracks_names` to your needs. Track information is not needed for DL1r, but it is for DIPS and Umami. Note that the variable dict then has to provide one entry per track collection, as sketched below.
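When multiple track collections are preprocessed, the variable dict must provide one `track_train_variables` entry per collection. A minimal sketch using the `&tracks_variables` anchor pattern from the example configs (variable lists abridged; `tracks_loose` is just an example second collection):

```yaml
# Hidden key holding the shared variable lists (abridged)
.tracks_variables: &tracks_variables
  noNormVars:
    - IP3D_signed_d0_significance
  logNormVars:
    - ptfrac
    - dr
  jointNormVars: []

# One entry per track collection, each merging in the shared lists
track_train_variables:
  tracks:
    <<: *tracks_variables
  tracks_loose:
    <<: *tracks_variables
```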
2\. Retrieving scaling and shifting factors:
......
......@@ -39,6 +39,9 @@ var_dict: <path_palce_holder>/umami/umami/configs/Dips_Variables.yaml
exclude: null
# Tracks dataset name
tracks_name: "tracks"
# Values for the neural network
NN_structure:
# Decide, which tagger is used
......
......@@ -39,6 +39,9 @@ var_dict: <path_palce_holder>/umami/umami/configs/Dips_Variables.yaml
exclude: null
# Tracks dataset name
tracks_name: "tracks"
# Values for the neural network
NN_structure:
# Decide, which tagger is used
......
......@@ -242,8 +242,8 @@ sampling:
    # Bool, if track information (for DIPS etc.) are saved.
    save_tracks: True
    # Name of the track collection to use.
    tracks_name: "tracks"
    # Name(s) of the track collection(s) to use.
    tracks_names: ["tracks"]
    # this stores the indices per sample into an intermediate file
    intermediate_index_file: *intermediate_index_file
......
......@@ -31,6 +31,9 @@ var_dict: <path_palce_holder>/umami/umami-git/umami/configs/Umami_Variables.yaml
exclude: null
# Tracks dataset name
tracks_name: "tracks"
# number of files to be loaded in parallel when using TF Records as input files
nfiles: 5
......
......@@ -42,7 +42,7 @@ def GetParser():
"--var_dict",
required=True,
type=str,
help="""Dictionary (json) with training variables.""",
help="""Dictionary (yaml) with training variables.""",
)
parser.add_argument(
"-o",
......@@ -100,6 +100,7 @@ class config:
    def __init__(self, preprocess_config):
        self.dict_file = preprocess_config
        self.preparation = {"class_labels": ["ujets", "cjets", "bjets"]}
        self.tracks_name = "tracks"
def __run():
......@@ -117,13 +118,13 @@ def __run():
"only one of them needs to be used"
)
training_config = utt.Configuration(args.config)
preprocess_config = upt.Configuration(
training_config.preprocess_config
)
preprocess_config = upt.Configuration(training_config.preprocess_config)
class_labels = training_config.NN_structure["class_labels"]
tracks_name = training_config.tracks_name
elif args.scale_dict is not None:
preprocess_config = config(args.scale_dict)
class_labels = preprocess_config.preparation["class_labels"]
tracks_name = preprocess_config.tracks_name
else:
raise ValueError(
"Missing option, either --config or --scale_dict "
......@@ -139,13 +140,12 @@ def __run():
            args.var_dict,
            preprocess_config,
            class_labels,
            tracks_name=tracks_name,
            nJets=int(10e6),
            exclude=None,
        )
        logger.info(f"Evaluated jets: {len(Y_test)}")
        pred_dips, pred_umami = load_model_umami(
            args.model, X_test_trk, X_test_jet
        )
        pred_dips, pred_umami = load_model_umami(args.model, X_test_trk, X_test_jet)
        pred_model = pred_dips if "dips" in args.tagger.lower() else pred_umami
    elif "dips" in args.tagger.lower():
......@@ -154,6 +154,7 @@ def __run():
            args.var_dict,
            preprocess_config,
            class_labels,
            tracks_name=tracks_name,
            nJets=int(10e6),
        )
        logger.info(f"Evaluated jets: {len(Y_test)}")
......@@ -206,9 +207,7 @@ def __run():
    ).flatten()
    for sampleDiff in sampleDiffs:
        df_select = df.query(f"diff>{sampleDiff} and ntrks<{args.ntracks_max}")
        diff = round(
            len(df_select) / len(df[df["ntrks"] < args.ntracks_max]) * 100, 2
        )
        diff = round(len(df_select) / len(df[df["ntrks"] < args.ntracks_max]) * 100, 2)
        print(f"Differences of {sampleDiff:.1e} {diff}%")
        if diff == 0:
            break
......
"""Script to determine efficiency working point cut values from tagger scores in input samples."""
from umami.configuration import logger, global_config # isort:skip
from argparse import ArgumentParser
import numpy as np
import umami.train_tools as utt
......
......@@ -3,10 +3,8 @@
import argparse
import json
import yaml
from umami.configuration import logger
from umami.tools import yaml_loader
from umami.preprocessing_tools import GetVariableDict
def GetParser():
......@@ -52,16 +50,24 @@ def GetParser():
default="tracks_ip3d_sd0sort",
help="Track selection name.",
)
parser.add_argument(
"--tracks_name",
type=str,
default="tracks",
help="Tracks dataset name in .h5 training/testing files.",
)
return parser.parse_args()
def GetTrackVariables(scale_dict, variable_config):
    noNormVars = variable_config["track_train_variables"]["noNormVars"]
    logNormVars = variable_config["track_train_variables"]["logNormVars"]
    jointNormVars = variable_config["track_train_variables"]["jointNormVars"]
def GetTrackVariables(scale_dict, variable_config, tracks_name):
    noNormVars = variable_config["track_train_variables"][tracks_name]["noNormVars"]
    logNormVars = variable_config["track_train_variables"][tracks_name]["logNormVars"]
    jointNormVars = variable_config["track_train_variables"][tracks_name][
        "jointNormVars"
    ]
    track_dict = scale_dict["tracks"]
    track_dict = scale_dict[tracks_name]
    track_variables = []
    for elem in noNormVars:
        v_dict = {}
......@@ -76,6 +82,8 @@ def GetTrackVariables(scale_dict, variable_config):
v_dict["name"] = "log_ptfrac"
elif elem == "dr":
v_dict["name"] = "log_dr_nansafe"
elif elem == "z0RelativeToBeamspotUncertainty":
v_dict["name"] = "log_z0RelativeToBeamspotUncertainty"
else:
raise ValueError(f"{elem} not known in logNormVars. Please check.")
v_dict["offset"] = -1.0 * track_dict[elem]["shift"]
......@@ -122,15 +130,16 @@ def GetJetVariables(scale_dict, variable_config):
def __run():
"""main part of script generating json file"""
args = GetParser()
with open(args.var_dict, "r") as conf:
variable_config = yaml.load(conf, Loader=yaml_loader)
variable_config = GetVariableDict(args.var_dict)
if "dips" in args.tagger.lower():
logger.info("Starting processing DIPS variables.")
with open(args.scale_dict, "r") as f:
scale_dict = json.load(f)
track_variables = GetTrackVariables(scale_dict, variable_config)
track_variables = GetTrackVariables(
scale_dict, variable_config, args.tracks_name
)
logger.info("Found %i variables" % len(track_variables))
inputs = {}
......@@ -174,9 +183,7 @@ def __run():
logger.info("Detected tau output in tagger.")
labels_tau = ["pu", "pc", "pb", "ptau"]
logger.info(f"Using labels {labels_tau}")
lwtnn_var_dict["outputs"] = [
{"labels": labels_tau, "name": args.tagger}
]
lwtnn_var_dict["outputs"] = [{"labels": labels_tau, "name": args.tagger}]
else:
lwtnn_var_dict["outputs"] = [
{"labels": ["pu", "pc", "pb"], "name": args.tagger}
......
......@@ -66,7 +66,8 @@ custom_defaults_vars:
  JetFitterSecondaryVertex_nTracks: 0
  JetFitterSecondaryVertex_energyFraction: 0
track_train_variables:
# Standard tracks training variables
.tracks_variables: &tracks_variables
  noNormVars:
    - IP3D_signed_d0_significance
    - IP3D_signed_z0_significance
......@@ -85,3 +86,9 @@ track_train_variables:
    - numberOfSCTHits
    - btagIp_d0
    - btagIp_z0SinTheta
track_train_variables:
  tracks:
    <<: *tracks_variables
  tracks_loose:
    <<: *tracks_variables
label: HadronConeExclTruthLabelID
train_variables:
  JetKinematics:
    - absEta_btagJes
    - pt_btagJes
  JetFitter:
    - JetFitter_isDefaults
    - JetFitter_mass
    - JetFitter_energyFraction
    - JetFitter_significance3d
    - JetFitter_nVTX
    - JetFitter_nSingleTracks
    - JetFitter_nTracksAtVtx
    - JetFitter_N2Tpair
    - JetFitter_deltaR
  JetFitterSecondaryVertex:
    - JetFitterSecondaryVertex_isDefaults
    - JetFitterSecondaryVertex_nTracks
    - JetFitterSecondaryVertex_mass
    - JetFitterSecondaryVertex_energy
    - JetFitterSecondaryVertex_energyFraction
    - JetFitterSecondaryVertex_displacement3d
    - JetFitterSecondaryVertex_displacement2d
    - JetFitterSecondaryVertex_maximumAllJetTrackRelativeEta # Modified name in R22. Was: maximumTrackRelativeEta
    - JetFitterSecondaryVertex_minimumAllJetTrackRelativeEta # Modified name in R22. Was: minimumTrackRelativeEta
    - JetFitterSecondaryVertex_averageAllJetTrackRelativeEta # Modified name in R22. Was: averageTrackRelativeEta
  SV1:
    - SV1_isDefaults
    - SV1_NGTinSvx
    - SV1_masssvx
    - SV1_N2Tpair
    - SV1_efracsvx
    - SV1_deltaR
    - SV1_Lxy
    - SV1_L3d
    - SV1_significance3d
  IP2D:
    - IP2D_isDefaults
    - IP2D_bu
    - IP2D_bc
    - IP2D_cu
  IP3D:
    - IP3D_isDefaults
    - IP3D_bu
    - IP3D_bc
    - IP3D_cu
# useful variables which one might want to keep but which are not used for training
spectator_variables:
  - DL1r_pb
  - DL1r_pu
custom_defaults_vars:
  JetFitter_energyFraction: 0
  JetFitter_significance3d: 0
  JetFitter_nVTX: -1
  JetFitter_nSingleTracks: -1
  JetFitter_nTracksAtVtx: -1
  JetFitter_N2Tpair: -1
  SV1_N2Tpair: -1
  SV1_NGTinSvx: -1
  SV1_efracsvx: 0
  JetFitterSecondaryVertex_nTracks: 0
  JetFitterSecondaryVertex_energyFraction: 0
# Standard tracks training variables
.tracks_variables: &tracks_variables
  noNormVars:
    - IP3D_signed_d0_significance
    - IP3D_signed_z0_significance
    - numberOfInnermostPixelLayerHits
    - numberOfNextToInnermostPixelLayerHits
    - numberOfInnermostPixelLayerSharedHits
    - numberOfInnermostPixelLayerSplitHits
    - numberOfPixelSharedHits
    - numberOfPixelSplitHits
    - numberOfSCTSharedHits
  logNormVars:
    - ptfrac
    - dr
  jointNormVars:
    - numberOfPixelHits
    - numberOfSCTHits
    - btagIp_d0
    - btagIp_z0SinTheta
track_train_variables:
  tracks:
    <<: *tracks_variables
  tracks_loose:
    <<: *tracks_variables
label: HadronConeExclTruthLabelID
track_labels:
tracks_labels:
  - truthOriginLabel
  - truthVertexIndex
......@@ -10,7 +10,7 @@ train_variables:
    - pt_btagJes
    - energy
track_train_variables:
.tracks_variables: &tracks_variables
  noNormVars: []
  logNormVars: []
  jointNormVars:
......@@ -39,4 +39,10 @@ track_train_variables:
    #- ambiRank
    #- chiSquaredOverNumberDoF
track_train_variables:
  tracks:
    <<: *tracks_variables
custom_defaults_vars:
......@@ -66,7 +66,7 @@ custom_defaults_vars:
  JetFitterSecondaryVertex_nTracks: 0
  JetFitterSecondaryVertex_energyFraction: 0
track_train_variables:
.tracks_variables: &tracks_variables
  noNormVars:
    - IP3D_signed_d0_significance
    - IP3D_signed_z0_significance
......@@ -85,3 +85,8 @@ track_train_variables:
    - numberOfSCTHits
    - btagIp_d0
    - btagIp_z0SinTheta
track_train_variables:
  tracks:
    <<: *tracks_variables
label: HadronConeExclTruthLabelID
train_variables:
  JetKinematics:
    - absEta_btagJes
    - pt_btagJes
  JetFitter:
    - JetFitter_isDefaults
    - JetFitter_mass
    - JetFitter_energyFraction
    - JetFitter_significance3d
    - JetFitter_nVTX
    - JetFitter_nSingleTracks
    - JetFitter_nTracksAtVtx
    - JetFitter_N2Tpair
    - JetFitter_deltaR
  JetFitterSecondaryVertex:
    - JetFitterSecondaryVertex_isDefaults
    - JetFitterSecondaryVertex_nTracks
    - JetFitterSecondaryVertex_mass
    - JetFitterSecondaryVertex_energy
    - JetFitterSecondaryVertex_energyFraction
    - JetFitterSecondaryVertex_displacement3d
    - JetFitterSecondaryVertex_displacement2d
    - JetFitterSecondaryVertex_maximumAllJetTrackRelativeEta # Modified name in R22. Was: maximumTrackRelativeEta
    - JetFitterSecondaryVertex_minimumAllJetTrackRelativeEta # Modified name in R22. Was: minimumTrackRelativeEta
    - JetFitterSecondaryVertex_averageAllJetTrackRelativeEta # Modified name in R22. Was: averageTrackRelativeEta
  SV1:
    - SV1_isDefaults
    - SV1_NGTinSvx
    - SV1_masssvx
    - SV1_N2Tpair
    - SV1_efracsvx
    - SV1_deltaR
    - SV1_Lxy
    - SV1_L3d
    - SV1_significance3d
  IP2D:
    - IP2D_isDefaults
    - IP2D_bu
    - IP2D_bc
    - IP2D_cu
  IP3D:
    - IP3D_isDefaults
    - IP3D_bu
    - IP3D_bc
    - IP3D_cu
# useful variables which one might want to keep but which are not used for training
spectator_variables:
  - DL1r_pb
  - DL1r_pu
custom_defaults_vars:
  JetFitter_energyFraction: 0
  JetFitter_significance3d: 0
  JetFitter_nVTX: -1
  JetFitter_nSingleTracks: -1
  JetFitter_nTracksAtVtx: -1
  JetFitter_N2Tpair: -1
  SV1_N2Tpair: -1
  SV1_NGTinSvx: -1
  SV1_efracsvx: 0
  JetFitterSecondaryVertex_nTracks: 0
  JetFitterSecondaryVertex_energyFraction: 0
.tracks_variables: &tracks_variables
  noNormVars:
    - IP3D_signed_d0_significance
    - IP3D_signed_z0_significance
    - numberOfInnermostPixelLayerHits
    - numberOfNextToInnermostPixelLayerHits
    - numberOfInnermostPixelLayerSharedHits
    - numberOfInnermostPixelLayerSplitHits
    - numberOfPixelSharedHits
    - numberOfPixelSplitHits
    - numberOfSCTSharedHits
  logNormVars:
    - ptfrac
    - dr
  jointNormVars:
    - numberOfPixelHits
    - numberOfSCTHits
    - btagIp_d0
    - btagIp_z0SinTheta
track_train_variables:
  tracks:
    <<: *tracks_variables
  tracks_loose:
    <<: *tracks_variables
......@@ -227,6 +227,7 @@ def LoadTrksFromFile(
    filepath: str,
    class_labels: list,
    nJets: int,
    tracks_name: str = "tracks",
    cut_vars_dict: dict = None,
    print_logger: bool = True,
    chunk_size: int = 1e6,
......@@ -242,6 +243,8 @@ def LoadTrksFromFile(
        List of class labels which are used.
    nJets : int
        Number of jets to load.
    tracks_name : str
        Name of the tracks collection to load.
    cut_vars_dict : dict
        Variable cuts that are applied when loading the jets.
    print_logger : bool
......@@ -393,7 +396,7 @@ def LoadTrksFromFile(
        # Load tracks and delete unused classes
        trks = np.delete(
            arr=np.asarray(
                h5py.File(file, "r")["/tracks"][
                h5py.File(file, "r")[f"/{tracks_name}"][
                    infile_counter * chunk_size : (infile_counter + 1) * chunk_size
                ]
            ),
......
......@@ -114,6 +114,7 @@ def EvaluateModel(
    class_labels = train_config.NN_structure["class_labels"]
    main_class = train_config.NN_structure["main_class"]
    frac_values_comp = Eval_params["frac_values_comp"]
    tracks_name = train_config.tracks_name
    var_cuts = (
        Eval_params["variable_cuts"][f"{data_set_name}"]
        if "variable_cuts" in Eval_params and Eval_params["variable_cuts"] is not None
......@@ -175,6 +176,7 @@ def EvaluateModel(
        var_dict=train_config.var_dict,
        preprocess_config=preprocess_config,
        class_labels=class_labels,
        tracks_name=tracks_name,
        nJets=nJets,
        exclude=exclude,
        cut_vars_dict=var_cuts,
......@@ -336,6 +338,7 @@ def EvaluateModelDips(
    class_labels = train_config.NN_structure["class_labels"]
    main_class = train_config.NN_structure["main_class"]
    frac_values_comp = Eval_params["frac_values_comp"]
    tracks_name = train_config.tracks_name
    var_cuts = (
        Eval_params["variable_cuts"][f"{data_set_name}"]
        if "variable_cuts" in Eval_params and Eval_params["variable_cuts"] is not None
......@@ -372,6 +375,7 @@ def EvaluateModelDips(
        var_dict=train_config.var_dict,
        preprocess_config=preprocess_config,
        class_labels=class_labels,
        tracks_name=tracks_name,
        nJets=nJets,
        cut_vars_dict=var_cuts,
        jet_variables=[
......@@ -391,6 +395,7 @@ def EvaluateModelDips(
        var_dict=train_config.var_dict,
        preprocess_config=preprocess_config,
        class_labels=class_labels,
        tracks_name=tracks_name,
        nJets=nJets,
        cut_vars_dict=var_cuts,
    )
......
......@@ -8,13 +8,13 @@ from glob import glob
import matplotlib as mtp
import matplotlib.pyplot as plt
import numpy as np
import yaml
from matplotlib import gridspec
import umami.data_tools as udt
from umami.configuration import global_config, logger
from umami.helper_tools import hist_ratio, hist_w_unc
from umami.tools import applyATLASstyle, makeATLAStag, natural_keys, yaml_loader
from umami.preprocessing_tools import GetVariableDict
from umami.tools import applyATLASstyle, makeATLAStag, natural_keys
def check_kwargs_var_plots(kwargs: dict, **custom_default):
......@@ -460,6 +460,7 @@ def plot_input_vars_trks_comparison(
    output_directory: str = "input_vars_trks",
    Ratio_Cut: list = None,
    track_origin: str = "All",
    tracks_name: str = "tracks",
    **kwargs,
):
    """
......@@ -489,6 +490,8 @@ def plot_input_vars_trks_comparison(
        List of y-axis cuts for the ratio block.
    track_origin: str
        Track set that is to be used for plotting.
    tracks_name : str
        Track collection to use, default is 'tracks'.
    **kwargs: dict
        - plot_type : str
            Plot type, like pdf or png
......@@ -593,16 +596,19 @@ def plot_input_vars_trks_comparison(
        flavour_label_dict.update({label: flavour_labels})
    # Load var dict
    with open(var_dict, "r") as conf:
        variable_config = yaml.load(conf, Loader=yaml_loader)
    variable_config = GetVariableDict(var_dict)
    # Loading track variables
    try:
        trksVars = variable_config["tracks"]
    except KeyError:
        noNormVars = variable_config["track_train_variables"]["noNormVars"]