Allow merging in Categorization via TChain

MR Description

This MR introduces the possibility of merging pNTuples already at the Categorization step.

The merging is controlled by the categorization_merging parameter, which can be specified for each Dataset. It works like the existing merging parameter and must be defined as a dictionary with one entry per category.

For example:

Dataset("TTto2L2Nu",
        dataset=[
            "/TTto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM",
            "/TTto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5_ext1-v2/NANOAODSIM"
        ],
        process=self.processes.get("TTto2L2Nu"),
        categorization_merging={"base": 20, "baseline": 20, "baseline_bResolved": 20, "baseline_bBoosted": 20, "mutau": 20, "etau": 20, "tautau": 20,
                                "resolved_1b": 20, "resolved_2b": 20, "boosted": 20, "vbf_loose": 20, "vbf_tight": 20, "vbf": 20, "ttbar_invertedMassCR": 20},
        merging={"base": 10, "baseline": 10, "baseline_bResolved": 10, "baseline_bBoosted": 10, "mutau": 10, "etau": 10, "tautau": 10,
                 "resolved_1b": 10, "resolved_2b": 10, "boosted": 10, "vbf_loose": 10, "vbf_tight": 10, "vbf": 10, "ttbar_invertedMassCR": 10},
        xs=98.036113,
        tags=["NanoAODv12"]),

When the Categorization step runs, the categorization_merging value defines how many output files are produced for the given category. If the dataset has a total of N files (hence PreprocessRDF produces N pNTuples) and the value is 20, as in the example above, then each Categorization output file contains about N // 20 pNTuples, give or take the remainder of the division. This is achieved by running Categorization on a TChain of pNTuples instead of on each pNTuple one by one.
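As an illustration of the arithmetic only, here is a minimal sketch of how N pNTuples could be grouped into the requested number of output files. The helper name split_into_groups, the file names, and the choice to give the remainder to the first groups are assumptions made for this example, not necessarily what the MR implements. Each resulting group would be the list of files added to one TChain.

```python
def split_into_groups(files, n_groups):
    """Split `files` into at most `n_groups` contiguous, near-equal groups.

    Hypothetical helper: the remainder of len(files) / n_groups is spread
    over the first groups, one extra file each. The real code may handle
    the remainder differently.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be >= 1")
    # Never create empty groups when there are fewer files than groups.
    n_groups = min(n_groups, len(files))
    if n_groups == 0:
        return []
    base, extra = divmod(len(files), n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)
        groups.append(files[start:start + size])
        start += size
    return groups


# Example: 45 pNTuples with categorization_merging = 20 for this category.
pntuples = [f"data_{i}.root" for i in range(45)]  # hypothetical file names
groups = split_into_groups(pntuples, 20)
sizes = [len(g) for g in groups]
```

With 45 inputs and 20 groups, 45 // 20 = 2 and the remainder is 5, so in this sketch five groups hold 3 pNTuples and the other fifteen hold 2.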

The original implementation was done by @tcuisset; I merely looked at what he did and ported it to the current version of NBA. Thanks Theo!

MR Validation

The MR has been validated by running the commands in the script tmp.sh, by running several tests that also set the categorization_max_events parameter for the dataset used, and by running with the --skip-preprocess and --skip-processing options.
