Add categorized and late-splitting (datadriven-like) selection for subprocesses (!189)

Pieter David requested to merge piedavid/bamboo:augmentedselections into master May 09, 2021

This has long been on the to-look-into list, but I got back to it after Wednesday's discussion: splitting samples into "subprocesses" efficiently (discussed a bit in #2).

Conceptually the data-driven background mechanism works well for this (one sample -> multiple sets of histograms, each for a different selection that is a variation of the main one), but it would create N copies of basically the same RDF graph, identical up to a few Filter nodes right at the beginning, which is not great performance-wise.

The solution I thought of is a Selection-like object LateSplittingSelection that is similar to SelectionWithDataDriven (now both inheriting from the same base class, such that code elsewhere to support this can be shared), but:

- it only creates the per-subprocess Selections when plots are attached (the idea is that this is cheap, so doing it a few times, for the selections that actually have plots, is better than splitting everything from the start)
- it keeps the "inclusive" selection plots as well (this could be removed, but it may actually help performance because of sharing between the different sub-selections).
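The two points above can be illustrated with a toy model. This is only a sketch of the idea, not the code in this MR: `ToySelection`, `ToyLateSplitting`, `register_plot` and the string "cuts" are made-up stand-ins, and nothing here touches RDataFrame. It shows that refining carries the split along without applying it, and that the per-subprocess selections only appear once a plot is attached.

```python
class ToySelection:
    """Stand-in for a selection: a name, the cuts applied so far, and its plots."""

    def __init__(self, name, cuts=()):
        self.name = name
        self.cuts = tuple(cuts)
        self.plots = []

    def refine(self, name, cut):
        return ToySelection(name, self.cuts + (cut,))

    def register_plot(self, plot_name):
        self.plots.append(plot_name)


class ToyLateSplitting(ToySelection):
    """Carries the split cuts along and applies them only when a plot is attached."""

    def __init__(self, name, cuts=(), split_cuts=None):
        super().__init__(name, cuts)
        self.split_cuts = dict(split_cuts or {})
        self.subs = None  # per-subprocess selections, created on demand

    def refine(self, name, cut):
        # The refined selection is still late-splitting; no sub-selection is built here.
        return ToyLateSplitting(name, self.cuts + (cut,), self.split_cuts)

    def register_plot(self, plot_name):
        super().register_plot(plot_name)  # the inclusive plot is kept
        if self.subs is None:
            # The split cut is appended last, after all shared cuts.
            self.subs = {
                suffix: ToySelection(self.name + suffix, self.cuts + (cut,))
                for suffix, cut in self.split_cuts.items()
            }
        for sub in self.subs.values():
            sub.register_plot(plot_name)


no_sel = ToyLateSplitting("noSel", split_cuts={"2El": "nGenEl == 2", "no2El": "nGenEl != 2"})
two_mu = no_sel.refine("twoMu", "nMu == 2")
two_mu.register_plot("mll")
```

In this toy, `no_sel` never builds sub-selections because no plot is attached to it, while `two_mu` ends up with one inclusive and two per-subprocess copies of the `mll` plot, all sharing the `nMu == 2` cut.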

Constructing such a selection is done with a dictionary of cuts (weights could be added if needed, but are not there yet) that define the subprocesses:

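# isMC and splitByGenEl are placeholder flags: only split the MC samples that need it
if isMC and splitByGenEl:
    # generator-level electrons from the hard process (statusFlags bit 7 is isHardProcess)
    genHardElectrons = op.select(t.GenPart, lambda gp : op.AND((gp.statusFlags & (0x1<<7)), op.abs(gp.pdgId) == 11))
    # one subprocess per entry of splitCuts
    noSel = LateSplittingSelection.create(noSel, "splitByGenEl", splitCuts={
        "2El": op.rng_len(genHardElectrons) == 2,
        "no2El": op.rng_len(genHardElectrons) != 2
        })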

The histograms are written to {sampleName}{suffix}.root (could be changed but seems generic enough). For the configuration I think it's mostly a matter of adding a few renamed clones to the samples list (could be a few lines in the postprocess method). We could add another plotter base module with this, or put them in a recipe in the documentation... no strong opinion from my side.
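A possible shape for the "renamed clones" step, as a sketch only: it assumes the samples are held in a dictionary mapping sample name to a configuration dictionary, and `add_split_samples`, its arguments and the example keys are hypothetical names, not part of bamboo. Each clone is named `{sampleName}{suffix}`, matching the `{sampleName}{suffix}.root` file name above.

```python
import copy


def add_split_samples(samples, to_split, suffixes, keep_inclusive=True):
    """Return a new samples dict with one renamed clone per (sample, suffix) pair.

    samples: dict of sample name -> configuration dict (assumed layout)
    to_split: names of the samples that were split into subprocesses
    suffixes: subprocess suffixes, e.g. the keys used for the split cuts
    keep_inclusive: also keep the original (inclusive) entry
    """
    out = {}
    for name, cfg in samples.items():
        if name not in to_split:
            out[name] = cfg
            continue
        if keep_inclusive:
            out[name] = cfg
        for suffix in suffixes:
            # Independent copy, so per-subprocess settings can be changed afterwards.
            out[f"{name}{suffix}"] = copy.deepcopy(cfg)
    return out


samples = {"DY": {"group": "DY"}, "ttbar": {"group": "tt"}}
split = add_split_samples(samples, {"DY"}, ["2El", "no2El"])
```

Since the subprocesses are subsets of the inclusive sample, stacking both the inclusive entry and its clones would count those events twice; `keep_inclusive=False` drops the inclusive entry for that case.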

Also in this PR: a fix for data-driven backgrounds and subprocesses with the lazy backend (the histograms were simply not being produced), and a selection-like object that makes a group of similar selections behave as much as possible like a single one (each category has its selection and a "candidate", e.g. an electron in one case and a muon in another; I've found this useful for ttW).
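To make the categorized idea concrete, here is a toy sketch under stated assumptions: `ToyCategorized`, its `refine` signature and the string cuts are invented for illustration and do not reflect the interface added in this MR. Each category holds its own cuts and a "candidate", and a single `refine` call applies a candidate-dependent cut in every category.

```python
class ToyCategorized:
    """Group of categories, each with its own cuts and a 'candidate' label."""

    def __init__(self, categories):
        # categories: {category name: (tuple of cuts, candidate label)}
        self.categories = dict(categories)

    def refine(self, cut_for):
        """Apply cut_for(candidate) in every category and return a new group."""
        return ToyCategorized({
            cat: (cuts + (cut_for(cand),), cand)
            for cat, (cuts, cand) in self.categories.items()
        })


one_lepton = ToyCategorized({
    "el": (("nEl == 1",), "el[0]"),
    "mu": (("nMu == 1",), "mu[0]"),
})
# One call, written once in terms of the candidate, refines both categories.
high_pt = one_lepton.refine(lambda cand: f"{cand}.pt > 30")
```

The analysis code is written once in terms of the candidate, and the group takes care of repeating it per category.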

Still to be done: documentation, and improved test coverage to keep regressions like the one in https://gitlab.cern.ch/cp3-cms/bamboo/-/merge_requests/188, which I noticed while testing this, from happening again. For now I split the DY M(e+e-) into truth e+e- and the rest (which should contain some tau+tau-), but binning by jet multiplicity may be better.

Not tested: combining selections with data-driven components and late-splitting... I don't immediately see use cases for this, but we could add a warning to make clear that it is not supported (the main work is probably in the configuration; getting the histogram filling part to work is probably not too hard).

Edited May 18, 2021 by Pieter David
