Add categorized and late-splitting (datadriven-like) selection for subprocesses
This has long been on the to-look-into list, but I got back to it after Wednesday's discussion: efficiently splitting samples into "subprocesses" (discussed a bit in #2).
Conceptually, the data-driven background mechanism works well for this (one sample -> multiple sets of histograms, each for a different selection that is a variation of the main one). However, it would create N copies of basically the same RDF graph, differing only by a few Filter nodes right at the beginning, which is not great performance-wise.
The solution I thought of is a `Selection`-like object, `LateSplittingSelection`, that is similar to `SelectionWithDataDriven` (both now inherit from the same base class, so that the supporting code elsewhere can be shared), but:
- it only creates the per-subprocess `Selection`s when plots are attached (the idea is that this is cheap, so doing it a few times for the selections that use it is better than doing it everywhere from the start);
- it keeps the "inclusive" selection plots as well (this could be removed, but it may actually help performance because of sharing between the different sub-selections).
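To make the "late" part concrete, here is a toy model in plain Python. It is only a sketch of the idea, not the code in this PR: `ToySelection`, `ToyLateSplitting`, and `attach_plot` are hypothetical names, and cuts are plain strings instead of bamboo expressions. The point is that refining a late-splitting selection does not duplicate anything; the per-subprocess selections only appear once a plot is attached.

```python
class ToySelection:
    """Stand-in for a selection: a name plus the cuts applied so far."""

    def __init__(self, name, cuts=()):
        self.name = name
        self.cuts = tuple(cuts)

    def refine(self, name, cut):
        return ToySelection(name, self.cuts + (cut,))


class ToyLateSplitting(ToySelection):
    """Toy late-splitting selection: sub-selections are built lazily."""

    def __init__(self, name, cuts, split_cuts):
        super().__init__(name, cuts)
        self.split_cuts = dict(split_cuts)
        self._subs = None  # nothing is created until a plot is attached
        self.plots = {}    # selection name -> names of attached plots

    def refine(self, name, cut):
        # A refinement is still late-splitting, so the graph is not duplicated here.
        return ToyLateSplitting(name, self.cuts + (cut,), self.split_cuts)

    def attach_plot(self, plot_name):
        if self._subs is None:
            # Split only now: one extra cut per subprocess on top of this selection.
            self._subs = {
                key: ToySelection(f"{self.name}{key}", self.cuts + (cut,))
                for key, cut in self.split_cuts.items()
            }
        # The inclusive plot is kept, and one copy is added per subprocess.
        self.plots.setdefault(self.name, []).append(plot_name)
        for sub in self._subs.values():
            self.plots.setdefault(sub.name, []).append(plot_name)


base = ToyLateSplitting("noSel", [], {"2El": "nGenEl == 2", "no2El": "nGenEl != 2"})
two_lep = base.refine("twoLep", "nLep == 2")
two_lep.attach_plot("mll")
print(sorted(two_lep.plots))
```

In this toy, `base` never gets sub-selections because no plot was attached to it; only `two_lep` pays the cost of splitting.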
Constructing such a selection is done with a dictionary of cuts (weights could be added if needed, but are not there yet) that define the subprocesses:
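    from bamboo import treefunctions as op

    if isMC and splitSample:  # placeholders: only split the MC samples that need it
        # generator-level electrons from the hard process (statusFlags bit 7)
        genHardElectrons = op.select(t.GenPart, lambda gp: op.AND((gp.statusFlags & (0x1 << 7)), op.abs(gp.pdgId) == 11))
        noSel = LateSplittingSelection.create(noSel, "splitByGenEl", splitCuts={
            "2El": op.rng_len(genHardElectrons) == 2,
            "no2El": op.rng_len(genHardElectrons) != 2,
        })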
The histograms are written to `{sampleName}{suffix}.root` (this could be changed, but seems generic enough).
For the configuration, I think it's mostly a matter of adding a few renamed clones to the samples list (this could be a few lines in the `postprocess` method). We could add another plotter base module that does this, or put it in a recipe in the documentation... no strong opinion from my side.
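As a rough illustration of what "adding a few renamed clones" could look like, here is a sketch on a plain dictionary. Everything in it is an assumption made for the example: the helper name `add_split_samples`, the shape of the samples dictionary, the `group` key, and the choice to drop the inclusive entry in favour of its clones. It also assumes the suffix in `{sampleName}{suffix}.root` is the key used in `splitCuts`.

```python
import copy


def add_split_samples(samples, split_suffixes):
    """Return a new samples dict with one renamed clone per (sample, suffix).

    samples:        sample name -> configuration dict (hypothetical layout)
    split_suffixes: sample name -> list of suffixes, e.g. the splitCuts keys
    """
    out = {}
    for name, cfg in samples.items():
        suffixes = split_suffixes.get(name)
        if not suffixes:
            out[name] = cfg  # not split: keep as-is
            continue
        for suffix in suffixes:
            clone = copy.deepcopy(cfg)
            # Put each subprocess in its own group so it can be styled separately.
            clone["group"] = f"{cfg.get('group', name)}{suffix}"
            out[f"{name}{suffix}"] = clone
    return out


samples = {
    "DY": {"group": "DY", "cross-section": 1.0},
    "ttbar": {"group": "tt", "cross-section": 2.0},
}
split = add_split_samples(samples, {"DY": ["2El", "no2El"]})
# File names that would be looked up under the naming scheme above
files = [f"{name}.root" for name in split]
print(files)
```

Whether the inclusive sample should stay in the list alongside its clones depends on how the plots are stacked; the sketch drops it to avoid double counting.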
Also in this PR: a fix for data-driven backgrounds and subprocesses with the lazy backend (the histograms were simply not being produced), and a selection-like object that makes a group of similar selections behave as much as possible like a single one (each category has its own selection and a "candidate", e.g. an electron in one case and a muon in another; I've found this useful for ttW).
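The categorized selection can be pictured with a similar toy model. Again, this is a hypothetical sketch rather than the interface added here: `ToyCategorized`, `refine`, and `plot_names` are made-up names, and cuts and candidates are plain strings. It shows how a single candidate-dependent cut is applied to all categories in one call.

```python
class ToyCategorized:
    """Toy group of selections, each with its own cuts and a 'candidate'."""

    def __init__(self, categories):
        # category name -> (list of cuts, candidate label)
        self.categories = dict(categories)

    def refine(self, cut_for):
        """Apply a cut, written once in terms of the candidate, to every category."""
        return ToyCategorized({
            name: (cuts + [cut_for(cand)], cand)
            for name, (cuts, cand) in self.categories.items()
        })

    def plot_names(self, base):
        """One plot per category, named after the category."""
        return [f"{base}_{name}" for name in self.categories]


one_lepton = ToyCategorized({
    "el": (["nEl == 1"], "el[0]"),
    "mu": (["nMu == 1"], "mu[0]"),
})
tight = one_lepton.refine(lambda cand: f"{cand}.pt > 30")
print(tight.categories["el"])
print(tight.plot_names("lepPt"))
```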
Still to be done: documentation, and improved test coverage to avoid regressions like the one in https://gitlab.cern.ch/cp3-cms/bamboo/-/merge_requests/188 (which I noticed while testing this) happening again. For now, I split the DY M(e+e-) into truth e+e- and the rest (which should have some tau+tau-), but maybe binning by jet multiplicity would be better.
Not tested: combining selections with data-driven components and late splitting. I don't immediately see use cases for this, but we could add a warning to make clear that it is not supported (although the main thing is probably the configuration; getting the histogram filling part to work is probably not too hard).