Refactoring for incremental runs

Pieter David requested to merge piedavid/bamboo:incremental_1 into master

Not finished yet, but getting there; pushing now so that we can discuss something that is (close to) ready for testing. What remains is the integration / user-interface part.

Update on merging: I propose to test and review just the refactoring here, and to implement the actual incremental running in a new PR. A preview of that can be seen here.

A summary of the lower-level changes in this PR:

  • comparing histograms needs a few ingredients: the subsample (the original sample and the Filters applied to it), the axis variables, the weight, and the binning. bamboo internally uses the hashes of expressions for a number of optimisations, but Python's standard hash method uses randomisation as a security feature, so the values change between runs. The first set of changes therefore adds a more generic way to calculate expression hashes that are stable between runs (based on the blake2 algorithm). The only other difference is that the hashes used for equality checks in bamboo need to be aware of systematics (so we do not treat a variable with systematic variations as identical to one without), whereas for histogram comparison the nominal result does not depend on the presence of variations.
  • for the "(sub)sample hash" the hashes of the cut expressions can be used (the weight is treated as another axis variable); for the sample itself the name is used.
  • [descoped] the hash for each (output file, histogram) pair is stored in a small database file in the results directory. If the --previous-output option is set and the hash is found and equal, the histogram will be skipped.
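The run-stable hashing described above can be sketched as follows. This is an illustrative minimal version, not bamboo's actual implementation: the function name and the way histogram ingredients are combined are assumptions, but it shows the key point that `hashlib.blake2b` gives the same digest in every process, unlike the built-in `hash()`.

```python
import hashlib

def stable_hash(*parts: str) -> str:
    """Deterministic digest over expression components.

    Python's built-in hash() is randomised per process (PYTHONHASHSEED),
    so it cannot be compared between runs; blake2b can.
    """
    h = hashlib.blake2b(digest_size=16)
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator, so ("ab", "c") != ("a", "bc")
    return h.hexdigest()

# A histogram's identity could then combine the subsample cuts, axis
# variables, weight expression, and binning (all names illustrative):
hist_key = stable_hash(
    "sampleName", "cut1 && cut2", "axisVar", "weightExpr", "40,0.,200.")
```

Two such digests computed in different runs can then be compared directly to decide whether a histogram needs to be reproduced.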

The biggest change is that, to be efficient at producing only some histograms, there must be a way to delay construction of the RDF graph until the list of all histograms that could be produced has been built, and it has been checked which of them are needed. The existing backends did not allow for this (the lazy backend kept the list of plots, but not the list of all histograms), so this needed a bit of refactoring. A side effect is that implementing a lazy backend (which builds the RDF graph all at once, after getting the results from definePlots) is much more natural (and better supported), so that is now the default. The old behaviour (constructing the necessary RDF nodes with each plot) is easy to recover, so that is now the "debug" backend (the two were already much closer than when the lazy backend was first added). Good to know: the backend keeps a dictionary of "products" for each plot, which are Python handles, but getResults must be called to get the actual ROOT objects (as before, but with one more layer of indirection internally).
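The products/getResults split above can be sketched as a toy backend. All names here are hypothetical (this is not bamboo's interface): the point is that declaring a plot only records a lightweight handle, the full graph is built once every histogram is known, and results are only materialised on request.

```python
from dataclasses import dataclass

@dataclass
class Plot:
    """Minimal stand-in for a plot definition."""
    name: str

class LazyBackend:
    """Toy illustration of the lazy-backend pattern.

    declarePlot() stores a handle without building any RDF nodes;
    buildGraph() constructs everything in one pass; getResults()
    triggers that construction lazily and returns the products.
    """

    def __init__(self):
        self.products = {}   # plot name -> list of product handles
        self._built = False

    def declarePlot(self, plot):
        # record a handle; no graph nodes are constructed yet
        self.products[plot.name] = [plot]

    def buildGraph(self):
        # construct all nodes at once, now that every histogram
        # that could be produced is known (and can be filtered)
        self._built = True

    def getResults(self, plot):
        # materialise results on demand
        if not self._built:
            self.buildGraph()
        return list(self.products[plot.name])
```

In this shape, a skip list from a previous run can be applied between declaration and graph construction, which is what makes incremental running possible.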

This all sounds like a larger change than it turns out to be at the code level, but the consequences are significant: it should make it possible to construct the plot (and histogram) lists for all samples and compile only the minimal number of worker backends, use distributed RDF, or pass everything to some other query system.

[descoped] Open items:

  • detect changes in sample definition (for the same name)
  • plotIt will fail if some histograms are missing, so there must be a way to reduce the plot list (if all histograms for a plot are there) and/or copy some histograms from the previous run (if only some of the histograms for a plot, e.g. for specific samples or variations, changed)
  • the compiled backend does not work with data-driven/late-splitting selections because it uses a single output file (not caused by this PR, but I noticed it here, and it would be good to fix)
Edited by Pieter David
