Implement tool for managing samples

To run at full scale, we will need tools for running over many different samples.

From the user's point of view, this would ideally be as simple as specifying a list of sample names as strings, e.g.

sample_list = ["ttH_M125", "tHq_M125", "Data"]

and then giving this to a class SampleManager that creates a Sample instance for each:

from higgs_dna.samples.sample_manager import SampleManager
sample_manager = SampleManager(
    samples = sample_list,
    years = ["2016", "2017", "2018"]
)
samples = sample_manager.produce()

where samples is a list of Sample objects, each of which would contain:

Sample.files[year] # list of nanoAOD files for specified year
Sample.xs
Sample.scale1fb[year]
...

A Sample object should also be able to specify any sample-specific systematics, reweightings, etc. that should be applied to it.
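
As a rough illustration of the attributes listed above, a Sample could be a small dataclass. This is only a sketch: the field names follow the attributes mentioned in this issue, but the types, the defaults, and the `name` field are assumptions, and the `xs` value in the usage line is a placeholder rather than a real cross section.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Sample:
    """Hypothetical container for one sample (layout is an assumption)."""

    name: str
    # Cross section in pb; None is used here for samples without one (e.g. Data).
    xs: Optional[float] = None
    # Year -> list of nanoAOD files for that year.
    files: Dict[str, List[str]] = field(default_factory=dict)
    # Year -> scale1fb value for that year.
    scale1fb: Dict[str, float] = field(default_factory=dict)
    # Sample-specific systematics, keyed by systematic name.
    systematics: Dict[str, Dict[str, Any]] = field(default_factory=dict)


# Placeholder values, for illustration only.
tth = Sample(
    name="ttH_M125",
    xs=1.0,
    files={"2017": ["file1.root", "file2.root"]},
    systematics={"tth_specific_theory_unc": {"type": "event", "method": "from_branch"}},
)
```

Using `default_factory` gives each Sample its own dictionaries, so modifying one sample's file list cannot affect another's.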

In practice, we could deal with this by creating a json file with metadata about each sample, similar to the [cross sections json in flashgg](https://github.com/cms-analysis/flashgg/blob/dev_legacy_runII/MetaData/data/cross_sections.json).

For a given sample, we might have an entry like:

    "ttH_M125" : {
        "xs" : XX, # pb
        "files" : {
            "2016" : ["file1.root", "file2.root", ...], # could be hard-coded
            "2017" : "/ttH_M125/UL2017_production/NANOAODSIM" # or could provide DAS name and have a tool to look up file names
        },
        "systematics" : { # same construction as for any systematic
            "tth_specific_theory_unc" : {
                "type" : "event",
                "method" : "from_branch",
                ...
            }
        }
    }
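
A minimal sketch of how such an entry might be read and turned into per-year file lists is given below. Everything here is an assumption made for illustration: the `produce` and `resolve_files` helpers, the use of plain dicts for the result, and the `fake_das_lookup` stub, which stands in for a real DAS query tool and simply invents placeholder file names. The `xs` value is also a placeholder, and the comments from the entry above are dropped because strict JSON does not allow them.

```python
import json
from typing import Any, Callable, Dict, List, Union

# Strict-JSON version of the entry above (placeholder xs, no comments).
METADATA_JSON = """
{
    "ttH_M125": {
        "xs": 1.0,
        "files": {
            "2016": ["file1.root", "file2.root"],
            "2017": "/ttH_M125/UL2017_production/NANOAODSIM"
        },
        "systematics": {
            "tth_specific_theory_unc": {"type": "event", "method": "from_branch"}
        }
    }
}
"""


def resolve_files(entry: Union[str, List[str]],
                  das_lookup: Callable[[str], List[str]]) -> List[str]:
    """Return a file list: a string is treated as a DAS name, a list is used as-is."""
    if isinstance(entry, str):
        return das_lookup(entry)
    return list(entry)


def produce(sample_names: List[str], years: List[str],
            metadata: Dict[str, Any],
            das_lookup: Callable[[str], List[str]]) -> Dict[str, Dict[str, Any]]:
    """Build one record per requested sample, keeping only years with files."""
    samples = {}
    for name in sample_names:
        if name not in metadata:
            raise KeyError(f"No metadata entry for sample '{name}'")
        info = metadata[name]
        files = {
            year: resolve_files(info["files"][year], das_lookup)
            for year in years
            if year in info.get("files", {})
        }
        samples[name] = {
            "xs": info.get("xs"),
            "files": files,
            "systematics": info.get("systematics", {}),
        }
    return samples


def fake_das_lookup(dataset: str) -> List[str]:
    """Stub standing in for a real DAS lookup; returns made-up file names."""
    return [f"{dataset}/placeholder_{i}.root" for i in range(2)]


metadata = json.loads(METADATA_JSON)
samples = produce(["ttH_M125"], ["2016", "2017", "2018"], metadata, fake_das_lookup)
```

Passing the lookup in as a callable keeps the metadata parsing testable without network access. Years that are requested but missing from the metadata (2018 here) are skipped silently in this sketch; raising an error instead may be preferable.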

Then, the list of Sample objects can be given to a JobsManager or similar, which will take care of setting up the sample-specific options (e.g. adding a sample-specific theory weight/unc to the SystematicsProducer) and splitting these up into jobs.
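
The job-splitting step could look roughly like the sketch below. The `split_into_jobs` function, the `files_per_job` parameter, and the dict layout of a job are all hypothetical; the real JobsManager may split on something other than file count.

```python
from typing import Any, Dict, List


def split_into_jobs(sample_name: str,
                    files_by_year: Dict[str, List[str]],
                    systematics: Dict[str, Any],
                    files_per_job: int) -> List[Dict[str, Any]]:
    """Split one sample's files into jobs of at most files_per_job files each.

    Files from different years are never mixed, and every job carries the
    sample-specific systematics so they can be configured per job.
    """
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    jobs = []
    for year, files in files_by_year.items():
        for start in range(0, len(files), files_per_job):
            jobs.append({
                "sample": sample_name,
                "year": year,
                "files": files[start:start + files_per_job],
                "systematics": systematics,
            })
    return jobs


# Illustrative input: five files in 2016 and one in 2017.
jobs = split_into_jobs(
    "ttH_M125",
    {"2016": [f"f{i}.root" for i in range(1, 6)], "2017": ["g1.root"]},
    {"tth_specific_theory_unc": {"type": "event", "method": "from_branch"}},
    files_per_job=2,
)
```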

Edited Jun 14, 2021 by Samuel May