Implement an analysis manager class

Running an analysis consists of specifying:

  1. a list of samples
  2. a list of systematics to consider (which may vary by sample)
  3. a tag sequence
  4. details of the output file: variables of interest (m_gg, m_jj, etc.) and the output format (e.g. parquet)
  5. job submission details

I think it would be nice to have an Analysis (or similarly named) class, constructed from the five inputs above, which provides an Analysis.run() method that runs the entire analysis.

For example, we might have:

sample_list = ["ttH_M125", "tHq_M125", "Data"]
years = ["2016", "2017", "2018"]
samples = SamplesManager(
    sample_list = sample_list,
    years = years
)

syst_options = {
    "weights" : {
        "dummy_theory_sf" : {
            ...
        }
    },
    "independent_collections" : {
        ...
    }
}

tag_sequence = TagSequence(
    tag_list = [
        diphoton_tagger,
        [tth_tagger, thq_tagger]
    ]
)

jobs_manager = JobsManager(
    batch = "local",
    n_events_per_output = 10**6
)

analysis = Analysis(
    samples = samples,
    systematics = syst_options,
    tag_sequence = tag_sequence,
    variables_of_interest = ["m_gg", "m_jj"],
    output_format = "parquet",
    jobs_manager = jobs_manager
)

analysis.run()

where analysis.run() would do the following:

  1. Go through each Sample in samples and
    • construct the function that runs the systematics + tag sequence (e.g. we may have different systematics for different samples)
    • add jobs to the jobs_manager for each Sample, taking into account the specific function for this sample
  2. Submit the jobs through the JobsManager
  3. Monitor the jobs and record their metadata. At the most basic level, this means checking whether each job succeeded. For jobs that succeed, it would also be useful to record physics information: how many events were processed, the efficiency of each Tagger's selections (and perhaps the efficiency of each cut of each Selection of each Tagger), and summary information about the systematics, e.g. the mean/std dev of each systematic variation. (Steps 1-3 are sketched in the first code example after this list.)
  4. Post-process: once a large enough fraction of jobs has finished (100% is required for data, but not strictly necessary for MC), merge the outputs and update scale1fb according to the number of events actually processed for each sample.
  5. Summarize: print summary info and write a JSON file with high-level info. This entails properly merging the metadata returned by each job. (A possible shape for this merging is sketched in the second example below.)
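
As a rough illustration of steps 1-3, here is a minimal sketch of what the Analysis constructor and run() could look like. All of the method names on SamplesManager, TagSequence and JobsManager (get_samples, add_job, submit, monitor, TagSequence.run), as well as the select_systematics and apply_systematics helpers, are assumptions made for the purpose of the sketch, not an existing API:

class Analysis:
    def __init__(self, samples, systematics, tag_sequence,
                 variables_of_interest, output_format, jobs_manager):
        self.samples = samples
        self.systematics = systematics
        self.tag_sequence = tag_sequence
        self.variables_of_interest = variables_of_interest
        self.output_format = output_format
        self.jobs_manager = jobs_manager
        self.metadata = {}

    def run(self):
        # 1. Build a sample-specific processing function and create jobs for it
        for sample in self.samples.get_samples():               # assumed SamplesManager API
            process_fn = self.make_process_function(sample)
            self.jobs_manager.add_job(                           # assumed JobsManager API
                sample = sample,
                function = process_fn
            )

        # 2. Submit all jobs
        self.jobs_manager.submit()

        # 3. Monitor jobs and collect per-job metadata
        #    (success/failure, n_events, tagger efficiencies, systematics summaries, ...)
        self.metadata = self.jobs_manager.monitor()

        # 4.-5. Post-processing and summarizing would follow here;
        #       the metadata merging of step 5 is sketched separately below.

    def make_process_function(self, sample):
        # Only the systematics relevant for this sample are applied
        systs = select_systematics(self.systematics, sample)     # hypothetical helper
        def process(events):
            events = apply_systematics(events, systs)            # hypothetical helper
            events, tag_results = self.tag_sequence.run(events)  # assumed TagSequence API
            return events[self.variables_of_interest], tag_results
        return process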
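
And a possible shape for the metadata merging in step 5, assuming each job returns a dict with fields like "success", "n_events" and "tagger_efficiencies" (the field names and the event-weighted averaging are placeholders for illustration):

import json

def merge_job_metadata(job_metadata):
    """Merge the per-job metadata dicts into one high-level summary."""
    summary = {
        "n_jobs" : len(job_metadata),
        "n_jobs_succeeded" : sum(1 for md in job_metadata if md["success"]),
        "n_events" : sum(md["n_events"] for md in job_metadata if md["success"]),
        "tagger_efficiencies" : {}
    }
    # Combine per-tagger efficiencies as an event-weighted average over successful jobs
    for md in job_metadata:
        if not md["success"]:
            continue
        for tagger, eff in md["tagger_efficiencies"].items():
            entry = summary["tagger_efficiencies"].setdefault(tagger, {"n_pass" : 0, "n_total" : 0})
            entry["n_pass"] += eff * md["n_events"]
            entry["n_total"] += md["n_events"]
    for tagger, entry in summary["tagger_efficiencies"].items():
        entry["efficiency"] = entry["n_pass"] / entry["n_total"] if entry["n_total"] else 0.0
    return summary

def write_summary(job_metadata, output_path = "summary.json"):
    # Print the merged summary and write it as a JSON file with high-level info
    summary = merge_job_metadata(job_metadata)
    print(summary)
    with open(output_path, "w") as f_out:
        json.dump(summary, f_out, indent = 4, sort_keys = True)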