Implement an analysis manager class
Running an analysis consists of specifying:
- a list of samples
- a list of systematics to consider (which may vary by sample)
- a tag sequence
- details of the output files: variables of interest (`m_gg`, `m_jj`, etc.) and output format (e.g. `parquet`)
- job submission details
I think it would be nice to have an `Analysis` (or similarly named) class, which is constructed from the 5 inputs above and runs an entire analysis through a single `Analysis.run()` call.
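As a rough sketch (all names here are placeholders, mirroring the example below), the class might do little more than store these inputs and expose a single entry point:

```python
class Analysis:
    """Top-level manager, constructed from the 5 inputs listed above."""
    def __init__(self, samples, systematics, tag_sequence,
                 variables_of_interest, output_format, jobs_manager):
        self.samples = samples
        self.systematics = systematics
        self.tag_sequence = tag_sequence
        self.variables_of_interest = variables_of_interest
        self.output_format = output_format
        self.jobs_manager = jobs_manager

    def run(self):
        """Run the entire analysis; see the step-by-step description below."""
        raise NotImplementedError
```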
For example, we might have:
```python
sample_list = ["ttH_M125", "tHq_M125", "Data"]
years = ["2016", "2017", "2018"]

samples = SamplesManager(
    sample_list = sample_list,
    years = years
)

syst_options = {
    "weights" : {
        "dummy_theory_sf" : {
            ...
        }
    },
    "independent_collections" : {
        ...
    }
}

tag_sequence = TagSequence(
    tag_list = [
        diphoton_tagger,
        [tth_tagger, thq_tagger]
    ]
)

jobs_manager = JobsManager(
    batch = "local",
    n_events_per_output = 10**6
)

analysis = Analysis(
    samples = samples,
    systematics = syst_options,
    tag_sequence = tag_sequence,
    variables_of_interest = ["m_gg", "m_jj"],
    output_format = "parquet",
    jobs_manager = jobs_manager
)

analysis.run()
```
where `analysis.run()` would do the following:
- Go through each `Sample` in `samples` and:
    - construct the function to run the systematics + tag sequence (we may e.g. have different systematics for different samples)
    - add jobs to the `jobs_manager` for each `Sample`, taking into account the specific function for this sample
- Submit jobs through the `JobsManager`.
- Monitor jobs and record their metadata. At a very basic level, this would simply be checking whether the job succeeded. If a job succeeds, it would also be useful to record physics information about it: how many events were processed, the efficiency of each `Tagger`'s selections (and perhaps the efficiency of each cut of each `Selection` of each `Tagger`), and summary information about the systematics (e.g. the mean/std dev of each systematic).
- Post-process: once a large enough fraction of jobs has finished (100% is needed for data, but not strictly necessary for MC), merge the outputs and update `scale1fb` according to the number of events actually processed for each sample (since `scale1fb` is inversely proportional to the number of events, partial processing scales it up by the inverse of the processed fraction).
- Summarize: print out summary info and write a `json` file with high-level info. This would entail properly merging the metadata returned by each job (see the sketch after this list).
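To make the control flow concrete, here is a rough sketch of how `run()` might chain these steps together, continuing the skeleton above. Everything it calls (`make_process_function`, `add_jobs`, `submit`, `monitor`, `merge_outputs`, `completed_fraction`, `merge_job_metadata`) is a hypothetical placeholder, not an existing API:

```python
import json

class Analysis:
    # ... constructor as sketched above ...

    def run(self):
        # (1) Build a per-sample processing function: the systematics and
        #     tag sequence to apply may differ from sample to sample
        for sample in self.samples:
            process_fn = self.make_process_function(sample)  # hypothetical helper
            self.jobs_manager.add_jobs(sample, process_fn)   # hypothetical API

        # (2) Submit all jobs through the JobsManager
        self.jobs_manager.submit()                           # hypothetical API

        # (3) Monitor jobs and record metadata: success/failure, number of
        #     events processed, per-Tagger selection efficiencies, and
        #     mean/std dev of each systematic
        job_metadata = self.jobs_manager.monitor()           # hypothetical API

        # (4) Post-process: merge outputs once enough jobs have finished
        #     (100% required for data, a large fraction is enough for MC)
        for sample in self.samples:
            frac = completed_fraction(job_metadata, sample)  # hypothetical helper
            if sample.is_data and frac < 1.0:
                continue
            self.jobs_manager.merge_outputs(sample, self.output_format)
            # scale1fb is inversely proportional to the number of events,
            # so partial processing scales it up by 1 / (processed fraction)
            if not sample.is_data:
                sample.scale1fb /= frac

        # (5) Summarize: merge the per-job metadata and write high-level info
        summary = merge_job_metadata(job_metadata)           # hypothetical helper
        print(summary)
        with open("summary.json", "w") as f_out:
            json.dump(summary, f_out, indent = 4, sort_keys = True)
```

One open design question here is how much of steps (3)-(5) should live in `Analysis` versus `JobsManager`; the sketch arbitrarily puts job monitoring and output merging on the `JobsManager` side.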