Sync of diphoton preselection, local and plain condor submission supported on lxplus and UCSD T2
This PR implements:
Diphoton preselection
Addressing #7 (see the `higgs_dna.taggers.diphoton_tagger` module). The diphoton preselection has been synced to the 0.01% level with respect to flashgg; see the presentation in the Hgg working meeting [1] for more details. After some discussion, it was decided that the default treatment of events with multiple diphoton candidates should follow what is done in flashgg and take the candidate with the highest `sum_pt`. This is done at the `higgs_dna.taggers.diphoton_tagger` level. Multiple candidates occur in <1% of events, but for analyses with electrons the fraction can be higher (from electrons which pass the electron veto), so this may be worth revisiting.
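For illustration, here is a minimal sketch (not the actual tagger code) of how the highest-`sum_pt` candidate can be picked per event with awkward arrays; the `sum_pt` field name and the toy values are assumptions:

```python
import awkward as ak

# Toy jagged array of diphoton candidates per event; "sum_pt" is a stand-in
# for the lead + sublead photon pt sum used in the real tagger.
diphotons = ak.Array([
    [{"sum_pt": 85.0, "mass": 122.1}, {"sum_pt": 110.3, "mass": 90.4}],  # two candidates
    [{"sum_pt": 64.2, "mass": 125.0}],                                   # one candidate
    [],                                                                  # no candidate
])

# Index of the highest-sum_pt candidate in each event (None for empty events)
best_idx = ak.argmax(diphotons.sum_pt, axis=1, keepdims=True)

# Keep exactly one candidate per event; events with no candidate become None
best = ak.firsts(diphotons[best_idx])
```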
Sample Management tool
Addressing #8 (see the `higgs_dna.samples.sample` and `higgs_dna.samples.samples_manager` modules), which allows samples to be specified through a `json` format in several different ways:
- explicitly (a list of hard-coded file paths)
- via a local directory (`SamplesManager` will then use `glob` to get all root files in the directory)
- via an `xrootd` directory (`SamplesManager` will then use the `xrdfs ls` command to get all of the files)
- via a DAS-style dataset, e.g. `/DoubleEG/Run2017B.../NANOAOD` (`SamplesManager` will then use `dasgoclient` to get all of the files)
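As a rough illustration of the three non-explicit discovery modes (this is not the actual `SamplesManager` implementation; the function names and placeholder arguments are invented), the underlying lookups boil down to:

```python
import glob
import subprocess

def files_from_local_dir(directory):
    """Local directory: glob all ROOT files."""
    return glob.glob(f"{directory}/*.root")

def files_from_xrootd_dir(redirector, directory):
    """xrootd directory: list its contents with `xrdfs ls`."""
    out = subprocess.check_output(["xrdfs", redirector, "ls", directory], text=True)
    return [line for line in out.splitlines() if line.endswith(".root")]

def files_from_das(dataset):
    """DAS-style dataset: ask dasgoclient for the list of files."""
    out = subprocess.check_output(
        ["dasgoclient", "--query", f"file dataset={dataset}"], text=True
    )
    return [line for line in out.splitlines() if line]
```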
Things to be improved/open questions:
- I notice that when accessing files through the Fermilab xrd redirector for US sites (`root://cmsxrootd.fnal.gov/`), the Eurasia redirector (`root://xrootd-cms.infn.it/`), or even the CERN global redirector (`root://cms-xrd-global.cern.ch/`), I often get errors due to timeouts. Is this just life? Or are there ways around this?
- Calculating MC metadata (`n_events`, `sum_of_weights`) should be done at job level, not all at once by the `SamplesManager`: doing it centrally takes a very long time and crashes if one file is corrupt/unavailable through `xrootd`. Update: now implemented.
- Deal with corrupt files in a more intelligent way: what to do if 1 in 10k data files is corrupt/unavailable? For MC it is more straightforward: corrupt/unavailable MC files should be removed from consideration (after a sufficient number of tries, to account for intermittent unavailability from `xrootd`). `JobsManager` will now "retire" jobs if they fail more than N (configurable) times and allows the script to finish anyway (but gives you a warning about each of the retired jobs).
Job Management tool
Addressing #10 (see the `higgs_dna.job_management` modules). The job management tools are heavily based on ProjectMetis [2], primarily written by Nick Amin. Jobs can either be submitted locally (running multiple jobs in parallel) or through `HTCondor`. The tool has been verified to work on `lxplus` as well as the UCSD T2.
The job management tool does the following:
- Create jobs for each individual sample and year, correctly propagating sample/year-specific arguments to each job (e.g. corrections, systematics, etc). Within each sample/year, the files will be split between jobs with a specified number of files per job.
- Create submission scripts (executables, condor submission scripts) and any necessary inputs (a tarfile of the `conda` environment) for the jobs.
- Submit all jobs to the specified batch system (local or `HTCondor`) and monitor them, resubmitting failed jobs.
- Once a sufficient number of jobs has finished (this should be 100% for data, but does not need to be for MC, especially in the case of a single corrupted file), calculate the sum of weights of all successfully processed files and a scale1fb for each MC sample/year.
- If the user specifies, merge all of the output `parquet` files into a single file. If there are systematics with independent collections, these will be merged into separate files (each IC can have a different number of events, so it does not make sense to merge them into the same `parquet` file). The merging step also adds branches for `year` and `process_id`, and applies the scale1fb and normalization info (cross section x branching fraction x luminosity) to each of the weight branches. The `process_id` field allows the individual processes to be identified; the assignment of each `Sample` to a `process_id` is recorded in a `json` file that is output along with the merged `parquet` files.
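A minimal sketch of that normalization and merging step, using pandas and one common scale1fb convention (the file names, cross section, luminosity, process id, and the `weight_` column prefix are placeholder assumptions, not HiggsDNA's actual values or schema):

```python
import json
import pandas as pd

files = ["job_0.parquet", "job_1.parquet"]   # outputs of the individual jobs
sum_of_weights = 4.2e6                        # summed over successfully processed MC files
xs_times_bf = 0.5071 * 0.00227                # pb, hypothetical ttH x H->gamma gamma numbers
lumi = 41.5                                   # fb^-1, hypothetical 2017 value

# One common convention: scale1fb = (xs * BF, converted to fb) / sum of generator weights
scale1fb = (xs_times_bf * 1000.0) / sum_of_weights

df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
df["year"] = "2017"
df["process_id"] = 1                          # the Sample -> process_id map is written out separately
for col in [c for c in df.columns if c.startswith("weight_")]:
    df[col] = df[col] * scale1fb * lumi       # apply xs * BF * lumi normalization
df.to_parquet("merged_nominal.parquet")

# Record the process_id assignment alongside the merged output
with open("process_id_map.json", "w") as f:
    json.dump({"ttH_M125_2017": {"process_id": 1, "scale1fb": scale1fb}}, f, indent=2)
```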
Things to be improved/open questions:
- Better way of determining the optimal number of files per job: columnar operations become more efficient the larger the arrays we run over. In light of this, it would probably be best to define a target maximum memory consumption for the job and pick the number of files per job such that we sit around this target (a back-of-the-envelope version of this idea is sketched after this list). This depends on the sample (MC has systematic variations, data does not) and on the process (different processes have different efficiencies), so it is maybe a bit like tilting at windmills... but it would be nice.
- When creating a tarfile of the `conda` environment, it gets pretty large (~half a GB); not sure if there is a way to reduce the size. I explored the compression factor, but this only helps by O(10%).
- Related to the above, the `conda pack` command sometimes takes 30 s and sometimes takes 5 min. Unsure why...
- The `condor_submit` command is intermittently extremely slow on `lxplus` (sometimes around 10 s per `condor_submit`), and I cannot figure out why. Update: never figured out why, but jobs are now submitted in batches of 100, which gives a much more reasonable runtime.
- As discussed multiple times, this job management tool can become a "backup" to the coffea-style of submitting jobs, which would allow us to utilize other modern tools like Dask, parsl, etc.
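A back-of-the-envelope sketch of the files-per-job idea from the first bullet above (both numbers are made up; in practice the per-file estimate would differ between data and MC and between processes):

```python
target_memory_gb = 4.0          # rough memory budget per job (assumed)
est_memory_per_file_gb = 0.25   # e.g. measured from a test run over a single file (assumed)

# Choose files per job so that the expected memory stays near the target
files_per_job = max(1, int(target_memory_gb // est_memory_per_file_gb))
print(files_per_job)            # -> 16 with the numbers above
```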
Analysis manager tool
Addressing #11 (see the `higgs_dna.analysis` module). This module serves as a wrapper for the `TagSequence`, `SystematicsProducer`, `SamplesManager`, and `JobsManager` classes: it owns instances of each of these and controls the analysis at a high level. The `AnalysisManager` class is pickle-able and saves itself repeatedly throughout running. This has the nice effect that if you stop running your analysis in the middle (e.g. you ctrl+c, lose your screen, etc.), you can run the same command again and the `run_analysis.py` script will detect the previously saved `AnalysisManager` `pkl` file and resume progress. This way, it still remembers the status of all of your jobs (e.g. which ones finished and the ids of the ones currently running on condor).
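The save/resume pattern boils down to something like the sketch below; the class body is only a stand-in for the real `AnalysisManager`, and the `save`/`load_or_create` names are assumptions:

```python
import os
import pickle

class AnalysisManager:
    """Stand-in for higgs_dna.analysis.AnalysisManager (the real class holds much more state)."""
    def __init__(self, config):
        self.config = config
        self.jobs = []  # job statuses live here and survive the pickle round-trip

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

def load_or_create(pkl_path, config):
    """Resume from a previous run if a pickle exists, otherwise start fresh."""
    if os.path.exists(pkl_path):
        with open(pkl_path, "rb") as f:
            return pickle.load(f)   # picks up job statuses from the interrupted run
    return AnalysisManager(config)
```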
Script for running an analysis
The `scripts/run_analysis.py` script allows the user to run conceivably any type of analysis, from:
1. making "ntuples" for deriving a systematic
2. making "ntuples" for an analysis preselection, to use for developing an analysis (data/MC plots, training ML algorithms, etc.)
3. running a full analysis and making "workspaces" for use with final fits
An entire analysis is specified through a `json` file, where there are 5 main things to specify, including:
- `TagSequence` -- the user specifies `higgs_dna.tagger.Tagger` objects, and can also specify `kwargs` for each `Tagger` object to run it with options other than the defaults for that `Tagger`.
- `Systematics` -- the user specifies dictionaries for both weight systematics and systematics with independent collections. Systematics can either be read from existing branches in the input nanoAOD or calculated on-the-fly through a function specified in the entry for that systematic.
- Input branches -- the branches to be read from the input nanoAOD files (currently specified by hand).
See the sample `json` configs under `metadata/analysis/` (e.g. `diphoton_preselection.json`) for an example analysis config.
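Purely for illustration, a config along these lines (shown here as a Python dict mirroring the json structure) covers the pieces described above; the key names, class paths, and values are assumptions rather than the actual HiggsDNA schema:

```python
import json

analysis_config = {
    "tag_sequence": [
        {
            "tagger": "higgs_dna.taggers.diphoton_tagger.DiphotonTagger",  # class name assumed
            "kwargs": {"options": {"photons": {"pt": 25.0}}},
        }
    ],
    "systematics": {
        "weights": {
            # read from an existing nanoAOD branch or computed on-the-fly by a named function
            "electron_veto_sf": {"type": "object", "method": "from_function"},
        },
        "independent_collections": {
            "photon_pt_scale": {"branch_modified": "Photon_pt"},
        },
    },
    "branches": ["Photon_pt", "Photon_eta", "Photon_phi", "Photon_mvaID", "weight_central"],
    "samples": "metadata/samples/some_sample_catalog.json",  # hypothetical path
}

print(json.dumps(analysis_config, indent=2))
```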
Things to be improved/open questions:
- Can we automatically detect the branches which need to be read from nanoAOD, rather than specifying them by hand? Reading only the branches which will actually be used is important: I found that around 75% (90%) of the runtime for MC (data) is spent simply loading the nanoAOD files (for a simple analysis with the diphoton preselection and some dummy systematics). Specifying all branches by hand is a bit tedious...
- I think it would be nice to summarize the physics content of an analysis in a `json` file. There are many printouts for individual jobs, but it would be nice to merge all of this and have something like:
    - efficiency of each cut in each tagger, for each sample/year
    - mean and std dev of each weight variation (for central/up/down), for each sample/year
    - efficiency of the selection for each systematic with an independent collection (additional info might be useful here as well)
A summary file with this information could save much debugging time, allowing users to easily spot buggy cuts and/or systematic implementations.
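As a rough sketch, the proposed summary could look something like this (all key names and numbers are invented for illustration):

```python
summary = {
    "ttH_M125_2017": {
        "cut_efficiencies": {
            "DiphotonTagger": {"pt_cuts": 0.83, "id_mva": 0.95},
        },
        "weight_systematics": {
            "electron_veto_sf": {"mean_up": 1.02, "mean_down": 0.98, "std": 0.01},
        },
        "independent_collections": {
            "photon_pt_scale_up": {"selection_efficiency": 0.412},
        },
    }
}
```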
All of this can be tested by merging this PR into your local branch and running the following (only tested on `lxplus` and the UCSD T2 so far):
conda activate higgs-dna
conda env update --file environment.yml --prune
to update your `conda` environment (I get errors with `conda pack` if I just update through `pip install -e .`), and then, to run a short example on 2017 MC and partial 2017 data with local job submission:
python run_analysis.py --config "metadata/analysis/diphoton_preselection_short.json" --merge_outputs --log-level "DEBUG" --output_dir "test"
or to run on full Run 2 MC (ttH, ggH) and data with condor submission:
python run_analysis.py --config "metadata/analysis/diphoton_preselection.json" --merge_outputs --log-level "DEBUG" --output_dir "test" --batch_system "condor"
Note:
- When running the full Run 2, some jobs may fail due to corruptions in the custom nanoAODs stored at UCSD (to be fixed soon).
- If running the full Run 2 on `lxplus`, you'll probably want to set the `output_dir` to somewhere in your `/afs/cern.ch/work` directory, otherwise you might run out of space in your home area.
Still to-do before merging: comment and clean up code (following `sphinx` style).
Comments/questions/criticisms are appreciated!
[1] https://indico.cern.ch/event/1071721/contributions/4551056/attachments/2320292/3950844/HiggsDNA_DiphotonPreselectionAndSystematics_30Sep2021.pdf
[2] https://github.com/aminnj/ProjectMetis/tree/master/metis