
Sync of diphoton preselection, local and plain condor submission supported on lxplus and UCSD T2

Samuel May requested to merge smay/HiggsDNA:full_workflow into master

This PR implements:

Diphoton preselection

Addressing #7; see the higgs_dna.taggers.diphoton_tagger module. The diphoton preselection has been synced with flashgg at the 0.01% level; see the presentation in the Hgg working meeting [1] for more details. After some discussion, it was decided that the default treatment of events with multiple diphoton candidates should follow what is done in flashgg: take the candidate with the highest sum_pt. This is done at the higgs_dna.taggers.diphoton_tagger level. Such events are usually <1% of the total, but for analyses with electrons (which pass the electron veto) the fraction can be higher, so this may be worth revisiting.
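For illustration, a minimal awkward-array sketch of the "highest sum_pt" choice described above; the field and function names here are hypothetical, not taken from the actual tagger code:

```python
# Minimal sketch (not the actual diphoton_tagger implementation) of picking the
# diphoton candidate with the highest sum_pt in each event, assuming a jagged
# awkward array `diphotons` with a per-candidate `sum_pt` field.
import awkward as ak

def select_highest_sum_pt(diphotons):
    # Index of the highest-sum_pt candidate in each event; keepdims preserves
    # the jagged structure so the result can be used for slicing.
    best_idx = ak.argmax(diphotons.sum_pt, axis=1, keepdims=True)
    # Slice out that candidate and drop the now-trivial inner dimension.
    return ak.firsts(diphotons[best_idx])
```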

Sample Management tool

Addressing #8 (see the higgs_dna.samples.sample and higgs_dna.samples.samples_manager modules). Samples can be specified through a json format in several different ways (a sketch of such a config follows the list):

  1. explicitly (a list of hard-coded file paths)
  2. via a local directory (SamplesManager will then use glob to get all root files in the directory)
  3. via an xrootd directory (SamplesManager will then use the xrdfs ls command to get all of the files)
  4. via a DAS-style dataset, e.g. /DoubleEG/Run2017B.../NANOAOD (SamplesManager will then use dasgoclient to get all of the files)
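Purely as an illustration of these four options, a sample catalog might look something like the following; the key names and paths are placeholders, not the actual schema expected by SamplesManager:

```python
# Hypothetical sample catalog illustrating the four ways of specifying files;
# key names and paths are placeholders, not the actual SamplesManager schema.
import json

samples = {
    "ttH_M125_2017": {   # 1. explicit list of hard-coded file paths
        "files": ["/path/to/file_1.root", "/path/to/file_2.root"]
    },
    "ggH_M125_2017": {   # 2. local directory (globbed for *.root)
        "dir": "/path/to/local/ggH_M125_2017/"
    },
    "DoubleEG_2017C": {  # 3. xrootd directory (listed with `xrdfs ls`)
        "xrd_dir": "root://some-redirector//store/user/somewhere/DoubleEG_2017C/"
    },
    "DoubleEG_2017B": {  # 4. DAS-style dataset (resolved with dasgoclient)
        "das": "/DoubleEG/Run2017B.../NANOAOD"
    },
}

with open("samples.json", "w") as f_out:
    json.dump(samples, f_out, indent=4)
```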

Things to be improved/open questions:

  • I notice that when accessing files through the Fermilab xrootd redirector for US sites (root://cmsxrootd.fnal.gov/), the Eurasia redirector (root://xrootd-cms.infn.it/), or even the CERN global redirector (root://cms-xrd-global.cern.ch/), I often get errors due to timeouts. Is this just life? Or are there ways around this?
  • Calculation of MC metadata (n_events, sum_of_weights) should be done at the job level, not all at once by the SamplesManager: doing it up front takes a very long time and crashes if a single file is corrupt or unavailable through xrootd. This is now implemented.
  • Deal with corrupt files in a more intelligent way: what should be done if 1 in 10k data files is corrupt/unavailable? For MC it is more straightforward: corrupt/unavailable MC files should be removed from consideration (after a sufficient number of retries to account for intermittent xrootd unavailability). The JobsManager will now "retire" jobs that fail more than N (configurable) times and lets the script finish anyway, with a warning about each retired job (a sketch of this retirement logic is shown below the list).
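A minimal sketch of the "retire after N failures" idea, with hypothetical class and attribute names rather than the actual JobsManager API:

```python
# Sketch of retiring a job after too many failures; names are illustrative only.
import logging

logger = logging.getLogger(__name__)

class Job:
    def __init__(self, name, max_attempts=5):
        self.name = name
        self.max_attempts = max_attempts  # the configurable N from the text above
        self.n_attempts = 0
        self.status = "waiting"

    def record_failure(self):
        self.n_attempts += 1
        if self.n_attempts >= self.max_attempts:
            self.status = "retired"  # stop resubmitting, but let the script finish
            logger.warning("Job %s retired after %d failed attempts.", self.name, self.n_attempts)
        else:
            self.status = "waiting"  # eligible for resubmission
```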

Job Management tool

Addressing #10 (see higgs_dna.job_management modules). The job management tools are heavily based on ProjectMetis [2], primarily written by Nick Amin. Jobs can either be submitted locally (running multiple jobs in parallel) or through HTCondor. The tool has been verified to work on lxplus as well as the UCSD T2.

The job management tool does the following:

  • Create jobs for each individual sample and year, correctly propagating sample/year-specific arguments to each job (e.g. corrections, systematics, etc). Within each sample/year, the files will be split between jobs with a specified number of files per job.
  • Create submission scripts (executables, condor submission scripts) and any necessary inputs (tarfile of conda environment) for the jobs.
  • Submit all jobs to the specified batch system (local or HTCondor) and monitor jobs, resubmitting failed jobs.
  • Once a sufficient number of jobs has finished (this should be 100% for data, but does not need to be for MC, e.g. in the case of a single corrupted file), it computes the sum of weights over all successfully processed files and derives a scale1fb for each MC sample/year (see the sketch after this list).
  • If the user specifies, merge all of the output parquet files into a single file. If there are systematics with independent collections, these will be merged into separate files (each IC can have a different number of events, so it does not make sense to merge these into the same parquet file). It will also add branches for year, process_id, and apply the scale1fb and normalization info (cross section x branching fraction x luminosity) to each of the weight branches. The process_id field allows the individual processes to be identified. The assignment of each Sample to a process_id is recorded in a json file that is output along with the merged parquet files.
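A minimal sketch of the normalization step described in the last two bullets; the exact convention used in HiggsDNA may differ, and the numbers are placeholders:

```python
# Sketch of MC normalization: scale1fb is the per-event weight that normalizes a
# sample to 1 fb^-1; multiplying by the luminosity gives the full normalization
# (cross section x branching fraction x luminosity / sum of weights).
def scale1fb(xs_pb, bf, sum_of_weights):
    return (xs_pb * 1000.0 * bf) / sum_of_weights  # 1000 converts pb -> fb

# Hypothetical example: xs = 0.5 pb, BF = 0.00227, summed generator weight 4.2e6,
# normalized to 137 fb^-1. Each weight branch in the merged parquet output would
# then be multiplied by `norm`.
norm = scale1fb(0.5, 0.00227, 4.2e6) * 137.0
```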

Things to be improved/open questions:

  • Better way of determining the optimal number of files per job: columnar operations become more efficient the larger the arrays we run over. In light of this, it would probably be best to define a target maximum memory consumption per job and pick the number of files per job so that we stay around that target. This depends on the sample (MC has systematic variations, data does not) and on the process (different processes have different efficiencies), so this is maybe a bit of tilting at windmills... but it would be nice.
  • When creating a tarfile of the conda environment, this gets pretty large (~half a GB) -- not sure if there is a way to reduce the size. I explored the compression factor, but this only helps by O(10%).
  • Related to above, the conda pack command sometimes takes 30s, sometimes takes 5min. Unsure why...
  • The condor_submit command is intermittently extremely slow on lxplus (sometimes around 10 s per condor_submit) and I have never figured out why. As a workaround, jobs are now submitted in batches of 100, which gives a much more reasonable runtime (see the sketch after this list).
  • As discussed multiple times, this job management tool can become a "backup" to the coffea-style of submitting jobs, which would allow us to utilize other modern tools like Dask, parsl, etc.
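A minimal sketch of the batched-submission workaround mentioned above: group the jobs into one submit description file per batch of 100 and call condor_submit once per batch. The helper names and submit-file contents are hypothetical, not the actual JobsManager implementation:

```python
# Sketch of batched HTCondor submission; names and submit-file contents are
# illustrative placeholders.
import subprocess

def submit_in_batches(job_arguments, executable, batch_size=100):
    for i in range(0, len(job_arguments), batch_size):
        batch = job_arguments[i:i + batch_size]
        submit_file = "batch_%d.sub" % (i // batch_size)
        with open(submit_file, "w") as f_out:
            f_out.write("executable = %s\n" % executable)
            f_out.write("output = job_$(Cluster)_$(Process).out\n")
            f_out.write("error  = job_$(Cluster)_$(Process).err\n")
            f_out.write("log    = job_$(Cluster)_$(Process).log\n")
            # One "arguments = ...; queue" pair per job in this batch.
            for args in batch:
                f_out.write("arguments = %s\nqueue\n" % args)
        subprocess.run(["condor_submit", submit_file], check=True)
```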

Analysis manager tool

Addressing #11 (see the higgs_dna.analysis module). This module serves as a wrapper for the TagSequence, SystematicsProducer, SamplesManager and JobsManager classes: it owns instances of each of these and controls the analysis at a high level. The AnalysisManager class is pickle-able and saves itself periodically while running. This has the nice effect that if you stop running your analysis in the middle (e.g. you ctrl+c, lose your screen, etc.), you can run the same command again and the run_analysis.py script will detect the previously saved AnalysisManager pkl file and resume progress. This way it still remembers the status of all of your jobs (e.g. which ones finished, and the ids of the ones currently running on condor).
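A minimal sketch of this checkpoint/resume pattern; the file name and class interface are hypothetical stand-ins, not the actual run_analysis.py code:

```python
# Sketch of resuming a pickled analysis state; names are illustrative only.
import os
import pickle

CHECKPOINT = "analysis_manager.pkl"  # assumed checkpoint file name

class AnalysisManager:
    """Stand-in for higgs_dna.analysis.AnalysisManager, for illustration only."""
    def __init__(self, config):
        self.config = config

def load_or_create_manager(config):
    if os.path.exists(CHECKPOINT):
        # Resume: restore the manager (including job statuses) from the last save.
        with open(CHECKPOINT, "rb") as f_in:
            return pickle.load(f_in)
    return AnalysisManager(config)

def save_manager(manager):
    # Called periodically so a later invocation can pick up where this one left off.
    with open(CHECKPOINT, "wb") as f_out:
        pickle.dump(manager, f_out)
```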

Script for running an analysis

The scripts/run_analysis.py script allows the user to run conceivably any type of analysis, from

  1. making "ntuples" for deriving a systematic
  2. making "ntuples" for an analysis preselection to use for developing an analysis (data/MC plots, training ML algorithms, etc.)
  3. running a full analysis and making "workspaces" for use with final fits

An entire analysis is specified through a json file, where there are 5 main things to specify:

  1. TagSequence -- the user specifies higgs_dna.tagger.Tagger objects. The user can also specify kwargs of each Tagger object to run the Tagger with options other than the default options for that Tagger.
  2. Systematics -- the user specifies dictionaries for both weight systematics and systematics with independent collections. Systematics can either be read from existing branches in the input nanoAOD or can be calculated on-the-fly, through a function specified in the entry for that systematic.
  3. Input branches -- the branches to be read from the input nanoAOD files (currently specified by hand; see the open questions below).

See this example of a sample json config for an analysis.
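For orientation, a pared-down sketch of what such a config might contain, limited to the items listed above; the key names are guesses based on the description, not the actual schema read by run_analysis.py:

```python
# Hypothetical, pared-down analysis config; keys and values are placeholders.
analysis_config = {
    "tag_sequence": [
        {
            "tagger": "higgs_dna.taggers.diphoton_tagger.DiphotonTagger",
            "kwargs": {}  # non-default Tagger options would go here
        }
    ],
    "systematics": {
        "weights": {
            # read from an existing branch or computed on-the-fly via a function
            "dummy_weight_syst": {"branches": ["weight_dummy_up", "weight_dummy_down"]}
        },
        "independent_collections": {}
    },
    "branches": ["Photon_pt", "Photon_eta", "Photon_phi", "Photon_mvaID"]
    # ... remaining entries omitted
}
```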

Things to be improved/open questions:

  • Can we automatically detect the branches which need to be read from nanoAOD, rather than specifying them by hand? Reading only the branches which will actually be used is an important point: I found that around 75% (90%) of the runtime for MC (data) is spent simply loading the nanoAOD files (for a simple analysis with the diphoton preselection and some dummy systematics). Specifying all branches by hand is a bit tedious... (a sketch of this kind of selective reading is shown below this list).
  • I think it would be nice to summarize the physics content of an analysis in a json. There are many printouts for individual jobs, but it would be nice to merge all of this and have something like:
    1. efficiency of each cut in each tagger for each sample/year
    2. mean and std dev of each weight variation (for central/up/down) for each sample/year
    3. efficiency of selection on each systematic with an independent collection (additional info might be useful here as well)

A summary file with this information could save much debugging time, allowing users to easily spot buggy cuts and/or systematic implementations.
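On the first open question above (reading only the branches that are needed), a minimal sketch of the kind of selective loading meant, using uproot directly; HiggsDNA's actual I/O code may be organized differently:

```python
# Sketch of reading only a hand-picked set of branches from a nanoAOD file with
# uproot; the file path and branch list are placeholders.
import uproot

branches = ["Photon_pt", "Photon_eta", "Photon_phi", "Photon_mvaID"]

with uproot.open("nanoAOD_file.root") as f_in:
    events = f_in["Events"].arrays(branches, library="ak")

# Only the requested branches are deserialized, which is where most of the
# runtime savings quoted above would come from.
```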

All of this can be tested by merging this PR into your local branch and running (only tested on lxplus and the UCSD T2 so far):

conda activate higgs-dna
conda env update --file environment.yml --prune

to update your conda environment (I get errors with conda pack if I just update through pip install -e .) and then to run a short example on 2017 MC and partial 2017 data with local job submission:

python run_analysis.py --config "metadata/analysis/diphoton_preselection_short.json" --merge_outputs --log-level "DEBUG" --output_dir "test"

or to run on full Run 2 MC (ttH, ggH) and data with condor submission:

python run_analysis.py --config "metadata/analysis/diphoton_preselection.json" --merge_outputs --log-level "DEBUG" --output_dir "test" --batch_system "condor"

Note:

  • When running the full Run 2, some jobs may fail due to corruptions in the custom nanoAODs stored at UCSD (to be fixed soon).
  • If running the full Run 2 on lxplus, you'll probably want to set the output_dir to somewhere in your /afs/cern.ch/work directory, otherwise you might run out of space in your home area.

Still to do before merging: comment and clean up the code (following the Sphinx style).

Comments/questions/criticisms are appreciated!

[1] https://indico.cern.ch/event/1071721/contributions/4551056/attachments/2320292/3950844/HiggsDNA_DiphotonPreselectionAndSystematics_30Sep2021.pdf
[2] https://github.com/aminnj/ProjectMetis/tree/master/metis
