
Sync of diphoton preselection, local and plain condor submission supported on lxplus and UCSD T2

Merged Samuel May requested to merge smay/HiggsDNA:full_workflow into master

This PR implements:

Diphoton preselection

Addressing #7; see the higgs_dna.taggers.diphoton_tagger module. The diphoton preselection has been synced to the 0.01% level wrt flashgg. See the presentation in the Hgg working meeting [1] for more details. After some discussion, it was decided that the default treatment of events with multiple diphoton candidates should follow what is done in flashgg and take the candidate with the highest sum_pt. This is done at the higgs_dna.taggers.diphoton_tagger level. Multiple candidates occur in <1% of events for most analyses, but for analyses with electrons (which can pass the electron veto) the fraction can be higher, so this may be worth revisiting.
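
For illustration, a minimal sketch of this highest-sum_pt choice using awkward array (toy candidate layout; the actual fields in higgs_dna.taggers.diphoton_tagger may differ):

    import awkward as ak

    # Toy jagged array of diphoton candidates per event; "sum_pt" is a stand-in
    # for the lead + sublead pT sum used to rank candidates.
    diphotons = ak.Array([
        [{"sum_pt": 80.0}, {"sum_pt": 120.0}],  # event with two candidates
        [{"sum_pt": 60.0}],                     # event with a single candidate
    ])

    # Sort candidates within each event by sum_pt (descending) and keep the first,
    # mimicking the flashgg-style "highest sum_pt" choice described above.
    best = ak.firsts(diphotons[ak.argsort(diphotons.sum_pt, ascending=False)])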

Sample Management tool

Addressing #8 (see the higgs_dna.samples.sample and higgs_dna.samples.samples_manager modules). This allows samples to be specified through a json format in several different ways (an illustrative sketch follows the list):

  1. explicitly (a list of hard-coded file paths)
  2. via a local directory (SamplesManager will then use glob to get all root files in the directory)
  3. via an xrootd directory (SamplesManager will then use the xrdfs ls command to get all of the files)
  4. via a DAS-style dataset, e.g. /DoubleEG/Run2017B.../NANOAOD (SamplesManager will then use dasgoclient to get all of the files)
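
For illustration, a sketch of what such a sample json might look like (the field names here are hypothetical, not necessarily the exact HiggsDNA schema):

    {
        "DoubleEG_2017B_explicit" : { "files" : ["/path/to/file_1.root", "/path/to/file_2.root"] },
        "ttH_M125_2017_local"     : { "dir" : "/path/to/local/directory/" },
        "ggH_M125_2017_xrootd"    : { "xrd_dir" : "root://some.site.redirector//store/user/someone/ggH_M125_2017/" },
        "DoubleEG_2017B_das"      : { "dataset" : "/DoubleEG/Run2017B.../NANOAOD" }
    }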

Things to be improved/open questions:

  • I notice that when accessing files through the Fermilab xrd redirector for US sites (root://cmsxrootd.fnal.gov/), the Eurasia redirector (root://xrootd-cms.infn.it/), or even the CERN global redirector (root://cms-xrd-global.cern.ch/), I often get errors due to timeouts. Is this just life? Or are there ways around this?
  • Calculation of MC metadata (n_events, sum_of_weights) should be done at job level, not all at once by the SamplesManager; the latter takes a very long time and crashes if a single file is corrupt/unavailable through xrootd. This is now implemented.
  • Deal with corrupt files in a more intelligent way: what should be done if 1 out of 10k data files is corrupt/unavailable? For MC the situation is more straightforward: corrupt/unavailable MC files should be removed from consideration (after a sufficient number of retries, to account for intermittent unavailability through xrootd). The JobsManager will now "retire" jobs if they fail more than N (configurable) times and will allow the script to finish anyway (but warns you about each of the retired jobs); a minimal sketch of this logic follows.
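
The retry-then-retire behaviour described above, as a minimal sketch (not the actual JobsManager code; the job fields and methods are illustrative):

    MAX_RETRIES = 3  # the configurable N

    def monitor(jobs):
        # Resubmit failed jobs up to MAX_RETRIES times, then retire them with a warning.
        retired = []
        for job in jobs:
            if job.status == "failed":
                if job.n_attempts < MAX_RETRIES:
                    job.resubmit()          # e.g. intermittent xrootd unavailability
                else:
                    job.status = "retired"  # give up on this job, let the rest of the task finish
                    retired.append(job)
        for job in retired:
            print("WARNING: retired job %s after %d failed attempts" % (job.name, job.n_attempts))
        return retired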

Job Management tool

Addressing #10 (see higgs_dna.job_management modules). The job management tools are heavily based on ProjectMetis [2], primarily written by Nick Amin. Jobs can either be submitted locally (running multiple jobs in parallel) or through HTCondor. The tool has been verified to work on lxplus as well as the UCSD T2.

The job management tool does the following:

  • Create jobs for each individual sample and year, correctly propagating sample/year-specific arguments to each job (e.g. corrections, systematics, etc). Within each sample/year, the files will be split between jobs with a specified number of files per job.
  • Create submission scripts (executables, condor submission scripts) and any necessary inputs (tarfile of conda environment) for the jobs.
  • Submit all jobs to the specified batch system (local or HTCondor) and monitor jobs, resubmitting failed jobs.
  • Once a sufficient number of jobs has finished (this should be 100% for data, but does not need to be for MC, e.g. in the case of a single corrupted file), it will compute the sum of weights over all successfully processed files and derive a scale1fb for each MC sample/year (a sketch of the normalization convention follows the list).
  • If the user specifies, merge all of the output parquet files into a single file. If there are systematics with independent collections, these will be merged into separate files (each IC can have a different number of events, so it does not make sense to merge these into the same parquet file). It will also add branches for year, process_id, and apply the scale1fb and normalization info (cross section x branching fraction x luminosity) to each of the weight branches. The process_id field allows the individual processes to be identified. The assignment of each Sample to a process_id is recorded in a json file that is output along with the merged parquet files.
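
To make the normalization concrete, a sketch under a common convention (assuming scale1fb = cross section [converted to fb] x branching fraction / sum of generator weights, which is then multiplied by the luminosity; the exact HiggsDNA definition may differ, and the numbers and branch name below are purely illustrative):

    import awkward as ak

    def compute_scale1fb(xs_pb, bf, sum_weights):
        # Weight that normalizes an MC sample to 1 fb^-1 of data (assumed convention; 1 pb = 1000 fb).
        return (xs_pb * 1000.0 * bf) / sum_weights

    scale1fb = compute_scale1fb(xs_pb=0.507, bf=0.00227, sum_weights=4.2e6)  # illustrative ttH-like numbers
    lumi = 41.5  # approximate 2017 luminosity in fb^-1

    # Toy events record; "weight_central" is a hypothetical branch name.
    events = ak.Array({"weight_central": [1.02, 0.98, 1.05]})
    events["weight_central"] = events["weight_central"] * scale1fb * lumi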

Things to be improved/open questions:

  • Better way of determining the optimal number of files per job: columnar operations become more efficient the larger the arrays we run over. In light of this, it would probably be best to define a target maximum memory consumption for the job and pick the number of files per job such that we stay around this target. This is dependent on the sample (MC has systematic variations, data does not) and on the process (different processes have different efficiencies), so this is maybe a bit of tilting at windmills... but it would be nice.
  • When creating a tarfile of the conda environment, this gets pretty large (~half a GB) -- not sure if there is a way to reduce the size. I explored the compression factor, but this only helps by O(10%).
  • Related to the above, the conda pack command sometimes takes 30 s and sometimes 5 min. Unsure why...
  • The condor_submit command is intermittently extremely slow on lxplus (sometimes around 10 s per call). I never figured out why, but jobs are now submitted in batches of 100, which gives a much more reasonable runtime.
  • As discussed multiple times, this job management tool can become a "backup" to the coffea-style of submitting jobs, which would allow us to utilize other modern tools like Dask, parsl, etc.

Analysis manager tool

Addressing #11 (see the higgs_dna.analysis module). This module serves as a wrapper for the TagSequence, SystematicsProducer, SamplesManager and JobsManager classes. It owns instances of each of these and controls the analysis at a high level. The AnalysisManager class is pickle-able and saves itself repeatedly while running. This has the nice effect that if you stop running your analysis in the middle (e.g. you ctrl+c, lose your screen, etc.), you can run the same command again and the run_analysis.py script will detect the previously saved AnalysisManager pkl file and resume progress. This way, it still remembers the status of all of your jobs (e.g. which ones finished, the ids of the ones currently running on condor).
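
A minimal sketch of the checkpoint-and-resume pattern described above (simplified; not the actual AnalysisManager implementation):

    import os
    import pickle

    class AnalysisManager:
        def __init__(self, output_dir):
            self.output_dir = output_dir
            os.makedirs(output_dir, exist_ok=True)
            self.pkl_file = os.path.join(output_dir, "analysis_manager.pkl")

        def save(self):
            # Called repeatedly while running, so progress (e.g. job statuses) survives interruptions.
            with open(self.pkl_file, "wb") as f:
                pickle.dump(self, f)

        @staticmethod
        def load_or_create(output_dir):
            pkl_file = os.path.join(output_dir, "analysis_manager.pkl")
            if os.path.exists(pkl_file):  # a previous run was interrupted: resume it
                with open(pkl_file, "rb") as f:
                    return pickle.load(f)
            return AnalysisManager(output_dir)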

Script for running an analysis

The scripts/run_analysis.py script allows the user to run conceivably any type of analysis, from

  1. making "ntuples" for deriving a systematic
  2. making "ntuples" for an analysis preselection, to use for developing an analysis (data/MC plots, training ML algorithms, etc.)
  3. running a full analysis and making "workspaces" for use with final fits

An entire analysis is specified through a json file; the main things to specify include:

  1. TagSequence -- the user specifies higgs_dna.tagger.Tagger objects. The user can also specify kwargs of each Tagger object to run the Tagger with options other than the default options for that Tagger.
  2. Systematics -- the user specifies dictionaries for both weight systematics and systematics with independent collections. Systematics can either be read from existing branches in the input nanoAOD or can be calculated on-the-fly, through a function specified in the entry for that systematic.
  3. Input branches -- the user specifies which branches are read from the input nanoAOD files.

See this example of a sample json config for an analysis.

Things to be improved/open questions:

  • Can we automatically detect the branches which need to be read from nanoAOD, rather than specifying them by hand? Reading only the branches which will actually be used is important: I found that around 75% (90%) of the runtime for MC (data) is spent simply loading the nanoAOD files (for a simple analysis with the diphoton preselection and some dummy systematics). It is a bit tedious to specify all branches by hand...
  • I think it would be nice to summarize the physics content of an analysis in a json. There are many printouts for individual jobs, but it would be nice to merge all of this and have something like:
    1. efficiency of each cut in each tagger for each sample/year
    2. mean and std dev of each weight variation (for central/up/down) for each sample/year
    3. efficiency of selection on each systematic with an independent collection (additional info might be useful here as well)

A summary file with this information could save much debugging time, allowing users to easily spot buggy cuts and/or systematic implementations.

All of this can be tested by merging this PR in your local branch and running (only tested on lxplus and UCSD T2 so far):

conda activate higgs-dna
conda env update --file environment.yml --prune

to update your conda environment (I get errors with conda pack if I just update through pip install -e .) and then to run a short example on 2017 MC and partial 2017 data with local job submission:

python run_analysis.py --config "metadata/analysis/diphoton_preselection_short.json" --merge_outputs --log-level "DEBUG" --output_dir "test"

or to run on full Run 2 MC (ttH, ggH) and data with condor submission:

python run_analysis.py --config "metadata/analysis/diphoton_preselection.json" --merge_outputs --log-level "DEBUG" --output_dir "test" --batch_system "condor"

Note:

  • When running the full Run 2, some jobs may fail due to corruptions in the custom nanoAODs stored at UCSD (to be fixed soon).
  • If running the full Run 2 on lxplus, you'll probably want to set the output_dir to somewhere in your /afs/cern.ch/work directory, otherwise you might run out of space in your home area.

Still to-do before merging: comment & clean up code (following sphinx style)

Comments/questions/criticisms are appreciated!

[1] https://indico.cern.ch/event/1071721/contributions/4551056/attachments/2320292/3950844/HiggsDNA_DiphotonPreselectionAndSystematics_30Sep2021.pdf
[2] https://github.com/aminnj/ProjectMetis/tree/master/metis

Edited by Massimiliano Galli


Merged by Massimiliano Galli 3 years ago (Oct 11, 2021 2:07pm UTC)

Merge details

  • Changes merged into master with 58d45dc0 (commits were squashed).
  • Deleted the source branch.

Activity

  • Samuel May requested review from @magalli

    • Hey @smay, would it be possible to have separate PRs?

      The reason is that the photon pre-selection is straightforward and can be immediately merged, while the "backup submission system" is a bit more delicate, especially if we want to make it co-exist with the coffea way of submitting jobs.

    • Author Contributor

      @magalli and I discussed offline, but for posterity: we decided that there is no rush to merge the diphoton preselection, so we can just leave this PR up for a while to give people plenty of time to review and give feedback. If we need the diphoton preselection before everyone agrees on merging, a separate PR will be made.

  • Samuel May added 3 commits

    • 4d504cd7 - Scale weights by lumi, keep same process id for different years of the same sample
    • 4324179a - Save year info in outputs when merging
    • 41273a07 - Add option for saving object fields in flat arrays with dummy-value padding


  • Samuel May added 1 commit

    • 157c2648 - Update example script for running diphoton preselection on data + MC


  • Samuel May added 2 commits

    • 768ca198 - Fix merging of outputs with condor submission on T2's
    • b7c05bcf - Add hint about conda pack errors


  • Samuel May changed the description

  • Samuel May changed the description

  • Samuel May added 1 commit

    • 7ed07da9 - Calculate MC metadata inside jobs (rather than all at once at the beginning of...


  • Samuel May added 1 commit

    • 7863fc20 - Fix :bug: in AnalysisManager constructor


  • Samuel May added 1 commit

    • f54f1475 - Improve condor job submission (condor_submit is still very slow on lxplus for some reason...)


  • Samuel May changed the description

  • Samuel May added 2 commits

    • f9093fd2 - Fix division by zero :bug: , add more progress bars
    • 4e9ae076 - Fix cherry-pick conflicts


    • @smay - @magalli asked me to have a look. Below are comments on some of the higher-level things that were described or that a desire was expressed for. I'll get to the issue about github/lab/CI/etc in a bit.

      Data organization issue:

      • It would be best, from a data handling perspective, if your analysis datasets are contained in large NanoAOD files (~1-2M events each), and to split those up rather than processing multiple files in a single job. Each time you ask xrootd to open a file can generally be considered a possible point of failure (more on that below).

      One major thing sticks out to me, on writing out multiple files because of differences via systematics:

      • Parquet files support nullable data and structured data (the format was made for the business world, where often some entry in a form is forgotten and left blank, so null entries are a basic property of the data, unlike our detector data). Hence you can either create an index/axis in the pandas dataframe to label systematics (and then each index may have a varying number of events), or make null data entries for columns in the case where an event is selected under one systematic variation and not in another. I would personally give this significant thought and discuss the architecture here before moving further, since more files is always more trouble.

      Some commentary on "remaining issues", based on my experience.

      on xrootd-redirectors:

      • Do not use them; they are very flakey/terrible, and it really needs to be fixed by that team. For production it's not such a big deal, since they can just axe off that part of the dataset or produce more events; for analysis it's a major concern, as we do not have that freedom. It would be better to get an xcache instance set up at lxplus; the redirection code in it (this is what the coffea-casa analysis facility uses) is better than the widely used xrootd redirectors. In the absence of that solution, use the exact endpoint for the files: it is far more robust, but will be slow if the data is not site-local. TBH, the best practice for now is to have all necessary data local to the site where processing is happening. This also largely fixes issues with "corrupt" files; most of them are actually xrootd redirector screw-ups, proven by simply moving things to the compute site and using the XRD endpoint (a.k.a. suddenly everything worked when we removed the problematic pieces of the data access chain).

      on dealing with bad files:

      • This is for when the file is still not opening even after redirector issues are excluded. A second-order XRD issue is making sure you have file replicas; this mitigates on-site access issues (easy to do with rucio for centrally managed samples; private production is a WIP). If the file is really, actually corrupt, then we've found some reasonable logic for dealing with that here: https://github.com/CoffeaTeam/coffea/blob/master/coffea/processor/executor.py#L922-L943. That's a little snippet, so take it as you will. We wrap it so we can guide a for-loop/multiprocessing-map/dask/workqueue/etc. to either die on bad files or skip them dynamically. You can sort out weight1fb's after doing the processing by looking at root file UUIDs, scanning the Run trees, and adjusting for missed chunks within files (this could also be done per job via a tree-reduce and some special rules, but it's fast enough and easily cacheable from our perspective).

      on not specifying all branches by hand:

      • You may achieve this by scanning the root file metadata first to get the branch structure (this is done by default on opening the TTree in the root file since you have to do that to be able to deserialize the streamers). You can then use that knowledge to make lazy arrays for branches you want and construct objects and then event records out of those, all lazily, and not making counter arrays for each branch you want (big memory saver). You can even go so far as to implement gen matching, gen parentage tree walking, nanoaod index-based cross cleaning, etc... as dynamically created branches, waiting until the last moment to read in relevant lower level information.

      • If you do not wish to go through the pain of implementing that yourself, or do not have time, it has been implemented in coffea for quite some time as "nanoevents": https://github.com/CoffeaTeam/coffea/tree/master/coffea/nanoevents (which is a misnomer now, since it also reads Delphes/DAODPhysLite from Atlas, with additional abstractions for other file formats and data sources). We have verified this all works quite well since early 2020, and people continually add to it and improve functionality. (A minimal usage sketch follows this list.)

      • You can use the processor interface locally in a batch job, with the iterative executor, within your job submission tool; that should be reasonably compatible (other people already do this to varying degrees of complexity). It should also be compatible with your frontend for systematics, with some small tweaks.
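
      Following up on the nanoevents pointer above, a minimal usage sketch (the file path is a placeholder; assumes coffea is installed):

      from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

      # Lazily open a NanoAOD file: branches are only read when they are actually accessed.
      events = NanoEventsFactory.from_root(
          "nano.root",                # placeholder path
          schemaclass=NanoAODSchema,  # builds Photon, Electron, ... object collections
      ).events()

      # Only the branches touched here (Photon_pt, Photon_eta) are actually loaded.
      photons = events.Photon
      selected = photons[(photons.pt > 25.0) & (abs(photons.eta) < 2.5)]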

      getting summary information out of jobs in a json:

      • We do this in coffea, leaving it up to the user/framework-developer to return summary physics content information and histograms per task over a dataset; this is accumulated by a tree-reduce on the batch system and returned to the user. The spec is now straightforward: you return something which has a + or update(), applied recursively to all subnodes. If you enforce in the framework that a subsection of that is pure json spec (dictionaries, lists, strings/bytes, numbers, ...), then it can be sliced out and serialized or displayed in some way automatically. We also have a switch for some uniform technical statistics like bytes read and wall/system times. You can find multiple examples of this specification in https://github.com/CoffeaTeam/coffea/blob/master/coffea/processor/executor.py#L478 (we call them "accumula(ta)bles"). Full analysis example here: https://github.com/nsmith-/boostedhiggs/blob/master/boostedhiggs/hbbprocessor.py; look for the make_outputs function and then follow the outputs variable through the code. For what you want, you'd have some sub-dictionary labelled "summaries" or something that could be made common via a base class.
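
      As a toy sketch of the "things that add" idea (pure Python, no coffea dependency; the summary fields are illustrative):

      def accumulate(a, b):
          # Recursively merge two per-job summary dictionaries by adding leaf values.
          out = dict(a)
          for key, value in b.items():
              if key in out and isinstance(out[key], dict):
                  out[key] = accumulate(out[key], value)
              else:
                  out[key] = out.get(key, 0) + value
          return out

      job_1 = {"ttH_2017": {"n_events": 1000, "n_pass_diphoton": 350}}
      job_2 = {"ttH_2017": {"n_events": 1200, "n_pass_diphoton": 410}}
      summary = accumulate(job_1, job_2)
      # -> {"ttH_2017": {"n_events": 2200, "n_pass_diphoton": 760}}, ready to dump as json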

      on optimizing job hyperparameters (files per job, chunk size, etc.):

      • There has been significant work on this in the context of WorkQueue in coffea. Chunk sizing given a memory target can be calculated dynamically (though it invalidates some nice book-keeping / idempotency tricks exploiting ROOT UUIDs and event ranges if you're making derived tuples). Follow this variable around and you'll see an example of how it works: https://github.com/CoffeaTeam/coffea/blob/master/coffea/processor/executor.py#L402. Ben Tovar at Notre Dame has spent a lot of time working on this with students. Feel free to pick his brain.

      on anything related to conda pack:

      • You should talk to the developers of WorkQueue (https://github.com/cooperative-computing-lab/cctools); they have been using conda in production at scale for a number of years and understand many aspects of optimization therein. They also have a coffea interface for WorkQueue, developed and used in a live analysis (a top EFT analysis), if you are dead-set on conda (otherwise, cvmfs singularity images are much more efficient for running on a cluster; it should be possible to intermingle them with WQ with some talking to people and fiddling). They're also very curious about partial results and such too.

      slow lxplus condor_submit:

      • Use job clusters (i.e. multiple Queue commands in the submission file) instead of submitting individual jobs for each (group of) files, as much as you can (a generic sketch follows below). I do not know what ProjectMetis does to submit jobs, though.
      • Talk to Ben Jones (ben.dylan.jones@cern.ch, CERN condor admin) about this... it's honestly kinda weird, but I just don't use lxplus that much and other sites are snappy.
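
      For instance, a generic sketch (not the ProjectMetis implementation) of building one submit file per task, so that a single condor_submit call creates the whole job cluster:

      # Hypothetical helper: one HTCondor submit file whose "queue ... from" statement
      # fans out over a list of per-job input file groups, so only one condor_submit is needed.
      submit_lines = [
          "executable = run_job.sh",
          "arguments  = $(job_inputs)",
          "output     = job_$(Cluster)_$(Process).out",
          "error      = job_$(Cluster)_$(Process).err",
          "log        = task_$(Cluster).log",
          "queue job_inputs from job_inputs.txt",
      ]
      with open("task.submit", "w") as f:
          f.write("\n".join(submit_lines) + "\n")

      # job_inputs.txt contains one line per job (e.g. a comma-separated list of input files);
      # the whole cluster is then submitted with a single call:  condor_submit task.submit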
      Edited by Lindsey Gray
    • Author Contributor

      Hi @lgray , thanks for the comments & suggestions, this is very helpful! Some responses and questions for you inline:

      It would be best, from a data handling perspective, if your analysis datasets are contained in large NanoAOD files (~1-2M events each), and to split those up rather than processing multiple files in a single job. Each time you ask xrootd to open a file can generally be considered a possible point of failure (more on that below).

      Yes, agree completely. Given that we hope to have all necessary variables added to central nanoAOD, do you know if there is a central set of nanoAOD where this is done? Or would each site need to run haddnano.py and create their own set of locally stored nanoAODs? (Which probably wouldn't be a big deal, given your later comments about xrootd unreliability.)

      Parquet files support nullable data and structured data (the format was made for the business world, where often some entry in a form is forgotten and left blank, so null entries are a basic property of the data, unlike our detector data). Hence you can either create an index/axis in the pandas dataframe to label systematics (and then each index may have a varying number of events), or make null data entries for columns in the case where an event is selected under one systematic variation and not in another. I would personally give this significant thought and discuss the architecture here before moving further, since more files is always more trouble.

      This makes sense in general to me; however, we might consider still having 2 separate output files per job: 1 for the "nominal" events and 1 for all other systematic variations with independent collections (where syst_idx could be an additional field identifying which variation they belong to). The reason for putting the "nominal" events separately is that we typically store all the weight variations (e.g. some_scale_factor_up/down) only for the nominal events. We can have up to 30-50 of these for a given analysis, and typically the only other outputs required for the final statistical analysis are the diphoton mass (and maybe pT). So the nominal events "tree" could easily be 30-50x the size of the "trees" for each syst with an IC.

      My question would then be: do "null" entries in a parquet file take up a significant amount of space? I.e. does my argument above make sense?

      getting summary information out of jobs in a json: We do this in coffea, leaving it up to the user/framework-developer to return summary physics content information and histograms per task over a dataset

      I was a bit misleading in my description: the workflow for this type of summary is also already in place in HiggsDNA; my point is that we should add more relevant information beyond what we already have (i.e. the work in HiggsDNA is on the "leaving it up to the user/framework-developer to return summary physics content information" part).

      You may achieve this by scanning the root file metadata first to get the branch structure (this is done by default on opening the TTree in the root file since you have to do that to be able to deserialize the streamers). You can then use that knowledge to make lazy arrays for branches you want and construct objects and then event records out of those, all lazily, and not making counter arrays for each branch you want (big memory saver). You can even go so far as to implement gen matching, gen parentage tree walking, nanoaod index-based cross cleaning, etc... as dynamically created branches, waiting until the last moment to read in relevant lower level information. If you do not wish to go through the pain of implementing that yourself, or do not have time, it has been implemented in coffea for quite some time as "nanoevents": https://github.com/CoffeaTeam/coffea/tree/master/coffea/nanoevents (which is a misnomer now since it reads Delphes/DAODPhysLite from Atlas as well, with additional abstractions for other file formats and data sources). We have verified this all works quite well since early 2020 and people continually add to it and improve functionality.

      This looks really nice (and @magalli had already brought up this feature previously on our list of coffea aspects to adapt), I think it would be great to use.

    • haddnano.py at sites:

      • if it's from central production, this is done in the merging step of nanoaod production anyway, and file sizes are typically fine. No need to do anything manually unless something very odd or dumb happened.
      • for private productions, do the production, run haddnano.py once, then distribute the hadd'd files

      Nulls:

      import pandas as pd
      import numpy as np

      # 10M rows x 10 float columns, written fully populated...
      df = pd.DataFrame(np.random.normal(size=(10_000_000, 10)), columns=list("ABCDEFGHIJ"))
      df.to_parquet("dense.parquet")

      # ...and written again with 40% of the entries in each column nulled out,
      # to compare the file sizes with and without nulls.
      for col in df.columns:
          df.loc[df.sample(frac=0.4).index, col] = pd.NA
      df.to_parquet("nulled.parquet")
      • pyarrow (the thing that actually writes the parquet files) can also do some pretty advanced things with compression, like ROOT files can. With the pandas interface you can only specify gzip (which is a bit better for our data than snappy). If you convert to the intermediate pyarrow format you can pass in lzma/zstd/lz4, which are even better. Not sure how much it matters right now, since these files should be fairly small.
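
      For instance, a sketch of going through pyarrow directly to pick a different codec:

      import pandas as pd
      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq

      df = pd.DataFrame(np.random.normal(size=(1_000, 3)), columns=list("ABC"))

      # Converting to a pyarrow Table first gives access to codecs (e.g. zstd) beyond
      # what the plain pandas to_parquet interface exposes.
      table = pa.Table.from_pandas(df)
      pq.write_table(table, "compressed.parquet", compression="zstd")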

      Summaries:

      • Sure, if you're going to use the processor idiom (it'll help if you want nanoevents), then you'll just have to move some things around. Just informing you how it would need to change.

      Nanoevents:

      • Good, it'll save you a lot of time and pain.
# Load samples
self.samples = self.sample_manager.get_samples()

if not self.prepared_analysis:
    self.prepare_analysis()

n_input_files = 0
n_tasks = 0
n_jobs = 0
for task in self.jobs_manager.tasks:
    n_tasks += 1
    n_input_files += len(task.files)
    n_jobs += len(task.jobs)

logger.info("[AnalysisManager : run] Running %d tasks with %d total input files split over %d total jobs." % (n_tasks, n_input_files, n_jobs))
  • funcName is one of the attributes understood by the logging formatter itself, while for the class name I remember seeing a tweak somewhere on stackoverflow (I have to find it again). This way we don't need to hardcode it in the message and it is displayed nicely :)

  • Author Contributor

    Nice, thanks for pointing this out. Probably it's plenty if we just add the function name to be printed out, since the filename + line number is also printed out by default. I can update these when I go through and add comments.
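
    For reference, a minimal sketch of the standard-library approach being discussed (funcName, filename and lineno are built-in LogRecord attributes):

    import logging

    # Include the function name (plus file and line) via the formatter,
    # instead of hardcoding it in each message.
    logging.basicConfig(
        format="%(levelname)s [%(filename)s:%(lineno)d %(funcName)s] %(message)s",
        level=logging.DEBUG,
    )
    logger = logging.getLogger(__name__)

    def run():
        logger.info("Running %d tasks", 3)

    run()  # prints something like: INFO [example.py:14 run] Running 3 tasks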

  • Samuel May added 4 commits

    • e24c7e85 - Only modify central weight when applying scale1fb and lumi, automatically save...
    • abaeab01 - Implement electron veto SF
    • a513dcd0 - Add function for syst from IC from bins, allow options for (1) syst with IC to...
    • 176bbaab - Implement MC trigger SF


  • Diff excerpt (line numbers stripped; the last two lines show the old and new produce_diphotons signatures):

    {
        "name" : "example_analysis",
        "function" : {
    ...
        },
        "diphotons" : {
            "lead_pt" : 35.0,
            "sublead_pt" : 25.0,
            "lead_pt_mgg" : 0.33,
            "sublead_pt_mgg" : 0.25,
            "select_highest_pt_sum" : True
        }

    }

      # NOTE: pre-compiled numba functions should be defined outside the class,
      # as numba does not like when a class instance is passed to a function
      @numba.njit
    - def produce_diphotons(photons, n_photons, lead_pt_cut):
    + def produce_diphotons(photons, n_photons, lead_pt_cut, lead_pt_mgg_cut, sublead_pt_mgg_cut):
    • Is there a reason I'm missing for adopting the jitting + explicit-loop solution here?

    • Author Contributor

      This goes back to what we discussed in the spring: we do not need to strictly adhere to either loop-style or columnar-style of processing events and we should have the ability (and examples) of doing things both ways. Many new students in the group will be more familiar with the loop-style of doing things and the logic may be more familiar to them.

      I made a comparison a while back (in one of our old HiggsDNA chats) between python-loop, columnar operations, and numba-compiled loops. Both columnar operations and numba-compiled loops are orders of magnitude faster than a plain-python loop and the difference between columnar and numba-compiled was negligible in comparison (in fact, the "winner" depended on the number of events we were processing).

      So, I would say it's simply a user preference whether a given selection is implemented as a columnar operation or a numba-compiled loop.
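
      To make the comparison concrete, a toy example of the same counting cut written both ways (simplified; not the actual produce_diphotons logic):

      import awkward as ak
      import numba
      import numpy as np

      photons_pt = ak.Array([[40.0, 20.0], [10.0], [55.0, 30.0, 5.0]])  # toy jagged pt values

      # Columnar style: one vectorized expression over the whole (jagged) array.
      n_pass_columnar = ak.sum(photons_pt > 25.0, axis=1)

      # Numba-compiled explicit loop: the same logic written event by event,
      # operating on a flattened array plus per-event counts.
      @numba.njit
      def count_passing(pts_flat, counts, pt_cut):
          out = np.zeros(len(counts), dtype=np.int64)
          i = 0
          for evt in range(len(counts)):
              for _ in range(counts[evt]):
                  if pts_flat[i] > pt_cut:
                      out[evt] += 1
                  i += 1
          return out

      n_pass_loop = count_passing(
          np.asarray(ak.flatten(photons_pt)), np.asarray(ak.num(photons_pt)), 25.0
      )
      # n_pass_columnar and n_pass_loop both give [1, 0, 2]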

  • Massimiliano Galli mentioned in issue #13

  • Massimiliano Galli mentioned in merge request !5 (merged)

  • Samuel May added 1 commit

    • e9ac3fa7 - Speed up job submission + monitoring on lxplus

