It would be great if you added this to ml_tools
in the main branch of coffea. We already have xgboost, pytorch, triton.
Unless someone else wants to do this, I'm happy to make a fork or open a PR against the tool so it can handle the larger-bitwidth offsets. It's an opportunity for someone to learn what's going on, so I figured I'd let someone else take a shot at it after describing the issue a bit.
Here - adding the cross section to the coffea fileset metadata will get you a good way toward the finish line.
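A minimal sketch of what I mean, assuming the coffea 0.7-style fileset dict (dataset name, file path, and xsec value are all placeholders):

fileset = {
    "ttbar": {  # hypothetical dataset name
        "files": ["root://xrootd.example//store/mc/nano_1.root"],  # placeholder path
        "metadata": {"xsec": 831.76},  # cross section in pb, placeholder value
    },
}

# inside your processor, the metadata rides along with each chunk:
# xsec = events.metadata["xsec"]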
The rest looks like you want an overall acceptance N_all_categories / N_input_events that you compute on the batch cluster and add to your final dumps. Then your (eps * A)_ij can be calculated from the output files. Right?
Hey! So this is because of: https://github.com/ponyisi/parquet_to_root/blob/main/parquet_to_root/parquet_to_root_pyroot.py#L147-L149
In particular, this version of the parquet converter doesn't know about LargeListType, which pyarrow treats as a separate type from ListType because of its larger (64-bit) offsets!
In terms of how they represent ragged arrays they're exactly the same, so you just need to teach checks like this one to treat LargeListType like ListType and you're done!
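Something along these lines (a sketch, not the converter's actual code):

import pyarrow as pa

# accept both the 32-bit and 64-bit offset variants of Arrow list types
def is_list_like(arrow_type):
    return pa.types.is_list(arrow_type) or pa.types.is_large_list(arrow_type)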
Also, I'd shy away from using PyROOT; it can get very slow for large numbers of columns. You should be able to build the TTree you want with uproot.
Better yet - you may simply want to make your statistics tool able to read in parquet files via awkward (or dask_awkward)!
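For instance (the file name is a placeholder):

import awkward as ak

events = ak.from_parquet("output.parquet")  # eager, whole-file read

# or lazily, for larger-than-memory inputs
import dask_awkward as dak

lazy_events = dak.from_parquet("output.parquet")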
@jspah In practice we've found that automated updates/downloads can be a bit fraught with unintentional error (on either side!). It's definitely a good idea to have all the corrections you need defined in one or a few files, so maintenance is easy on the code side (https://github.com/nsmith-/boostedhiggs/blob/master/boostedhiggs/corrections.py), and then connect that to a directory (as you see at the top of corrections.py) where you keep the corrections themselves and update those through pull requests: https://github.com/nsmith-/boostedhiggs/tree/master/boostedhiggs/data
Then just make it so the repository builds the corrections when checked out fresh.
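FWIW, the corrections.py linked above is built around coffea's lookup_tools; a minimal sketch of that pattern (file name and lookup key are placeholders):

from coffea.lookup_tools import extractor

ext = extractor()
ext.add_weight_sets(["* * data/muon_sf.histo.root"])  # "* *" imports every object under its own name
ext.finalize()
evaluator = ext.make_evaluator()

# later, evaluate a lookup by name, e.g.
# sf = evaluator["muon_sf_2d"](muons.pt, abs(muons.eta))  # placeholder histogram name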
@magalli Also - let me know when you'd like a review - happy to help out and give commentary.
Hey - just a fair warning to be very careful with xgboost from PyPI: the default pip install pulls a ~500MB package with full GPU support, which is not typically needed (and is also slow compared to the forest inference library in NVIDIA Triton). The conda package is CPU-only and >10x smaller; there's some incantation for getting the right package via PyPI, I'll find it.
haddnano.py at sites:
Nulls:
import pandas as pd
import numpy as np

# 10M rows x 10 float columns: write a dense file, then null 40% of each column
df = pd.DataFrame(np.random.normal(size=(10_000_000, 10)), columns=list("ABCDEFGHIJ"))
df.to_parquet("dense.parquet")
for col in df.columns:
    df.loc[df.sample(frac=0.4).index, col] = pd.NA  # null out a random 40% of this column
df.to_parquet("nulled.parquet")
Summaries:
Nanoevents:
@smay - @magalli asked me to have a look. Below are some higher-level comments on the things you described or expressed a desire for. I'll get to the issue about github/lab/CI/etc. in a bit.
Data organization issue:
One major thing sticks out to me, on writing out multiple files because of systematic variations:
Some commentary on "remaining issues", based on my experience.
on xrootd-redirectors:
on dealing with bad files:
on not specifying all branches by hand:
You may achieve this by scanning the ROOT file metadata first to get the branch structure (this is done by default on opening the TTree in the ROOT file, since you have to do that to be able to deserialize the streamers). You can then use that knowledge to make lazy arrays for the branches you want, and construct objects and then event records out of those, all lazily, without making counter arrays for each branch you want (a big memory saver). You can even go so far as to implement gen matching, gen parentage tree walking, NanoAOD index-based cross cleaning, etc. as dynamically created branches, waiting until the last moment to read in the relevant lower-level information.
If you do not wish to go through the pain of implementing that yourself, or do not have time, it has been implemented in coffea for quite some time as "nanoevents": https://github.com/CoffeaTeam/coffea/tree/master/coffea/nanoevents (which is a misnomer now, since it reads Delphes and DAOD_PHYSLITE from ATLAS as well, with additional abstractions for other file formats and data sources). We have verified this all works quite well since early 2020, and people continually add to it and improve functionality.
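Minimal usage, for reference (the file name is a placeholder):

from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

# branches are only read when first accessed
events = NanoEventsFactory.from_root("nano_1.root", schemaclass=NanoAODSchema).events()
muon_pt = events.Muon.pt         # lazily loads nMuon / Muon_pt
mothers = events.GenPart.parent  # index-based gen parentage, built on demand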
You can use the processor interface locally in a batch job with the iterative executor within your job submission tool; that should be reasonably compatible (other people already do this to varying degrees of complexity). It should also be compatible with your frontend for systematics, with some small tweaks.
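For reference, a sketch with the coffea 0.7-style processor API (MyProcessor is a hypothetical processor.ProcessorABC subclass; fileset is a dict like the one sketched earlier):

from coffea import processor
from coffea.nanoevents import NanoAODSchema

out = processor.run_uproot_job(
    fileset,
    treename="Events",
    processor_instance=MyProcessor(),
    executor=processor.iterative_executor,
    executor_args={"schema": NanoAODSchema},
)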
getting summary information out of jobs in a json:
The object returned from each coffea job can be anything that defines + or update(), applied recursively to all subnodes. If you enforce in the framework that a subsection of that is pure json spec (dictionaries, lists, strings/bytes, numbers, ...), then it can be sliced out and serialized or displayed in some way automatically. We also have a switch for some uniform technical statistics like bytes read and wall/system times. You can find multiple examples of this specification in https://github.com/CoffeaTeam/coffea/blob/master/coffea/processor/executor.py#L478 (we call them "accumula(ta)bles"). Full analysis example here: https://github.com/nsmith-/boostedhiggs/blob/master/boostedhiggs/hbbprocessor.py - look for the make_outputs function and then follow the outputs variable through the code. For what you want, you'd have some sub-dictionary labelled "summaries" or something that could be made common via a base class.
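A toy example of how those outputs merge (key names are made up):

from coffea import processor
import json

# each job/chunk returns a nested dict; coffea merges them recursively,
# adding numbers and merging dicts key by key
out1 = {"summaries": {"n_input_events": 1000, "n_selected": 42}}
out2 = {"summaries": {"n_input_events": 500, "n_selected": 17}}
merged = processor.accumulate([out1, out2])
# merged["summaries"] == {"n_input_events": 1500, "n_selected": 59}

print(json.dumps(merged["summaries"]))  # the pure-json subsection serializes directly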
on optimizing job hyperparameters (files per job, chunk size, etc.):
on anything related to conda pack:
slow lxplus condor_submit:
Are you batching your submissions (using Queue commands in the submission file) instead of submitting individual jobs for each (group of) files, as much as you can? I do not know what Project Metis does to submit jobs.
Our CD also produces centos7 images, which are useful for people working in that software stack.