Discussion on systematics implementation

Hi, I'm opening this thread as a place to discuss the pros/cons of various ways of organizing the systematics with independent collections (ICs).

From the discussion in PR !2 (merged), we had settled on something like the following: if we have systematics that vary, for example, the photon pt, we can organize our events array as:

{ "Photon" : {
    "pt" : {
        "nominal" : [...],
        "up" : [...],
        "down" : [...]
    },
    "phi" : [...],
    ...
  }
}

Then, when we pass these events to a Tagger, it should automatically infer the number of independent collections from the shape of Photon.pt. I will call this the implicit method of dealing with systematics.

This is very similar in spirit to the proposed way of dealing with systematics in this coffea discussion.

Let's consider a more complex case now. Suppose we have Photon.pt and Photon.eta each varied up/down, so we want 5 ICs: nominal, pt_up, pt_down, eta_up, eta_down. Suppose our pt and eta selections are:

pt_cut # shape of events.Photon x 3
eta_cut # shape of events.Photon x 3

A Tagger should then have some method along the lines of:

Tagger.combine_cuts(pt_cut, eta_cut)

such that it properly outputs the 5 ICs: nominal (nominal pt, nominal eta), pt_up (pt up, nominal eta), pt_down (pt down, nominal eta), eta_up (nominal pt, eta up), and eta_down (nominal pt, eta down).
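To make the intended behavior concrete, here is a rough sketch of what such a method could do. This is not an existing Tagger implementation: the function signature, the (nominal, up, down) ordering along the first axis, the use of plain NumPy arrays, and the toy values are all assumptions for illustration.

```python
import numpy as np

# Assumed ordering of the variations along axis 0 of every cut.
VARIATIONS = ("nominal", "up", "down")


def combine_cuts(**cuts):
    """Combine per-variable cuts into independent collections (ICs).

    Each keyword argument is a boolean array of shape (3, n_photons),
    ordered as VARIATIONS. Only one variable is varied at a time;
    all other variables stay at their nominal value.
    """
    names = list(cuts)
    # Nominal IC: every variable at its nominal value.
    ics = {"nominal": np.logical_and.reduce([cuts[n][0] for n in names])}
    for varied in names:
        for idx, label in enumerate(VARIATIONS[1:], start=1):
            masks = [cuts[n][idx if n == varied else 0] for n in names]
            ics[f"{varied}_{label}"] = np.logical_and.reduce(masks)
    return ics


# Toy inputs (made up): 3 photons, rows are nominal / up / down.
pt = np.array([[24.0, 26.0, 30.0],
               [25.5, 27.0, 31.0],
               [23.0, 24.5, 29.0]])
eta = np.array([[2.4, 1.0, 2.6],
                [2.6, 1.1, 2.7],
                [2.3, 0.9, 2.45]])
pt_cut = pt > 25.0
eta_cut = np.abs(eta) < 2.5

ics = combine_cuts(pt=pt_cut, eta=eta_cut)
for name, mask in ics.items():
    print(name, mask)
```

With two varied variables this yields 1 + 2 x 2 = 5 masks, keyed nominal, pt_up, pt_down, eta_up, eta_down.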

This is doable, but I think there is a strong possibility of unintended behavior: suppose a user simply does pt_cut & eta_cut. This executes without any error, since the two cuts have the same shape, but the output has only 3 ICs: one is the nominal case, and the other two mix the pt and eta systematics (pt up with eta up, pt down with eta down). Similarly, a user might do Tagger.combine_cuts(pt_cut1, pt_cut2) rather than pt_cut1 & pt_cut2, which also runs without error but results in unintended behavior.
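The first pitfall is easy to reproduce with plain NumPy. In this toy example (values made up, and the first axis assumed to be ordered nominal, up, down), the naive & runs silently but its second row is not the pt_up collection we actually want:

```python
import numpy as np

# Toy inputs (made up): 3 photons, rows are nominal / up / down.
pt = np.array([[24.0, 26.0, 30.0],
               [25.5, 27.0, 31.0],
               [23.0, 24.5, 29.0]])
eta = np.array([[2.4, 1.0, 2.6],
                [2.6, 1.1, 2.7],
                [2.3, 0.9, 2.45]])
pt_cut = pt > 25.0
eta_cut = np.abs(eta) < 2.5

# Naive combination: elementwise AND, row by row. No error is raised.
naive = pt_cut & eta_cut

# What we actually want for the pt_up IC: pt up, eta nominal.
intended_pt_up = pt_cut[1] & eta_cut[0]

print(naive.shape)       # (3, 3): only 3 collections instead of 5
print(naive[1])          # pt up AND eta up: [False  True False]
print(intended_pt_up)    # pt up AND eta nominal: [ True  True False]
```

The first photon passes the intended pt_up selection but is dropped by the naive combination, and nothing in the output signals the difference.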

The alternative to the implicit method is to explicitly tell the Tagger about all of the systematic variations it needs to loop through, i.e.:

event_ICs = {
   "nominal" : events_nominal,
   "photon_pt_up" : events_photon_pt_up, # this should act exactly the same as events_nominal, but Photon.pt returns the varied up pt
   "photon_pt_down" : events_photon_pt_down,
   "photon_eta_up" : events_photon_eta_up,
   "photon_eta_down" : events_photon_eta_down
}
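As a minimal sketch of how the explicit method could look in practice: the selection is written once with no knowledge of systematics, and is simply applied to each IC in turn. The helper names (select_photons, vary), the plain-dict event layout, and the toy values are hypothetical and only meant for illustration.

```python
import numpy as np


def select_photons(events):
    """Toy selection, written once and unaware of systematics."""
    photon = events["Photon"]
    return (photon["pt"] > 25.0) & (np.abs(photon["eta"]) < 2.5)


def vary(events, field, values):
    """Return a copy of events with one Photon field replaced (hypothetical helper)."""
    return {"Photon": {**events["Photon"], field: np.asarray(values)}}


# Toy nominal events (made up): 3 photons.
events_nominal = {"Photon": {"pt": np.array([24.0, 26.0, 30.0]),
                             "eta": np.array([2.4, 1.0, 2.6])}}

event_ICs = {
    "nominal": events_nominal,
    "photon_pt_up": vary(events_nominal, "pt", [25.5, 27.0, 31.0]),
    "photon_pt_down": vary(events_nominal, "pt", [23.0, 24.5, 29.0]),
    "photon_eta_up": vary(events_nominal, "eta", [2.6, 1.1, 2.7]),
    "photon_eta_down": vary(events_nominal, "eta", [2.3, 0.9, 2.45]),
}

# Explicit loop: one pass of the same selection per IC.
selections = {name: select_photons(ev) for name, ev in event_ICs.items()}
for name, mask in selections.items():
    print(name, mask)
```

Here an ordinary & inside the selection is always correct, because each IC only ever sees one consistent set of values; the cost is one pass over the events per IC.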

The main downsides of the explicit method are that it will likely perform worse (it is less vectorized, since the selection runs once per IC), and that a Tagger/TagSequence then needs to know beforehand about all of the systematics with ICs that you are evaluating, rather than detecting them automatically. On the other hand, you could argue that this is actually a benefit: the user is forced to say exactly which systematics they are evaluating, rather than having the framework guess for them.

To summarize,

Computing performance: implicit approach is better
Risk of unintended behaviors: explicit approach is safer

What are other people's thoughts?

My initial feeling is that I would prefer the explicit approach: I think it makes it clearer what is being done, it forces the user to define which systematics they want to evaluate, and it leaves fewer opportunities for users to make mistakes unintentionally.

Edited Apr 22, 2021 by Samuel May