New approach for storing jet systematic information
Brief ntuple size (kB/event) history:
Type | V12 | V14 | V15a | V15b |
---|---|---|---|---|
MC (ttbar) | 6.3 | 5.5 | 9.0 | 5.5 |
Data (MET) | 1.6 | 1.0 | ? | ? |
From V12 to V14, we stopped storing the soft jets (pT < 30 GeV).
For the upcoming V15, keeping 5 copies of all of the AK4 jet properties (vector branches) takes a significant amount of space (see V15a). (Previously, we only kept a few property branches, but now more are needed, e.g. for top tagging or b-tagging studies.)
Instead, we can store the list of original indices for the systematically varied jet collections (which get re-sorted by pT), which removes the need to duplicate the property branches. As a further optimization, we don't even need to store the 4-vectors for the systematic collections (saving 0.5 kB/evt), because we already store the JEC and JER uncertainty factors, which can easily be used to rescale the central 4-vectors. This way, the least trivial operation (sorting the jet collection) doesn't have to be repeated, and no information is duplicated. (We still keep the scalar computed quantities, like NJets etc., which also involve some potentially nontrivial computations.)
However, to make this work, we once again need to store all the jets down to the miniAOD cut of pT > 10 GeV. This causes the size of the central jets collection to increase substantially, but there's still an overall saving by eliminating the systematic collections (see V15b). It also enables studies with lower-pT jets, which may be desired by some analyses. (I tried just going down to pT > 20 GeV, but there are still some jets below 20 GeV that can fluctuate up above 30 GeV with the systematics.)
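As a rough illustration of the index-based scheme described above, here is a minimal Python sketch for a single event and a single variation (JEC up). The array names (`central_pt`, `central_btag`, `jec_unc`, `orig_index`) and all numerical values are hypothetical and are not the actual branch names or contents; the sketch also assumes the uncertainty is stored as a relative factor applied as `1 + unc`.

```python
import numpy as np

# Hypothetical central jet branches for one event (illustrative values only).
central_pt = np.array([50.0, 48.0, 29.0, 12.0])    # GeV, sorted by central pT
central_btag = np.array([0.10, 0.95, 0.80, 0.05])  # any per-jet property branch
jec_unc = np.array([0.01, 0.08, 0.05, 0.10])       # assumed relative JEC uncertainty

# Producer side (done once): apply the variation, sort by varied pT,
# and store only the resulting list of original indices.
varied_unsorted = central_pt * (1.0 + jec_unc)
orig_index = np.argsort(-varied_unsorted, kind="stable")

# Analyzer side: rebuild the varied collection from the central branches,
# the stored uncertainty factors, and the stored index list.
varied_pt = central_pt[orig_index] * (1.0 + jec_unc[orig_index])
varied_btag = central_btag[orig_index]  # properties are looked up, not duplicated

# Count jets above the 30 GeV threshold before and after the variation.
n_central = int((central_pt > 30.0).sum())
n_varied = int((varied_pt > 30.0).sum())

print(orig_index.tolist(), varied_pt.tolist(), n_central, n_varied)
```

In this toy event the two leading jets swap order under the variation, and the 29 GeV jet moves above 30 GeV, which is only recoverable if the soft jet is present in the central collection.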
Another consideration is data vs. MC. In V12 we have approximately 2x as many data events as MC events. We don't store systematic variations for data, so including the soft jets down to 10 GeV will only increase the data event size. But even if we assume the data event size goes back to 1.6 kB, the MC size decreases enough (relative to V15a) that we save space overall. It is probably worthwhile to have consistent pT cuts in data and MC to avoid confusion. We should double-check this once we have 94X miniAODv2 data files available.
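The overall estimate can be checked with a short back-of-the-envelope calculation. The sketch below uses the MC sizes from the table and the 2:1 data-to-MC event ratio quoted for V12. The data sizes for V15a and V15b are not measured yet, so the values used here are assumptions: 1.0 kB for V15a (unchanged from V14, since soft jets are not stored there) and 1.6 kB for V15b (the pessimistic case mentioned above).

```python
# Event sizes in kB/event. MC values are taken from the table above.
mc_size = {"V12": 6.3, "V14": 5.5, "V15a": 9.0, "V15b": 5.5}

# Data values for V15a/V15b are ASSUMED (the table lists them as "?").
data_size = {"V12": 1.6, "V14": 1.0, "V15a": 1.0, "V15b": 1.6}

# Approximate number of data events per MC event (from V12).
DATA_PER_MC = 2.0


def total_per_mc_event(version):
    """Combined data + MC size, normalized to one MC event."""
    return mc_size[version] + DATA_PER_MC * data_size[version]


totals = {v: round(total_per_mc_event(v), 2) for v in mc_size}
for version, total in totals.items():
    print(f"{version}: {total} kB per MC event (with {DATA_PER_MC:g} data events)")
```

Under these assumptions, V15b comes out smaller than V15a in total even with the larger data events, though the real data sizes still need to be measured.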