Decide on Xbb dumping strategy
@dguest @miochoa @nkakati @dkobylia and I discussed the idea of re-using the dump-single-btag machinery to also produce samples for Xbb tagger trainings. The general idea would be to write out information about the large-R jet and its associated tracks, similar to how we write single b-jets and their associated tracks. This would change how the subjet information is handled:
- There would only be one track h5 dataset per large-R jet. We could decorate the tracks with an int index to the subjet they are in, but we wouldn't write out separate track groups for each subjet (which is how things have been done previously if I understand correctly).
- We could write out the subjet info as a 2d h5 dataset called subjets, with the first index selecting the subjet, and the second selecting some variable about the given subjet.
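To make the proposed layout concrete, here is a minimal sketch for a single large-R jet, using NumPy arrays as stand-ins for the h5 datasets. The variable names (pt, eta, phi, d0), the field name subjet_index, and the use of -1 for "track not in any subjet" are all assumptions for illustration; none of them are decided in this issue.

```python
import numpy as np

# Hypothetical subjet variables; the real list is still to be decided.
SUBJET_VARS = ["pt", "eta", "phi"]

# "subjets" for one large-R jet: shape (n_subjets, n_variables).
# First index selects the subjet, second selects the variable.
subjets = np.array([
    [250.0, 0.10, 1.20],   # subjet 0
    [120.0, -0.30, 1.50],  # subjet 1
])

# A single track dataset for the whole large-R jet. Each track carries an
# integer index of the subjet it belongs to (-1 is an assumed convention
# for "not in any subjet").
track_dtype = np.dtype([("pt", "f4"), ("d0", "f4"), ("subjet_index", "i4")])
tracks = np.array([
    (30.0, 0.01, 0),
    (12.0, -0.05, 0),
    (8.0, 0.20, 1),
    (1.5, 0.02, -1),
], dtype=track_dtype)


def subjet_var(subjets, i_subjet, name):
    """Look up one variable of one subjet in the 2d array."""
    return subjets[i_subjet, SUBJET_VARS.index(name)]


def tracks_in_subjet(tracks, i_subjet):
    """Recover the per-subjet track group from the single track dataset."""
    return tracks[tracks["subjet_index"] == i_subjet]


leading_pt = subjet_var(subjets, 0, "pt")
n_tracks_per_subjet = [
    len(tracks_in_subjet(tracks, i)) for i in range(len(subjets))
]
```

The point of the sketch is that the separate per-subjet track groups written previously can still be recovered downstream with a simple mask on the index, so no information is lost by writing one track dataset per large-R jet.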
This would require major code changes to the single btag functions to accommodate the large-R jets, so we wanted to get some feedback before implementing anything. Tagging @vvecchio and @arelycg, who may be able to comment on whether the suggested output format would be acceptable.
Alternative
The alternative is to keep the existing dump-hbb functions, which have not been maintained in r22 and lack many features that have been added to the single btag code. There, the subjets and tracks are all written out separately. The advantage of separate code is that the Xbb group has more flexibility in what they add. The disadvantages are that we need to do some work to get the code running (much progress has already been made by @miochoa in #29 (closed)), port missing features from the single btag code, and then maintain two different sets of code going forward for the different use cases.
TODO
- Figure out what happened to GhostTruthLabelID, GhostTruthLabelPt, and the Extended versions
- Write previous Xbb tagger scores !424 (merged)
- Write subjet information
- Decorate tracks with subjet ID