Indexing jets rather than copying tracks
It's wasteful to copy and rewrite tracks in every step of the preprocessing: all we really need at those steps is a few jet parameters and the event index. I think this came up in a merge request (!285 (comment 5059782)) and may be related to #86 (closed), so I'm creating a ticket.
There are a few issues to work out overall.
- It's not completely trivial to index jets across many input files. We might try putting everything into a virtual dataset, or go with something more home-grown, but storing the file path as part of the jet index seems too heavy.
- We could also drop quite a bit of jet information: we really only need the pt, eta, and truth label to do most of the resampling.
- If we did end up applying further track selection (as @svanstro suggested in #86 (closed)), it would have to be pushed back to a later step in the preprocessing. Maybe not that big a deal but it sort of fragments the selection stages.
- At what step would we reconstitute the dataset? Right before the final flatten + scaling stage?
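To make the points above concrete, here is a rough sketch of what a lightweight index could look like. Nothing in it is taken from the existing code: `INDEX_DTYPE`, `build_index`, and `gather_tracks` are hypothetical names, the small integer `file_id` is just one way to avoid storing file paths, and in-memory NumPy arrays stand in for the per-file HDF5 datasets.

```python
import numpy as np

# Hypothetical index layout: a small integer file ID instead of a file path,
# plus the few jet variables needed for resampling.
INDEX_DTYPE = np.dtype([
    ("file_id", np.uint16),
    ("jet_idx", np.uint32),
    ("pt", np.float32),
    ("eta", np.float32),
    ("label", np.uint8),
])


def build_index(jets_per_file):
    """Concatenate per-file jet arrays into one lightweight index."""
    parts = []
    for file_id, jets in enumerate(jets_per_file):
        part = np.empty(len(jets), dtype=INDEX_DTYPE)
        part["file_id"] = file_id
        part["jet_idx"] = np.arange(len(jets))
        for name in ("pt", "eta", "label"):
            part[name] = jets[name]
        parts.append(part)
    return np.concatenate(parts)


def gather_tracks(index, tracks_per_file):
    """Reconstitute tracks for the selected jets, preserving index order."""
    first = tracks_per_file[0]
    out = np.empty((len(index),) + first.shape[1:], dtype=first.dtype)
    for file_id in np.unique(index["file_id"]):
        mask = index["file_id"] == file_id
        out[mask] = tracks_per_file[int(file_id)][index["jet_idx"][mask]]
    return out


# Toy inputs: two "files" with 4 and 3 jets, 5 track slots per jet.
rng = np.random.default_rng(0)
jet_dtype = [("pt", "f4"), ("eta", "f4"), ("label", "u1")]
jets_per_file = [np.zeros(n, dtype=jet_dtype) for n in (4, 3)]
tracks_per_file = [rng.normal(size=(n, 5)).astype("f4") for n in (4, 3)]

index = build_index(jets_per_file)
selected = index[[6, 0, 5]]  # stand-in for whatever the resampling picks
tracks = gather_tracks(selected, tracks_per_file)
```

Every intermediate step would then only touch `index`, and `gather_tracks` would run once at whichever stage we decide to reconstitute. With real HDF5 inputs the per-file read would likely need sorted indices, since h5py's fancy indexing expects them in increasing order.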
There are probably more questions; I just wanted to consolidate the discussion somewhere. Tagging @mguth, @svanstro, and @sfranche, who were all part of the tickets / MRs above.