Data discovery may or may not belong more in the initialize step in CAH
I wasn't sure whether to register this issue in CAFExample or CAFCore, since it affects both. Right now, we provide methods for (and plan to do) the following in the first two steps of an analysis:
Prepare - construct a sample folder with the desired DSIDs by using an XSec parser (typically with mapping and whitelist files provided in addition). Also use the XSec parser to discover data files on some file system (either through an 'ls' in the framework when a path is provided, or from a directly provided list of input data files) and add these data files to the sample folder. Note that this step already requires access to the actual data files if only a path is provided.
Initialize - all that has to be done in this step at the moment is to run the sample initializer on the MC samples, which opens each tree to extract the sumOfWeights and compute the sample normalization.
It could make sense, however, to move the addition of the data files to the initialize step. In that case, the actual input files (both data and MC) would only need to be accessible during the second step, and the first step could succeed even if the input files are not available.
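To make the proposed split concrete, here is a minimal sketch in plain Python. It does not use the CAF API: `SampleFolder`, `prepare`, `initialize`, and the `*.root` pattern are hypothetical stand-ins chosen for illustration only. The point it shows is that prepare never touches the file system, while initialize is the first step that needs the input files.

```python
import glob
import os
from dataclasses import dataclass, field


@dataclass
class SampleFolder:
    """Hypothetical stand-in for a sample folder (not the CAF class)."""
    tags: dict = field(default_factory=dict)
    files: list = field(default_factory=list)


def prepare(channel):
    # No file-system access here: only create the data path and set its tags.
    data = SampleFolder(tags={"channel": channel, "usemcweights": False})
    return {"data": data}


def initialize(samples, data_dir, pattern="*.root"):
    # Data discovery happens only now, so this is the first step
    # that requires the input files to be reachable.
    found = sorted(glob.glob(os.path.join(data_dir, pattern)))
    samples["data"].files.extend(found)
    return samples
```

With this layout, `prepare("em")` succeeds on a machine without any input files, and `initialize(samples, "/some/data/dir")` can be run later where the files are mounted.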
Here are some of my current thoughts on the move:
- Would we then add a data path already during prepare and just fill it with the input files during initialize? This would allow the typical tags such as channel, cand, usemcweights, etc. to be applied to the data sample folder during the first step as well.
- The benefit is already stated above. One downside, however, is that this would make the parallelization of the initialization step a bit more complicated, since the data only needs to be added once, in one of the jobs.
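One possible way to handle the parallelization concern is to designate a single job to add the data. The sketch below is plain Python and purely illustrative: `initialize_job`, `merge`, and the sample names are hypothetical and not part of CAFCore, and it assumes the per-job results are merged afterwards.

```python
def initialize_job(job_index, n_jobs, mc_samples, data_files):
    # Each job initializes its own slice of the MC samples.
    my_mc = mc_samples[job_index::n_jobs]
    # Only the designated job (index 0 here) adds the data files,
    # so they are not duplicated when the job outputs are merged.
    my_data = list(data_files) if job_index == 0 else []
    return {"mc": my_mc, "data": my_data}


def merge(results):
    # Concatenate the per-job outputs into one combined result.
    merged = {"mc": [], "data": []}
    for result in results:
        merged["mc"].extend(result["mc"])
        merged["data"].extend(result["data"])
    return merged
```

For example, running `initialize_job(i, 3, mc, data)` for `i` in `range(3)` and merging the results yields every MC sample once and the data files once. An alternative would be a dedicated data-only job, at the cost of one more job to schedule.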