Data dependencies for persistency
Currently, data dependencies are only known for the reconstruction objects in a few selected locations. All other locations are picked up from the TES by PackParticlesAndVertices. This is neither maintainable nor functional, so I got rid of this algorithm. Now, for all objects (mostly line outputs), we need to give the input and output locations explicitly, which also requires knowing what object lives in a given location in order to use the right packing algorithm. It gets even messier for relation tables. For the moment I am doing some guesswork to put things in the right places, but that's obviously a bad idea...
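To make concrete what "giving the locations explicitly" implies, here is a minimal sketch in plain Python; all names, packers and locations are invented and this is not the actual Moore/LHCb configuration API, just an illustration of the bookkeeping involved.

```python
# Purely illustrative, not real configuration code: every persisted location has
# to be declared together with the type of object stored there, so that the
# matching packing algorithm can be scheduled (relation tables need their own
# packers as well). All names below are invented.
PACKER_FOR_TYPE = {
    "Particles": "PackParticles",
    "RecVertices": "PackRecVertices",
    "P2VRelations": "PackP2VRelations",
}

def configure_packing(declared_outputs):
    """declared_outputs: dict mapping TES location -> type of object stored there."""
    packing_jobs = []
    for location, obj_type in declared_outputs.items():
        if obj_type not in PACKER_FOR_TYPE:
            # This is exactly the guesswork problem: without knowing the type,
            # we cannot pick the right packer for the location.
            raise RuntimeError(f"No packer known for {obj_type} at {location}")
        packed_location = "/Event/Packed" + location[len("/Event"):]
        packing_jobs.append((PACKER_FOR_TYPE[obj_type], location, packed_location))
    return packing_jobs

# Example: one line output and its vertices, declared explicitly.
print(configure_packing({
    "/Event/HLT2/MyLine/Particles": "Particles",
    "/Event/HLT2/MyLine/DecayVertices": "RecVertices",
}))
```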
So the issue is how to get the scheduler to figure all this out.
@jonrob @graven @rmatev @gligorov
Related MRs: !1085 (merged), LHCb!3268 (merged)
Update following the coordination discussion on 17/12/2021 (and discussions in the persistency meeting on 16/12/2021)
The current plan is the following:
- Use the information known at the time of writing the data to create a configuration (json) file that documents all the data dependencies in the persistent data that was configured to be saved.
- Update the decoding algorithm that runs when reading the data to use this json file to recreate the data dependency tree. Users configure their jobs to request certain 'top level' data containers, the end products of their favourite HLT2 lines. The decoding algorithm uses the data dependencies deduced from the json file to create the 'hidden' data dependencies (the data containers referenced by their top-level selections) and ensures these are created in the correct order, etc. (a sketch of what this could look like follows below).
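As a rough illustration of the plan (not a proposal for the actual schema), the json could map each persisted container to the containers it references, and the reading configuration would walk that map from the requested top-level containers down to the hidden dependencies. All keys and locations below are invented; a version field is included since schema evolution has to be supported from the start.

```python
# Illustrative only: an invented shape for the dependency json and a simple
# depth-first walk that returns containers in an order where dependencies
# come before the objects that reference them.
import json

manifest = json.loads("""
{
  "version": 1,
  "dependencies": {
    "/Event/HLT2/MyLine/Particles": ["/Event/HLT2/MyLine/DecayVertices",
                                     "/Event/HLT2/MyLine/P2PVRelations"],
    "/Event/HLT2/MyLine/P2PVRelations": ["/Event/HLT2/MyLine/PVs"],
    "/Event/HLT2/MyLine/DecayVertices": [],
    "/Event/HLT2/MyLine/PVs": []
  }
}
""")

def unpacking_order(requested, dependencies):
    """Return every container needed for `requested`, dependencies first."""
    ordered, seen = [], set()

    def visit(location):
        if location in seen:
            return
        seen.add(location)
        for child in dependencies.get(location, []):
            visit(child)
        ordered.append(location)

    for location in requested:
        visit(location)
    return ordered

# The user asks only for the line's top-level particles; the 'hidden'
# dependencies (vertices, relations, PVs) are pulled in automatically.
print(unpacking_order(["/Event/HLT2/MyLine/Particles"], manifest["dependencies"]))
```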
The above is the 'minimum' needed for the start of Run 3, as we need to ensure the json information is enough to allow the reading jobs to be configured correctly. Support for schema evolution (as it will need to evolve over time as unknown issues come up) needs to be baked in from the start.
What the above does not discuss, intentionally, is the exact mechanism by which the json file is propagated to the reading job. This is something that can evolve once we have a working implementation: e.g. in the first implementation the user has to directly pass the json file when configuring their reading job. This would work, but it is clearly not as user friendly as we would want, so various possible solutions to automating this were discussed:
- Save the json with the ANN information.
- Save the json as a compressed blob in an additional raw bank, every event.
- Save the json as metadata per file (i.e. FSR-like).
- Anything else...
The details of the right approach here can evolve with further discussion, once we have option 0 (users pass the file themselves) working.
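For concreteness, option 0 could look something like the snippet below from the user's side; the option names are invented and only meant to show where the automation options above would eventually plug in (they would simply replace how the manifest path gets filled in).

```python
# Invented option names, for illustration of 'option 0' only: the user passes
# the json manifest by hand when configuring the reading job.
reading_job_options = {
    "input_files": ["hlt2_output.mdf"],
    "dependency_manifest": "hlt2_dependencies.json",  # passed explicitly for now
    "requested_containers": ["/Event/HLT2/MyLine/Particles"],
}
```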
@graven @sesen @rmatev @raaij If this doesn't agree with your recollection of the meetings over the last few days, please shout.
@gligorov FYI