Draft: Expanded the xAOD RDataFrame helpers to include friend trees
The xAOD RDataFrame helpers allow RDataFrame to be used on a tree containing an xAOD. This extends these tools so that additional non-xAOD friend trees can be attached and read with RDataFrame at the same time as the primary xAOD tree.
Mechanics
At its core, the xAOD RDataFrame helper provides a new RDataSource subclass that handles the xAOD. To handle friend trees, that RDataSource subclass was extended, following the design of RRootDS.cxx. The parts that handle the xAOD tree are essentially unaltered by this extension.
(Perhaps there is some way to use multiple inheritance that would use both classes without changing the existing RDataSource, or copying anything from RRootDS. I couldn't see how to get round the methods marked final, so I settled for extending the xAOD RDataSource.)
The friend tree is accessed from a second TChain. While this is wasteful of memory, because the xAOD is loaded twice, there doesn't appear to be an alternative.
- If I try to access the friend tree directly from the TChain used to read the xAOD, I get errors that appear to indicate that RDataFrame is confused by having the aux store loaded. For example:
Traceback (most recent call last):
  File "/home/dayhahen/jetydaod/doodles/xAOD_RDF_friends.py", line 23, in <module>
    mean_pt = filtered.Mean("pt").GetValue()
cppyy.gbl.SG.ExcNoAuxStore: const double& ROOT::RDF::RResultPtr<double>::GetValue() =>
    ExcNoAuxStore: SG::ExcNoAuxStore: Requested aux data item `::pt' (38) but there is no associated aux data store.
- If I were to omit the xAOD tree from the TChain used for the friends, the two might not stay in sync. The number of events in the friend tree can be truncated by the primary tree, so I think it's best to always have the primary xAOD tree in the TChain.
So I have settled for simply handling xAOD access and friend tree access from different TChains, both of which contain the xAOD, but only the latter of which contains friend trees.
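The synchronisation concern can be illustrated with a toy model in plain Python. This is not ROOT code: the lists stand in for tree entries and `read_in_lockstep` is a hypothetical name. It assumes, as described above, that entries are paired by index and that the primary tree sets the number of events read. Raising an error when the friend is too short is a choice made for this sketch only.

```python
def read_in_lockstep(primary_entries, friend_entries):
    """Pair entries by index; the primary tree sets the event range.

    Toy stand-in for a chain with friends, not ROOT's implementation.
    """
    n_events = len(primary_entries)
    if len(friend_entries) < n_events:
        # Choice made for this sketch; ROOT's behaviour is not modelled here.
        raise ValueError("friend has fewer entries than the primary tree")
    # Friend entries beyond the primary's range are never read.
    return [(primary_entries[i], friend_entries[i]) for i in range(n_events)]


primary = ["evt0", "evt1", "evt2"]              # stand-in for xAOD events
friend = ["trig0", "trig1", "trig2", "trig3"]   # one extra friend entry

pairs = read_in_lockstep(primary, friend)
print(pairs)
```

Without the primary entries to fix the range, nothing in this model would say which friend entries belong to the event loop.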
Interface
Using Python, the previous API was:
import ROOT; ROOT.xAOD.Init(); ROOT.xAOD.JetContainer_v1()
from xAODDataSource import Helpers
# Primary tree contains kinematics
primary_glob = "/home/dayhahen/jetydaod/example_data/rucio/2022_datasets/data22_13p6TeV.00432180.physics_Main.deriv.DAOD_PHYS.f1264_m2124_p5334_tid30924306_00/DAOD_PHYS.30924306._*.pool.root.1"
simple_xAOD_df = Helpers.MakexAODDataFrame(primary_glob)
The new API can be used like this:
import ROOT; ROOT.xAOD.Init(); ROOT.xAOD.JetContainer_v1()
from xAODDataSource import Helpers
# Primary tree contains kinematics
primary_glob = "/home/dayhahen/jetydaod/example_data/rucio/2022_datasets/data22_13p6TeV.00432180.physics_Main.deriv.DAOD_PHYS.f1264_m2124_p5334_tid30924306_00/DAOD_PHYS.30924306._*.pool.root.1"
# As per the old API:
simple_xAOD_df = Helpers.MakexAODDataFrame(primary_glob)
# Or with friend trees:
primary_tree = "CollectionTree"
friend_glob = "/home/dayhahen/jetydaod/example_data/rucio/2022_datasets/data22_13p6TeV.00432180.physics_Main.deriv.DAOD_PHYS.f1264_m2124_p5334_tid30924306_00_friends/DAOD_PHYS.30924306._*.pool_friend.root.1"
friend_tree = "triggers"
# Make the df with both of them (could have more than one friend if needed)
friended_df = Helpers.MakexAODDataFrame(primary_glob, primary_tree, [friend_glob], [friend_tree])
It's also possible to replace the globs with lists of file names.
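As a minimal sketch of the list form, a glob can be expanded into an explicit list of file names first. `expand_inputs` is a hypothetical helper written for this example, not part of xAODDataSource, and the files here are throwaway temporaries rather than real DAODs.

```python
import glob
import os
import tempfile


def expand_inputs(files):
    """Return a list of file names from a glob string or a list.

    Hypothetical helper; not part of xAODDataSource.
    """
    if isinstance(files, str):
        # Sort so that the file order is reproducible.
        return sorted(glob.glob(files))
    return list(files)


# Demonstrate on empty throwaway files rather than real DAODs.
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        open(os.path.join(tmp, f"DAOD_PHYS.{i}.pool.root.1"), "w").close()
    from_glob = expand_inputs(os.path.join(tmp, "DAOD_PHYS.*.pool.root.1"))
    from_list = expand_inputs(from_glob)
    names = [os.path.basename(f) for f in from_glob]
    print(names)
```

A list built this way could then be given in place of `primary_glob` or `friend_glob`, which also makes it easy to drop or reorder individual files.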
I could have added the option to provide a TChain as input, but I opted not to because:
- Technically, a TChain is capable of making one tree out of multiple trees of different names (provided their content matches). This is true of both the primary tree and its friends. Currently the RDataSource doesn't support that.
- There may be other things a TChain can do that I'm unaware of. Each thread needs access to the input data, and TChain's copy constructor is disabled. Because it lacks a copy constructor, I'd need to be confident that I could create an equivalent data source for each thread, which means knowing all the things that could be specified by a TChain. It is much easier to insist that the user gives file and tree names for the friends, restricting the input to things I understand.
However, if that decision is unpopular, I'm happy to change this.
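The per-thread argument can be sketched in plain Python. `InputSpec` and `open_for_slot` are hypothetical names, not taken from the actual data source, and the file names are placeholders. The point is only that a description made of file and tree names can be turned into an independent reader for each thread, which is harder to guarantee for an arbitrary TChain.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InputSpec:
    """Plain description of the inputs: file names and tree names only."""
    primary_files: Tuple[str, ...]
    primary_tree: str
    friend_files: Tuple[Tuple[str, ...], ...] = ()
    friend_trees: Tuple[str, ...] = ()


def open_for_slot(spec, slot):
    """Stand-in for building a per-thread reader from the spec.

    A real data source would construct new TChains here; this sketch
    returns a dict so that it runs without ROOT.
    """
    friends = [
        (tree, list(files))
        for tree, files in zip(spec.friend_trees, spec.friend_files)
    ]
    return {
        "slot": slot,
        "primary": (spec.primary_tree, list(spec.primary_files)),
        "friends": friends,
    }


spec = InputSpec(
    primary_files=("a.root", "b.root"),  # placeholder file names
    primary_tree="CollectionTree",
    friend_files=(("a_friend.root", "b_friend.root"),),
    friend_trees=("triggers",),
)

# One independent reader per thread, all built from the same description.
readers = [open_for_slot(spec, slot) for slot in range(4)]
print(readers[0])
```

Because the spec is immutable and holds only strings, every slot gets an equivalent reader without needing to copy any ROOT object.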
Tests
It comes with 3 new tests: test/dataFrameFriends_test.cxx, test/dataFrameFriends_test.py and test/dataSourceFriends_test.cxx. These run in the CI. Some of their warnings are ignored, but the values they calculate are checked.