Skip to content

Draft: Experimental DAG implementation and YAML flattening in TextConfig for the ease of maintenance

Ho Chun Lau requested to merge holau/athena:fixConfigText into main

There are a few issues in the TextConfig.py and in ConfigFactory.py that makes maintenance challenging and difficult to use.

  • ConfigFactory register the available Config Blocks from other modules with addAlgConfigBlock(algName, alg, superblocks) where alg is a class or a function that point to a ConfigBlock/Sequence, and the superblocks specify the dependency into ConfigFactory._algs.
    • algName is not unique and superblocks is over generalized. For example, there are multiple algName=WorkingPoint, but they are pointing to different ConfigBlock Photon.WorkingPoint, Electron.WorkingPoint etc. Each of them has different implementation. At the first glance we can set algName with a prefix, but we immediately see that ptEtaSelection can be applied to multiple objects, but with the same implementation: naming like Electron.ptEtaSelection and Photon.ptEtaSelection will point to the same ConfigBlock. At second glance this is nasty but still working, both example must be a sub-algo, we can specify the dependency with superblocks, and check the WorkingPoint/ptEtaSelection is nested under which config. Then you will find SystObjectLink depends on Electrons and Photons but do not live under their config. One way to resolve this is to record the order and dependency in a nested structure as in the original design.
    • Due to the nested structure, retrieving object must be performed recursively each time. It makes every interactions more complicated. There is a hot-fix inside TextConfig.configure to add a placeholder, and clean it up later.
    • Keep track of the dependency recursively is very difficult (in ConfigFactory._order). The above example has shown there are multiple situations already. Relying solely on superblocks forced the whole setup to be nested.
  • The separation of responsibility is poor, and argument to initialize it are separated everywhere. This originates from two edge cases where ConfigBlock cannot be initialized without supplying a config.
    • Each FactoryBlock has to run .makeConfig() to create ConfigSequence. In the current design, it is built around two edge cases: Jets and EventSelection. In the TextConfig.configure(), it has responsibility to propagate options, feed the config to FactoryBlock, initialize it, set the value, check for excess config. I believe the authors put them here because they also found that you cannot possibly know the available options of two edge cases without giving them a config that contains the available options. In other ConfigBlock, you can use .getOptions. Because FactoryBlock cannot initialize ConfigBlock and getOptions, the burden of checking excess which suppose to be in FactoryBlock, is moved to the main thread, where all ConfigBlock can be initialized.
    • It also creates a problem that there are extra variables which can be contained in a single dictionary, now has to be separated into funcOpts and algOpts but they have the same purpose. Those options of course, have to be passed recursively too.

My redesign is trying to attack the problem above. I keep most of the interface untouched. The only two I changed is ConfigFactory.addAlgConfigBlock, and rename ConfigFactory._algs to ConfigFactory.catalog to address its purpose.

  • Many of the problem above can be solved by redesigning FactoryBlock and addAlgConfigBlock.
    • addAlgConfigBlock(self, algName, alg, defaults=None, hardDepend=None, softDepend=None, applyTo=None)
      • HardDepend to address blocks must exist in the yaml, it does not need to be a superblock in the yaml. By default there is a placeholder root.
      • SoftDepend is for ordering in DAG topological sort. It is not required to be in the yaml, but if it does, the ConfigBlock will be queued behind them.
      • ApplyTo to address what this block has to be under. By default set to HardDepend which links to root most of the time.
    •    # Here I show the FactoryBlock attributes
         class FactoryBlock(alg, algName, options, defaults, hardDepend, softDepend, fullname):
             self.uniqueID = str(uuid.uuid4())[:8]
             self.alg = alg
             self.fullName = fullName
             self.algName = algName
             self.options = options
             self.defaults = defaults if defaults else {}
             self.subConfigBlock = [] # For dynamical linking as DAG, store as a dict or a list is a quesiton
             self.softDepend = softDepend if softDepend else []
             self.hardDepend = hardDepend if hardDepend else ['root']
    • Things changed in FactoryBlock is uniqueID, fullName, and hard/softDepend. I renamed subConfigBlock and changed it to a list (of uniqueID)
    • The fullName is in the format of <super>.<algName>, root.Jets, Electron.WorkingPoint. The name also implies what can this FactoryBlock possibly applies to. This will be passed as a key to register in the catalog with addAlgConfigBlock. The naming is used to link the yaml naming convention to the correct ConfigBlock. Therefore, there indeed will have Electrons.ptEtaSelection, Photons.ptEtaSelection with some cost of duplication, which also exist in the old design.
    • subConfigBlock are the sub-blocks in the yaml. They will be stored as a unique ID instead of just the full name to avoid pointing to a block live outside the scope but sharing the same naming.

The flow of the yaml parsing is much simpler now.

  1. First setup a ConfigFactory and register all the available config and its dependency.
  2. When reading the yaml, it records tuple (name, FactoryBlock, config) to TextConfig.config_dag_entries.
  3. Then it cleans up the nested subblock config through checking the registration to avoid clean up dictionary-like options but not a ConfigBlock.
  4. It applies a DAG-topological sort according to the FactoryBlock.subConfigBlock and FactoryBlock.softDepend. The Topological sort is so powerful at the same time having a weakness. I added code to shuffle the list of FactoryBlock, it runs without problem most of the time. But the weakness is that the dependency must be 100% clear in addAlgConfigBlock without any ambiguity, which I didn't foresee. I found there must be some soft dependency I missed right now and causing its failure sometimes after shuffle. Otherwise it is working in logical sense.
  5. After loading it runs a hard-dependency check.

Here is what will happen if you run TextConfig.configure()

  1. It propagates ['containerName', 'skipOnData', 'skipOnMC', 'onlyForDSIDs'] according to the structure in Yaml which encoded in FactoryBlock.subConfigBlock, if they are nested in Yaml, they will be propagated within reason. This propagation is non-recursive as they are now linked as if a ListNodes.
  2. It runs FactoryBlock.makeConfig(config) on each block, there will be excess options checking within each Block.
  3. Add the returns value to ConfigSequence

As the interface is not touched, it runs on both Athena and AnalysisBase without problem (with the CPRun I wrote 😄 ). But more tests are still needed. This implementation must fail the ConfigText_unitTest because the test sorts the ConfigSequence by classname, where Electron.ptEtaSelections may be comparing with Photons.ptEtaSelections as both of them has classname ptEtaSelection. The test only works if we have exact ordering, but we do not. Even so, it does not mean the topological sort ordering is wrong, it just means there are multiple ordering will result in the same root file.

I am here to ask for suggestions and comments!

Merge request reports

Loading