CaloRecGPU Offline and Online Configuration and Clean-Up
Following from the great efforts in updating the CaloRecGPU configuration to a flags-based style (namely !68106 (merged) and !68546 (merged)), we are now able to provide a proper configuration for both offline and online, with extraneous options cleaned up and new ones added as needed, providing what I hope is a sensible interface that encourages code reuse. In particular, we try to leverage the `ActiveConfig` convention to unify the offline and online instantiations, which I think is a pretty powerful mechanism that may improve the clarity of the configuration here and elsewhere.
My personal guidelines while creating this configuration:
1. Flags enable configuring every possible set of behaviours we might want to change (or even just test) in the tools/algorithms to be instantiated.
2. `...Cfg` functions can receive `**kwargs` to potentially override other options directly (using a loop with `setattr`, NOT by simply forwarding them to the component factory!), to allow one-off instantiations with particular changes in very specific cases if they are needed. (This is essentially equivalent to setting them directly after the `popToolsAndMerge`, but may be more ergonomic.)
3. Other arguments may be added to the `...Cfg` functions on a case-by-case basis to configure specific things that are more related to how we are instantiating everything (e.g. the name for the component, whether or not we want trigger-style monitoring, etc.) than to how we want the algorithm to behave.
4. The `ActiveConfig` infrastructure is leveraged to accommodate different needs without repeating the instantiating code (as any different behaviours should be specifiable by changing the appropriate flags and, in particular cases, the arguments to the `...Cfg`).
5. The flag hierarchy should be consistent (see `Control/AthenaConfiguration/python/AllConfigFlags.py`). Full consensus and enforcement is indispensable, but my personal preference would be:
   - `flags.<System>.<Stuff>` for most things (with potential subcategories when different options for different purposes are required, e.g. the different instantiations we now see of the CPU CaloClusterMaker);
   - `flags.Trig.<System>.<Stuff>` in the rare cases where there is a specific trigger setting that relates to said `<System>.<Stuff>`, or a trigger-specific algorithm, or the like, that has no offline equivalent;
   - trigger-specific configuration of the tools/algorithms should be handled through separate sets of flags, e.g. `flags.<System>.<Stuff>.Trigger` (or the equivalent with multiple categories).
Point 5 is the one I admit may cause more contention here, and, as with everything that essentially comes down to organization and semantics, having a full consensus is more important than anything else. To make the intent of points 2 to 5 more concrete, a minimal sketch follows below.
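The following is a minimal sketch of the pattern described above, not the code from this MR: the tool class and the flag/property names (`HypotheticalGPUClusterGrowingTool`, `SeedThreshold`, `CutClustersInAbsEt`, `MeasureTimes`, the `CaloRecGPU.Default`/`CaloRecGPU.Trigger` categories) are placeholders I made up for illustration; only the standard `ComponentAccumulator`/`CompFactory`/`cloneAndReplace` machinery is assumed.

```python
from AthenaConfiguration.AthConfigFlags import AthConfigFlags
from AthenaConfiguration.ComponentAccumulator import ComponentAccumulator
from AthenaConfiguration.ComponentFactory import CompFactory


def createHypotheticalCaloRecGPUFlags():
    """Point 5: flags live under flags.<System>.<Stuff>, with per-purpose
    subcategories that ActiveConfig can later point at (names are made up)."""
    flags = AthConfigFlags()
    flags.addFlag("CaloRecGPU.Default.SeedThreshold", 4.0)
    flags.addFlag("CaloRecGPU.Default.CutClustersInAbsEt", True)
    flags.addFlag("CaloRecGPU.Trigger.SeedThreshold", 4.0)
    flags.addFlag("CaloRecGPU.Trigger.CutClustersInAbsEt", True)
    return flags


def HypotheticalGPUClusterGrowingCfg(flags, name="GPUClusterGrowing", **kwargs):
    """Points 2-4: behaviour comes from the active flag category; **kwargs are
    applied with setattr to allow one-off overrides of any other property."""
    acc = ComponentAccumulator()

    tool = CompFactory.getComp("HypotheticalGPUClusterGrowingTool")(
        name,
        SeedThreshold=flags.CaloRecGPU.ActiveConfig.SeedThreshold,
        CutClustersInAbsEt=flags.CaloRecGPU.ActiveConfig.CutClustersInAbsEt,
    )

    # Overrides are applied on the configured component, NOT forwarded to the factory.
    for key, value in kwargs.items():
        setattr(tool, key, value)

    acc.setPrivateTools(tool)
    return acc


def HypotheticalTriggerInstantiation(flags):
    """Offline vs. online: the same ...Cfg function is reused,
    only the category that ActiveConfig points at changes."""
    flags = flags.cloneAndReplace("CaloRecGPU.ActiveConfig", "CaloRecGPU.Trigger")
    acc = ComponentAccumulator()
    tool = acc.popToolsAndMerge(
        HypotheticalGPUClusterGrowingCfg(flags, MeasureTimes=True)  # one-off override
    )
    # ... the tool would then be attached to the relevant algorithm ...
    return acc
```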
In an effort to make the instantiation of the CaloRecGPU tools and algorithms work in perfect agreement with what CaloRec does, a few changes that lie outside the narrow definition of this configuration are needed:
- Prevent `CaloClusterSnapshot` from `dynamic_cast`ing its parent to a `CaloClusterMaker` to find out the final cluster container name.
  This basically meant we could not instantiate a `CaloClusterSnapshot` as a tool of our `CaloGPUHybridClusterProcessor`, to match what the CPU equivalent does in CaloRec. A possible solution would be making both `CaloClusterMaker` and `CaloGPUHybridClusterProcessor` inherit from a common (virtual) base that would provide the necessary interface, but, given that there are few places where a `CaloClusterSnapshot` is instantiated, it is much cleaner to add an option to the class and assign the correct value to it at the instantiation site. This leads to (localized) changes in `CaloClusterTopoGetter.py` and `CaloTopoClusterConfig.py`.
- Add an additional option to the CPU topological cluster growing tool to decide whether the cluster cuts are applied to the transverse energy or to its absolute value.
  This is motivated in the first place by the fact that, on the GPU, the cluster cuts are applied by a different tool than the one that does the cluster growing, so it has always been a distinct option there (especially since floating point accuracy issues meant that studying this cut was necessary at some point), and we are interested in keeping it as such for potential future tests.
  Secondly, the CPU (and by extension the GPU) cluster growing already has three distinct options to decide whether the seed, growing and terminal cells are classified using the signal-to-noise ratio or its absolute value; these options are independent and, even though it might make little physical sense, having distinct settings is a theoretical possibility of the current configuration (and something that might be interesting to consider, e.g. for debugging purposes). However, rather in contrast with this increased freedom of configuration, the option that decides whether the seed cells are classified in absolute value was recycled to also decide whether or not to apply the cluster cut in absolute value, which seems rather inconsistent to me.
  Thirdly, and following from the last point, the cluster cut in transverse energy takes into account all the cells that are part of the cluster. Conceptually, it is an entirely different operation from the seed cell classification, which only takes into account the energy and noise of a single cell (and it is theoretically possible for a cluster's overall transverse energy to have a different sign than the energy measured at its seed cell). Of course, for most purposes we are interested in using the absolute value for both the cell classification and the transverse energy cut, but in that case the three different options for the three kinds of cells should also be unified into a single "use absolute value" option. Being a more fundamental change to the algorithm, I think that is outside the scope of this MR, but it may be worth discussing in the future.
  Tagging @menke per Denis' suggestion.
  However, the addition of this option does result in quite a few changes outside the immediate scope of this MR, in the places where the option that controls the seed cell classification in absolute value is set, as the new option that controls the cluster cuts in transverse energy must agree with it. These are:
  - `Calorimeter/CaloRec/python/CaloClusterTopoEMGetters.py`
  - `DataQuality/DataQualityTools/share/HadTopoClusterMaker.py`
  - `Reconstruction/HeavyIonRec/HIJetRec/python/SubtractedCaloClusterTopoGetter.py`
  - `Reconstruction/tauRec/python/TauToolHolder.py`
  - `TileCalorimeter/TileMonitoring/python/TileTopoClusterConfig.py`
  - `TileCalorimeter/TileMonitoring/share/TileMonTopoCluster_jobOptions.py`
  I do realize some of those may be on the way to deprecation, but I am unsure whether or not to keep them up to date with our changes to prevent undesirable behaviour regressions. (They also provide a relevant argument for having the `ActiveConfig` manage all the different possible configurations instead of having tools instantiated in just slightly different ways across multiple files...) A sketch of how the new option is expected to be set alongside the existing one follows after this list.
- Add a CPU flag that allows instantiating the fully CPU configuration with GPU criteria.
  This would basically make it more convenient to compare the full configuration and instantiation between CPU and GPU in the future without any changes to the CPU code (barring the setting and use of this flag), instead of the approach used so far, which was to instantiate only the CPU tools with the appropriate options in the GPU-related code. A minimal sketch of the flag idea is also given below.
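For concreteness, here is a minimal sketch of what the touched instantiation sites are expected to look like. `SeedCutsInAbsE` and `ClusterEtorAbsEtCut` are, to the best of my reading, the existing `CaloTopoClusterMaker` properties; the name of the new option (`ClusterCutsInAbsEt` below) is a placeholder and may differ from what is actually introduced here.

```python
from AthenaConfiguration.ComponentFactory import CompFactory

# Only the properties relevant to this discussion are shown.
TopoMaker = CompFactory.CaloTopoClusterMaker(
    "TopoMaker",
    SeedCutsInAbsE=True,        # existing option: classify seed cells in |E|/noise
    ClusterEtorAbsEtCut=0.0,    # existing option: value of the cluster E_T cut (illustrative)
    ClusterCutsInAbsEt=True,    # new option (name is a guess): apply the cluster cut in |E_T|;
                                # must agree with SeedCutsInAbsE at every site listed above
)
```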
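And a minimal sketch of the CPU flag idea; the flag name, its location in the hierarchy and the affected default are placeholders, not necessarily what this MR adds.

```python
from AthenaConfiguration.AthConfigFlags import AthConfigFlags


def createHypotheticalFlags():
    flags = AthConfigFlags()
    # Hypothetical flag: when True, the plain CPU chain adopts the GPU-style criteria,
    # so CPU and GPU configurations can be compared one-to-one without touching
    # the GPU-related code.
    flags.addFlag("CaloRecGPU.UseGPUCriteriaForCPU", False)
    return flags


def adjustCPUDefaults(flags, kwargs):
    """Sketch of how a CPU ...Cfg function could react to the flag."""
    if flags.CaloRecGPU.UseGPUCriteriaForCPU:
        kwargs.setdefault("ClusterCutsInAbsEt", True)  # hypothetical GPU-style default
    return kwargs
```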
Some cleanup of older code was also undertaken:
- Fixed some errors in compilation paths that are usually not taken (related to CUDA and C++ versions).
- Removed the deprecated prototype of the GPU (non-automaton) splitting.