Daniel Hugo Campora Perez requested to merge dcampora_nnolte_multi_evt_scheduler_v2 into master Aug 13, 2020

A multi-event-scheduler

What's that

This is a scheduler can execute a control and dataflow configuration, similarly to HLTControlFlowMgr in LHCb.
(see https://iopscience.iop.org/article/10.1088/1742-6596/1525/1/012052/pdf if you are not familiar)

Sharing functionality

Much of the functionality that we introduce here is already existent in Moore and the CPU HLT. Specifically how control and dataflow is set up in the configuration: CompositeNodes define control flow constraints and data flow constraints are defined in the background by requiring that a producer of data runs before its consumer. We have ripped out Moore/PyConf functionality for that (MiniPyConf here), specifically the data_flow and components modules.

Overall, Allen configuration looks similar to Moore configuration with this setup.

How does it work

From the CompositeNode tree that defines the application, we first extract execution masks for all algorithms. Consider a simple tree: You would like to run algorithms A,B,C in a lazy fashion, with a connecting OR. The masks that the scheduler extracts from this tree are given in the Leafs. B only has to run in case A did not pass, and C only if B and A did not pass.
For algorithms that appear multiple times, we merge the execution lists with an ANY relationship
we simplify the boolean mask expressions using a solver (sympy.simplify)
we gather data and control flow dependencies:
- data dependencies are found by backtracing inputs (PyConf feature)
- control flow dependencies are extracted by parsing the simplified boolean expression (from 3.) back into control flow trees and extracting the algorithms.
we order algorithms
- data dependencies serve as constraints
- control flow dependencies serve as soft constraints
- we might find ourselves in a position where data & control flow dependencies cannot all be accounted for, in which case we loosen the control flow constraints by loosening the execution mask for one of the algorithms that is insertable according to data dependencies. Generic mask loosening is done by substituting an algorithm in a mask by True or False and then simplifying again. Example: (A & B) -> loosen by B -> (A & True) | (A & False) -> A
we receive an ordered collection of algorithms with their respective execution masks
For every unique, nontrivial execution mask we build a combiner algorithm that is inserted in the sequence right before the first algorithm with that execution mask
The allen executable sequence is generated and compiled
Execution works as follows:
- Algorithms with execution mask True are executed on every event
- Execution is governed by event lists: Algorithms execute on every event in the event list that they get as input
- Algorithms that can reduce the event list, like the GEC, export one as output, which is then consumed by algorithms with GEC as execution mask
- For more complicated masks, like GEC & BLUB, the combiner algorithms take care of event list union(OR) / intersection(AND) / inversion(NOT) before the algorithm actually executes

Algorithm order optimizations

With every control flow tree there are multiple possible orderings that the scheduler might consider, and the throughput depend on these orders.

There are two types of order swaps that one might consider: A lazy control flow node that does not require a specifc order is one where swapping orders yields different execution masks for different algorithms. As simple heuristic, one can assume that more expensive algorithms are better associated with sparse execution masks.

Defining how expensive an algorithm is and how sparse an execution mask is, is a highly non-trivial task. In fact, doing so perfectly requires actually running the application in all possible orders, which is something we would like to avoid. Currently, we hardcode educated guesses for the weight of an algorithm execution. Some weights are taken from a profile run. Some other heuristics help in setting weights that result in acceptable sequences, like the fact that data providers should be spread out as far as possible to not create io bottlenecks.

In Summary, trying to model accurate execution weights and average efficiencies for algorithms in this heterogenous architechture seems like a bad idea. Instead, it might make more sense to employ optimization algorithms that operate over possible orderings and test each ordering with a quick benchmark, automatically. We expect this procedure to take a long time to complete, but maybe we don't have to optimize on such a high level for too many configurations. Ideas include genetic algorithms or simulated annealing. (none of these are implemented yet, but thats not the scope of this MR anyway)

Built on top of !393 (merged)

TODO:

Some more tests for the python core functionality, specifically cftree_ops.py and event_list_utils.py
Merge minipyconf and pyconf (done in LHCb!2964 (merged))
Cleanup of the physics configurations (part of this MR)
Check that every host_datatype is only assigned to host_datatypes, and that every device_datatype is only assigned to device_datatypes.
Update documentation.
Once LHCb!2964 (merged) is merged, change GenerateConfiguration.cmake to use HEAD instead of a branch of LHCb.

List of changes

One general remark: this MR changes the code generation steps of Allen, but otherwise it does minimal changes to the rest of the codebase. That means that for the most part, with the exception of the introduction of MASK_ types and the event list intersection, union and inversion, all headers / sources are not modified.

Here is a list of requirements and changes introduced to Allen as part of this MR:

Pregenerated sequences are removed. Python 3 and libClang are now requirements.
git is required in STANDALONE to be able to fetch PyConf from LHCb.
The option SEQUENCE_GENERATION is therefore gone as well, and so is its CI job.
The obsolete gaudi configurations and previous configurations are gone.
The following directory structure has been created: AllenConf contains an "extension" to PyConf to enable Multi Event Scheduling, sequences contains all sequences, sequences/definitions contains definition files used by the sequences, and tests includes MES checks.
The following configurations exist and can be therefore passed to the cmake SEQUENCE option:

|-- sequences
|   |-- forward.py
|   |-- hlt1_complex_validation.py
|   |-- hlt1_pp_default.py
|   |-- hlt1_pp_no_gec.py
|   |-- hlt1_pp_no_gec_validation.py
|   |-- hlt1_pp_non-restricted_UT.py
|   |-- hlt1_pp_scifi_v6.py
|   |-- hlt1_pp_scifi_v6_validation.py
|   |-- hlt1_pp_validation.py
|   |-- muon.py
|   |-- pv.py
|   |-- velo.py
|   `-- veloUT.py

sequences/definitions files have been refactored heavily. Files now identify with subdetector reconstructions (eg. velo_reconstruction.py, ut_reconstruction.py), there are different files for lines (eg. hlt1_technical_lines.py, hlt1_muon_lines.py), validators, persistency, and so on.
A number of python tests have been added to run as part of the Allen CI to test the MES functionality.
A complex sequence test that runs various instances of reconstruction algorithms has been added.
Combiner algorithms event_list_intersection, event_list_union and event_list_inversion have been added.
All event list arguments have become MASK_INPUT or MASK_OUTPUT. There can be at most a single MASK_INPUT and a single MASK_OUTPUT parameter per algorithm, which are used internally by the MES and don't need to be configured as part of the algorithms.

Should be merged after LHCb!2964 (merged) (and should be tested with it too).

Edited Aug 31, 2021 by Giacomo La Scala

Multi-event scheduler