
Draft: Lamarr: Interface Gaussino to SQLamarr

Lucio Anderlini requested to merge landerli_lamarr into master

/cc @gcorti @adavis @kreps @mimazure @clemenci

Introduction

Lamarr is the ultra-fast simulation approach developed within LHCb. The potential of such an approach was demonstrated at ICHEP 2022. During the last year, a major rewrite of Lamarr has taken place.

Some highlights:

In this MR, we focus on the upgrade of the framework part. The Lamarr infrastructure was redesigned with the following principles/specifications in mind:

  • The pipeline obtained by enqueuing multiple parametrizations should be portable and runnable independently of Gaudi and, possibly, even of ROOT. Deploying machine-learning models in C++ applications with extremely small latency and controlled precision loss requires intense profiling and debugging based on statistical distributions. Coupling this with the complexity of the Gaudi framework was one of the major causes of delay in the development of the first-generation Lamarr parametrizations. Ideally, one wants to embed a freshly trained algorithm in a pipeline and check the results immediately. Unfortunately, the facilities used to train models can hardly be configured to access cvmfs, and even installing ROOT is sometimes tedious.
  • The exact same code running in the stand-alone pipeline should be exportable and runnable in Gaussino. Testing and profiling a pipeline only to discover that it breaks once deployed in the final environment would be frustrating, so we want a tool that runs exactly the same code either as a stand-alone pipeline or inside Gaussino.
  • Everything is a configuration. We aim at having a small code-base of performant building blocks implementing generic data transformation abstractions which can be configured, without recompiling Gaussino, to change the pipeline of parametrizations.

Technological survey for the representation of a flexible EventModel

In order not to depend on Gaudi, we need to replace the concept of EventStore with some other technology enabling the flexibility we aim at. We considered the following alternatives:

  • Apache Arrow: a columnar memory format implementing cross-table relations and accessible from both C++ and Python. Arrow seems to provide impressive performance, but it is a relatively young project and is not available on the LCG. In addition, Arrow does not define any scripting language that could ease the configuration of the parametrization pipelines, so we would have to develop our own, possibly based on Gaudi Configurables, at a significant cost in effort. Overall, Apache Arrow is a great project to follow closely, especially for its hardware acceleration capabilities, but it is not tailored to our needs;
  • DuckDB: an in-memory database system which can be installed as a header-only library, solving all the distribution issues related to Apache Arrow. It is optimized for On-Line Analytical Processing (OLAP), which makes it great for processing large batches of events, but it might be suboptimal for the single-event batches we aim at using when interfaced to Gaudi. It implements an SQL-based scripting language that would ease exporting pipelines from a ROOT-less Python environment to Gaussino. After some investigation, however, we observed that the C++ APIs are not mature enough to rely on them for a task as critical as memory management for ultra-fast simulations;
  • SQLite3: a C library implementing a thread-safe database engine, providing a full-featured SQL dialect to interact with data. SQLite3 is optimized for On-Line Transaction Processing (OLTP) workloads. This makes it less effective for processing large batches of data, where vectorization and columnar data formats play a major role, but better suited for small batches such as those represented by a single event when Lamarr runs deployed in Gaudi. Two aspects of SQLite3 that we considered carefully are the maturity and stability of the project and the commitment of its community to maintain and develop it for the next 25 years. Unfortunately, SQLite3 provides only C APIs, not C++ ones. SQLite3 also provides an effective persistency format which can be easily interfaced with both pandas (in Python environments) and ROOT (in Gaussino), see for example TSQLiteServer and TTreeSQL. Finally, SQLite3 is a standard dependency of Python, ROOT and a number of other widely adopted software projects.

All three projects discussed above are open source and released under permissive licenses.

SQLamarr

Considering the above, we chose to go with SQLite3, and named the low-level building blocks shared between Python and Gaussino SQLamarr.

SQLamarr also provides extensions to the SQL dialect of SQLite3, defining functions specific to particle physics such as the pseudorapidity.
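
As an illustration of what such an extension looks like from the user side, the sketch below registers a pseudorapidity function on a plain SQLite connection using only the Python standard library. It does not use SQLamarr: the SQL function name `pseudorapidity`, its three-argument signature and the table layout are assumptions made for this example.

```python
import math
import sqlite3


def pseudorapidity(px, py, pz):
    """Return eta = atanh(pz / |p|), or None (SQL NULL) when undefined."""
    p = math.sqrt(px * px + py * py + pz * pz)
    if p == 0.0 or abs(pz) == p:
        return None  # zero momentum, or exactly along the beam axis
    return math.atanh(pz / p)


con = sqlite3.connect(":memory:")
# Hypothetical name and arity, chosen for this example only
con.create_function("pseudorapidity", 3, pseudorapidity)

con.execute("CREATE TABLE MCParticles (px REAL, py REAL, pz REAL)")
con.executemany(
    "INSERT INTO MCParticles VALUES (?, ?, ?)",
    [(1.0, 0.0, 0.0), (3.0, 4.0, 5.0)],
)

# The custom function is now usable inside any query on this connection
etas = [
    row[0]
    for row in con.execute(
        "SELECT pseudorapidity(px, py, pz) FROM MCParticles ORDER BY rowid"
    )
]
```

Once registered, the function can appear in any SELECT, which is what allows physics quantities to be computed in the pipeline configuration rather than in compiled code.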

C-compatible parametrizations, compiled as shared objects, are dynamically linked to SQLamarr through the POSIX dlfcn.h interface. The path to the compiled parametrization and the linking symbols are passed as part of the configuration of the Plugin and GenerativePlugin building blocks. C-compatible parametrizations for a subset of trained machine-learning models can be easily obtained with the scikinC package. An alternative package (not maintained by us, though) is keras2c. More complex models may require writing custom C implementations, as done for example for the tracking parametrizations.

Python bindings based on the standard Python ctypes module (hence without any additional dependency on Python, nor any dependency of the Python application on non-standard packages) are also provided as part of the library. They are all collected in a single object that can be removed from the CMakeLists.txt file to avoid exposing the related C linking symbols.

SQLamarr as a dependency for Gaussino: a temporary solution

The dependency of Gaussino on SQLamarr is currently implemented as a Git submodule: a link to the SQLamarr repository is placed in Sim/Lamarr/SQLamarr and is populated by the CMake configuration step, which pulls a tagged version of SQLamarr. In this phase of the development, this approach eases the extremely frequent updates of SQLamarr needed to follow the developments (and bugfix requests) originating from the work on the interface between SQLamarr and Gaudi, in Gaussino. Before merging, one may consider replacing the SQLamarr submodule with a hard copy of the SQLamarr repository.

To populate the submodule at configuration time, we propose to modify:

  • CMakeLists.txt, adding the lines
    # Update Git submodules used for LHCb-first, still external, dependencies
    gaussino_update_submodules()
  • cmake/GaussinoConfigUtils.cmake, appending the function
    function(gaussino_update_submodules)
      ## Adapted from "An Introduction to Modern CMake": https://cliutils.gitlab.io/modern-cmake/chapters/basics/programs.html
      message("Ensure git submodules are available and initialized")
      find_package(Git QUIET)
    
      if(GIT_FOUND AND EXISTS "${PROJECT_SOURCE_DIR}/.git")
        execute_process(COMMAND ${GIT_EXECUTABLE} submodule update --init --recursive
          WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
          RESULT_VARIABLE GIT_SUBMOD_RESULT)
        if(NOT GIT_SUBMOD_RESULT EQUAL "0")
          message(FATAL_ERROR "git submodule update --init --recursive failed with ${GIT_SUBMOD_RESULT}, please checkout submodules")
        endif()
      endif()
    endfunction()

At configuration time, CMake ensures that the source files specific to the stand-alone version of SQLamarr are not present in the SQLamarr module, as defined in Sim/Lamarr/CMakeLists.txt.

## Configuration of SQLamarr package for being integrated in Gaussino:
## 1. Make CMake aware of SQLite header files
set(LAMARR_INCLUDE_DIR ${PROJECT_SOURCE_DIR}/Sim/Lamarr/SQLamarr)
include_directories(${LAMARR_INCLUDE_DIR}/include)

## 2. Remove stand-alone specific source files
file(REMOVE
  SQLamarr/CMakeLists.txt
  SQLamarr/src/main.cpp
  SQLamarr/src/python_bindings.cpp
)

Pipeline configuration in XML

When we started designing SQLamarr, the idea was to configure the building blocks through the Gaudi configuration system. Unfortunately, it turned out that while Python dictionaries of strings are properly interpreted by Gaudi as std::map<std::string, std::string>, Python lists of dictionaries cannot be interpreted as std::vector<std::map<std::string, std::string>>. To work around this limitation, we pass the pipeline configuration as a single string, serialized in XML format.

As you may imagine, in 2023 XML is not the first format that comes to mind for serializing strings and numbers. However, XML parsing is provided by ROOT's TXMLEngine without any additional dependency for Gaussino.

The serialization of the pipeline configuration is defined in the PyLamarr package. The de-serialization of the XML configuration is defined in this MR, in files Sim/Lamarr/src/Components/ConfigurationParser.{h,cpp}.
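
To make the workaround concrete, here is a minimal round trip of a list of string dictionaries through a single XML string, written with the Python standard library. It is neither the PyLamarr serializer nor the schema read by ConfigurationParser: the tag names (`pipeline`, `step`, `config`) and the dictionary keys are invented for this example.

```python
import xml.etree.ElementTree as ET


def pipeline_to_xml(steps):
    """Serialize a list of {str: str} dictionaries into one XML string."""
    root = ET.Element("pipeline")  # tag names are placeholders
    for step in steps:
        node = ET.SubElement(root, "step")
        for key, value in step.items():
            ET.SubElement(node, "config", name=key).text = value
    return ET.tostring(root, encoding="unicode")


def pipeline_from_xml(xml_string):
    """Inverse of pipeline_to_xml: rebuild the list of dictionaries."""
    root = ET.fromstring(xml_string)
    return [
        {cfg.get("name"): cfg.text or "" for cfg in step.findall("config")}
        for step in root.findall("step")
    ]


# Hypothetical pipeline: keys and values are illustrative only
steps = [
    {"type": "Plugin", "library": "/path/to/models.so"},
    {"type": "GenerativePlugin", "query": "SELECT px < 1 FROM MCParticles"},
]
xml_string = pipeline_to_xml(steps)
```

Because the whole pipeline travels as one string, it fits in a plain string property on the Gaudi side. Characters that are special in XML, such as the `<` in the SQL query above, are escaped on serialization and restored on parsing.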

On future developments in this area

We have not explored the alternative of having a different GaudiAlgorithm per SQLamarr building block because that would require a more aggressive modification of the GaussinoSimulation configuration, but it can be explored in the future to get rid of XML files.

Also, we are currently re-configuring the pipeline (including parsing the XML string) for each event. The overhead seems under control, but it is still wasteful. Avoiding re-configuration would require some clever thread-safe caching.

Performance tuning is not for this MR, though.

Random number generation and seeding

Several parametrizations defined in SQLamarr require random number generators. A differently-seeded PRNG is associated with each DB connection, which in the current implementation means with each event. By default, SQLamarr relies on the std::ranlux48 generator provided by the C++ standard library.

Following the example of the Particle Gun generator, the random number generator is seeded with the Cantor pairing of the run and event numbers. Only the lower 32 bits are used for seeding, while the upper 32 bits are discarded.
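
The seeding rule can be written down in a few lines. The sketch below uses the textbook definition of the Cantor pairing function; the order of the two arguments (run number first, event number second) and the function names are assumptions, not taken from the Gaussino code.

```python
def cantor_pairing(a, b):
    """Map a pair of non-negative integers to a single unique integer."""
    return (a + b) * (a + b + 1) // 2 + b


def seed_from_event(run_number, event_number):
    """Keep only the lower 32 bits of the paired value, as described above."""
    # Argument order is an assumption made for this example
    return cantor_pairing(run_number, event_number) & 0xFFFFFFFF


small = seed_from_event(1, 1)            # pairing fits in 32 bits
large = seed_from_event(100000, 100000)  # pairing exceeds 32 bits, gets truncated
```

The pairing itself is unique for each (run, event) pair, but the truncation is not: two pairs whose paired values differ by a multiple of 2^32 end up with the same seed.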

PyLamarr

PyLamarr is a pure-Python project designed to configure pipelines. If SQLamarr is properly installed, PyLamarr can also execute the pipelines in stand-alone mode and export the pipeline configuration to XML.

The documentation of PyLamarr is incomplete: the package is still evolving too quickly for this to be a primary concern.

Remote resources

To avoid relying on cvmfs for distributing the parametrizations through data packages, PyLamarr defines parametrization files as RemoteResources, which are downloaded on demand and cached locally. TODO: when exporting the pipeline, the path of the locally cached dependencies gets hardcoded in the XML configuration. While effective for local tests, this would break Lamarr when running in Gaussino in any context where PyLamarr was not run before. We need a better RemoteResource resolution mechanism that allows defining both the cvmfs and the https resource locations and falls back on https only when cvmfs is not available.
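
One possible shape for the resolution mechanism in the TODO is sketched below. This is not PyLamarr code: the function `resolve_resource`, its parameters and the example URL are hypothetical, and the downloader is injected so that the example runs without network access.

```python
import os
import tempfile
import urllib.request


def resolve_resource(cvmfs_path, url, cache_dir,
                     download=urllib.request.urlretrieve):
    """Return a local path, preferring cvmfs over the https location."""
    if os.path.exists(cvmfs_path):
        return cvmfs_path  # shared file system available: nothing to download
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(cached):
        download(url, cached)  # fetched once, then reused from the cache
    return cached


# Usage with a stand-in downloader, so that no network access is needed
calls = []


def fake_download(url, destination):
    calls.append(url)
    with open(destination, "w") as handle:
        handle.write("model")


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "cvmfs", "model.so")  # simulates a cvmfs miss
    url = "https://example.org/model.so"              # placeholder URL
    first = resolve_resource(missing, url, tmp, fake_download)
    second = resolve_resource(missing, url, tmp, fake_download)
    used_cache = first == second and len(calls) == 1
```

With a scheme of this kind, the exported configuration could carry both locations and leave the choice to the machine that runs the pipeline, instead of hardcoding a path that exists only where PyLamarr was executed.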

LamarrCollector and LamarrTuple

To ease the debugging and validation of the Lamarr integration with Gaussino, the concepts of MCCollector and MCTuple were ported to Sim/Lamarr/src/Components/LamarrCollector.cpp and Sim/Lamarr/src/Components/LamarrTuple.{h,cpp}.

The LamarrCollector algorithm is configured through a Tables property that expects a mapping of strings to strings, where each key represents the name of an output TTree stored in the LamarrCollector TDirectory and each value is a fully-featured SQLite query selecting columns. The names of the columns returned by the SELECT query associated with each table define the names of the branches. Branch types are limited to int, double and text. Missing values (represented in SQLite as NULL) are converted to an int error code configured with the ErrorCode property. All TTrees feature a batch_id column holding a unique identifier obtained by Cantor-pairing the run number and the event number. A batch is conceptually different from an event, in that it is the set of events on which the foreign keys representing relations between different tables are valid; however, to cope with Gaudi's hard decoupling of events, single-event batches are adopted.

To enable running in multithreaded mode, we introduced a std::mutex in the LamarrCollector. This might be interesting for MCCollector as well (@mimazure)?

Example.

The following snippet configures LamarrCollector to generate an nTuple with two TTrees, named MCParticles and Events. The MCParticles TTree will feature four branches: px, py and pz as obtained from the query, plus the batch_id branch added by default to each TTree. Similarly, the Events TTree will feature three branches: run_number, event_number and batch_id. Note the use of the AS keyword to decouple the naming scheme of the SQL table from that of the TTree.

 from Configurables import LamarrCollector
 lamarr_collector = LamarrCollector(
   Tables = {
    'MCParticles': "SELECT px, py, pz FROM MCParticles",
    'Events': "SELECT run_number, evt_number AS event_number FROM DataSources",
   }
 )
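
The mapping from a query to branches can be mimicked outside Gaudi with the Python standard library. The sketch below is not the LamarrCollector implementation: the helper `read_table`, the error-code value and the batch identifier are placeholders, and plain tuples stand in for TTree entries.

```python
import sqlite3

ERROR_CODE = -999  # placeholder for the value of the ErrorCode property


def read_table(con, query, batch_id, error_code=ERROR_CODE):
    """Run a query and return (branch names, rows) with batch_id appended."""
    cursor = con.execute(query)
    # Column names (after any AS renaming) become the branch names
    branches = [column[0] for column in cursor.description] + ["batch_id"]
    rows = [
        # NULL values arrive as None and are replaced by the error code
        tuple(error_code if value is None else value for value in row)
        + (batch_id,)
        for row in cursor
    ]
    return branches, rows


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DataSources (run_number INTEGER, evt_number INTEGER)")
con.execute("INSERT INTO DataSources VALUES (1, NULL)")

branches, rows = read_table(
    con,
    "SELECT run_number, evt_number AS event_number FROM DataSources",
    batch_id=4,
)
```

Here the second branch is named event_number rather than evt_number because of the AS clause, and the missing event number is stored as the error code.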

GaussinoSimulation Configurable and Simulation backend

We modified the GaussinoSimulation() ConfigurableUser, introducing the Backend slot. By default, Backend is set to GaussinoSimulation.backend.GEANT, which configures a standard Simulation phase relying on Geant4 and on the Gaussino Geometry configuration. Setting GaussinoSimulation().Backend = GaussinoSimulation.backend.LAMARR, instead, makes the Simulation phase disable Geant (as for Generation-only configurations) and the Geometry configuration, and ensures that Lamarr is part of the execution sequence.

The configuration of "Lamarr instead of Geant" in the Simulation phase reflects the configuration of "ParticleGun instead of Pythia" in the generation phase.

In practice, we modified the __apply_configuration__ entry point to disable the geometry and call the _configure_lamarr() method if the backend is set to "LAMARR".

    def __apply_configuration__(self):
        if GaussinoSimulation.only_generation_phase:
            log.debug("-> Only the generation phase, skipping.")
        elif self.getProp("Backend") == self.backend.LAMARR:
            log.debug("-> Parametric simulation, skipping Geant config")
            print ("Disabling Gaussino Geometry")
            GaussinoGeometry().only_generation_phase = True
            self._configure_lamarr()
        else:
            self._set_giga_service()
            self._set_giga_alg()
            self._set_physics()
            self._set_truth_actions()

            # ensure GaussinoGeometry is enabled
            GaussinoGeometry()

The _configure_lamarr method is summarized below.

    def _configure_lamarr(self):
        from Configurables import Lamarr, ApplicationMgr

        if "Lamarr" not in Lamarr.configurables:
            msg = (
                "The simulation backend is set to use Lamarr, but no "
                "Lamarr() configurable was registered! Make sure to include "
                "all the required tools!"
            )
            log.error(msg)
            raise AttributeError(msg)

        ApplicationMgr().TopAlg += [Lamarr()]

Output format

The names of the tables in the SQLite database representing the event model, and of their columns, are completely defined by the pipeline configuration. This results in a highly non-standard database definition scheme that may make the data converters from the SQLite data format to the experiment Event Model very difficult to maintain.

We should consider adding a last step to the pipeline that remaps the tables obtained from the parametrizations into other tables defined according to EDM4hep. This would imply modifications at configuration level (as of today, in the pipeline.xml file) but no modification to the C++ code, and would make our lives easier when moving to Gauss to implement data converters.
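
Such a remapping step needs nothing more than SQL. The sketch below shows the idea on a plain SQLite connection from Python; the view name and the target column names are placeholders chosen for illustration and are not taken from the EDM4hep schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MCParticles (px REAL, py REAL, pz REAL)")
con.execute("INSERT INTO MCParticles VALUES (0.1, 0.2, 30.0)")

# Last pipeline step: expose the same data under a different naming scheme.
# Target view and column names are placeholders, not the EDM4hep schema.
con.execute(
    """
    CREATE TEMPORARY VIEW remapped_particles AS
    SELECT px AS momentum_x, py AS momentum_y, pz AS momentum_z
    FROM MCParticles
    """
)

cursor = con.execute("SELECT * FROM remapped_particles")
columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()
```

A converter written against the remapped names would then be insulated from changes in the tables produced by individual parametrizations.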

We do not aim at a conversion to EDM DataModel for this MR, though.

Test runs (success is "it compiles and does not crash")

Successful tests so far:

  • 100 events with ParticleGun and Lamarr in Single Thread mode
  • 100 events with Pythia in single thread mode, Minimum Bias
  • 100 000 events with Pythia8 and Lamarr in 32 Threads (Pythia in thread-local mode, or GaussinoGeneration().ProductionTool="Pythia8ProductionMT"), Minimum Bias
  • Pythia8 and EvtGen producing some b hadrons
  • ParticleGun and EvtGen producing some b hadrons

Validation runs (success is "numerical results are not obviously wrong")

  • TODO

Unit tests

  • TODO

Updated documentation

We drafted a docs/examples/lamarr.md tutorial on running Lamarr. The plan is to expand it to include a tutorial on creating custom parametrizations.

Preview: https://gaussino.docs.cern.ch/landerli_lamarr/examples/lamarr.html

TL;DR

  • Edit the GaussinoSimulation configuration system to enable switching from Geant to Lamarr as "Simulation Backend";
  • Implement a first draft of Lamarr, interfacing Gaussino to SQLamarr;
  • Discuss and possibly abandon the Git submodule mechanism to manage the dependency of Gaussino on SQLamarr;
  • Develop a configuration system enabling exporting and importing pipelines from Python to Gaussino;
  • Consider better solutions for the configuration system not relying on XML;
  • Extend the pipeline to include tables according to the EDM4hep schema;
  • Run a first batch of tests, including multithreading, with Pythia and PGuns;
  • Run validation runs with generators producing heavy hadrons (and including EvtGen);
  • Extend the resolution-mechanism of PyLamarr RemoteResources to download resources from https only upon misses on cvmfs;
  • Improve the documentation of PyLamarr;
  • Draft a tutorial for using Lamarr-on-Gaussino;
  • Draft a tutorial on how to create a custom (simple) parametrization;
  • Add unit tests
Edited by Lucio Anderlini
