Draft: Lamarr: Interface Gaussino to SQLamarr
/cc @gcorti @adavis @kreps @mimazure @clemenci
Introduction
Lamarr is the ultra-fast simulation approach developed within LHCb. The potential of such an approach was demonstrated at ICHEP 2022. During the last year, a major rewrite of Lamarr has taken place.
Some highlights:
- new methods for parametrizing the RICH response with explicit Lipschitz constraints on the GAN discriminator (SimDev presentation)
- new tracking parametrizations (Live docs)
- distributed training on multiple Cloud providers with Hopaas
- studies on how to face the particle-to-particle correlation problem, especially relevant for ECAL
In this MR, we focus on the upgrade of the framework part. The Lamarr infrastructure was redesigned with the following principles/specifications in mind:
- The pipeline obtained by enqueueing multiple parametrizations should be portable and runnable independently of Gaudi and, possibly, even of ROOT. Deploying machine-learning models in C++ applications, with extremely small latency and with controlled precision loss, requires intense profiling and debugging based on statistical distributions. Coupling this with the complexity of the Gaudi framework was one of the major causes of delays in the development of the first-generation Lamarr parametrizations. Ideally, one may want to embed a freshly trained algorithm in a pipeline and check the results immediately. Unfortunately, the facilities used to train models can hardly be configured to access cvmfs, and even installing ROOT is sometimes tedious.
- The exact same code running in the stand-alone pipeline should be exportable and runnable in Gaussino. Clearly, testing and profiling a pipeline only to discover it breaks once deployed in the final environment would be frustrating, so we want a tool that can run exactly the same code either as a stand-alone pipeline or inside Gaussino.
- Everything is a configuration. We aim at having a small code-base of performant building blocks implementing generic data transformation abstractions which can be configured, without recompiling Gaussino, to change the pipeline of parametrizations.
Technological survey for the representation of a flexible EventModel
In order not to depend on Gaudi, we need to replace the concept of EventStore with some other technology enabling the flexibility we aim at. We considered the following alternatives:
- Apache Arrow: a columnar memory format implementing cross-table relations and accessible from both C++ and Python. Arrow seems to provide impressive performance, but it is a relatively young project and is not available on the LCG. In addition, Arrow does not define any scripting language that could ease the configuration of the parametrization pipelines; we would have to develop our own, possibly based on Gaudi Configurables, at a significant additional effort. Overall, Apache Arrow seems a great project to follow with attention, especially for its hardware-acceleration capabilities, but it is not tailored to our needs;
- DuckDB: an in-memory database system that can be installed as a header-only library, solving all the distribution issues affecting Apache Arrow. It is optimized for On-Line Analytical Processing (OLAP), which makes it great for processing large batches of events, but might be suboptimal for the single-event batches we aim at using when interfaced to Gaudi. It implements an SQL-based scripting language that would ease exporting pipelines from a Python, ROOT-less environment to Gaussino. After some investigation, however, we observed that its C++ APIs are not mature enough to rely on for a task as critical as memory management in ultra-fast simulation;
- SQLite3: a C library implementing a thread-safe database engine, providing a full-featured SQL dialect to interact with data. SQLite3 is optimized for On-Line Transaction Processing (OLTP) workloads, which makes it less effective for processing large batches of data, where vectorization and columnar data formats play a more important role, but better suited for small batches such as those represented by a single event when Lamarr runs deployed in Gaudi. Two aspects of SQLite3 that we considered carefully are the maturity and stability of the project and the commitment of its community to maintain and develop it for the next 25 years. Unfortunately, SQLite3 does not provide C++ APIs, only C APIs. SQLite3 also provides an effective persistency format that can easily be interfaced with both pandas (in Python environments) and ROOT (in Gaussino), see for example TSQLiteServer and TTreeSQL. Also, SQLite3 is a standard dependency of Python, ROOT and a number of other widely adopted software projects.
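As an illustration of this interoperability, a SQLite file produced by the pipeline can be opened directly from Python; the table and column names below are hypothetical, chosen only for the sketch:

```python
import sqlite3

import pandas as pd

# Build a toy in-memory database mimicking a (hypothetical) Lamarr event table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MCParticles (px REAL, py REAL, pz REAL)")
con.executemany(
    "INSERT INTO MCParticles VALUES (?, ?, ?)",
    [(0.1, 0.2, 1.5), (0.3, -0.1, 2.0)],
)

# pandas reads the table with no dependency beyond the standard sqlite3 module.
df = pd.read_sql("SELECT * FROM MCParticles", con)
print(len(df))  # 2
```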
The three projects discussed above are open source and released under permissive licenses.
SQLamarr
Considering the above, we chose SQLite3, and named the low-level building blocks shared between Python and Gaussino SQLamarr.
SQLamarr also provides extensions to the SQL dialect of SQLite3, defining functions specific to particle physics such as pseudorapidity.
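The mechanism is the same as SQLite's standard user-defined-function registration. A minimal sketch in Python follows; the actual SQLamarr extension is written in C, and the function name here is an assumption:

```python
import math
import sqlite3


def pseudorapidity(px: float, py: float, pz: float) -> float:
    """eta = atanh(pz/|p|), computed as 0.5 * log((p + pz) / (p - pz))."""
    p = math.sqrt(px**2 + py**2 + pz**2)
    return 0.5 * math.log((p + pz) / (p - pz))


con = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL queries.
con.create_function("pseudorapidity", 3, pseudorapidity)
(eta,) = con.execute("SELECT pseudorapidity(1.0, 0.0, 1.0)").fetchone()
```

For a particle at 45 degrees to the beam axis (px = pz, py = 0) this yields eta = asinh(1), about 0.881.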
Compatible-C parametrizations, compiled as shared objects, are dynamically linked to SQLamarr via the standard C dlfcn.h interface. The path to the compiled parametrization and the linking symbols are passed as part of the configuration of the Plugin and GenerativePlugin building blocks.
Compatible-C parametrizations for a subset of trained machine-learning models can be easily obtained with the scikinC package. An alternative package (not maintained by us, though) is keras2c. More complex models may require writing custom C implementations, as was done for example for the tracking parametrizations.
Python bindings based on the standard Python ctypes module (hence without any additional dependency for Python, nor any dependency of the Python application on non-standard packages) are also provided as part of the library, but they are all collected in a single object that can be removed from the CMakeLists.txt file to avoid exposing the related C linking symbols.
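For reference, the ctypes pattern underlying such bindings looks roughly like the following. The SQLamarr library path and symbol names are hypothetical (commented out), so the runnable part of the sketch calls a C standard-library function through the same mechanism instead:

```python
import ctypes

# Hypothetical: load the SQLamarr shared object and declare one entry point.
# lib = ctypes.CDLL("libSQLamarr.so")
# lib.execute_pipeline.argtypes = [ctypes.c_char_p]
# lib.execute_pipeline.restype = ctypes.c_int

# The same mechanism works for any C shared object; as a runnable example,
# bind abs() from the already-loaded C runtime.
libc = ctypes.CDLL(None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int
print(libc.abs(-7))  # 7
```

Declaring `argtypes` and `restype` explicitly is what makes the call safe across the Python/C boundary; the SQLamarr bindings follow the same discipline.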
SQLamarr as a dependency for Gaussino: a temporary solution
The dependency of Gaussino on SQLamarr is currently implemented as a Git submodule: a symbolic link to the SQLamarr repository is placed in Sim/Lamarr/SQLamarr and gets populated during the CMake configuration step by pulling a tagged version of SQLamarr.
In this phase of the development, this approach eases the extremely frequent updates of SQLamarr needed to follow the development (and bug-fix requests) arising from the interface between SQLamarr and Gaudi, in Gaussino. Before merging, one may consider replacing the SQLamarr submodule with a hard copy of the SQLamarr repository.
To achieve this, we propose to modify:
- CMakeLists.txt, adding the lines

```cmake
# Update Git submodules used for LHCb-first, still external, dependencies
gaussino_update_submodules()
```

- cmake/GaussinoConfigUtils.cmake, appending the function

```cmake
function(gaussino_update_submodules)
  ## Taken from CMake documentation:
  ## https://cliutils.gitlab.io/modern-cmake/chapters/basics/programs.html
  message("Ensure git submodules are available and initialized")
  find_package(Git QUIET)
  if(GIT_FOUND AND EXISTS "${PROJECT_SOURCE_DIR}/.git")
    execute_process(
      COMMAND ${GIT_EXECUTABLE} submodule update --init --recursive
      WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
      RESULT_VARIABLE GIT_SUBMOD_RESULT
    )
    if(NOT GIT_SUBMOD_RESULT EQUAL "0")
      message(FATAL_ERROR "git submodule update --init --recursive failed with ${GIT_SUBMOD_RESULT}, please checkout submodules")
    endif()
  endif()
endfunction()
```
At configuration time, CMake ensures that the source files specific to the stand-alone version of SQLamarr are not present in the SQLamarr module, as defined in Sim/Lamarr/CMakeLists.txt:

```cmake
## Configuration of SQLamarr package for being integrated in Gaussino:
## 1. Make CMake aware of SQLite header files
set(LAMARR_INCLUDE_DIR ${PROJECT_SOURCE_DIR}/Sim/Lamarr/SQLamarr)
include_directories(${LAMARR_INCLUDE_DIR}/include)

## 2. Remove stand-alone specific source files
file(REMOVE
  SQLamarr/CMakeLists.txt
  SQLamarr/src/main.cpp
  SQLamarr/src/python_bindings.cpp
)
```
Pipeline configuration in XML
When we started designing SQLamarr, the idea was to configure the building blocks through the Gaudi configuration system.
Unfortunately, it turned out that while Python dictionaries of strings are properly interpreted by Gaudi as std::map<std::string, std::string>, Python lists of dictionaries cannot be interpreted as std::vector<std::map<std::string, std::string>>.
To work around this limitation, we pass the pipeline configuration as a single string, serialized in XML format.
As you may imagine, in 2023 XML is not the first idea one has when it comes to serializing strings and numbers. However, XML parsing is provided by ROOT's TXMLEngine without any additional dependency for Gaussino.
The serialization of the pipeline configuration is defined in the PyLamarr package. The de-serialization of the XML configuration is defined in this MR, in the files Sim/Lamarr/src/Components/ConfigurationParser.{h,cpp}.
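The serialization idea can be sketched with the Python standard library; the tag and attribute names below are hypothetical, the real schema being defined in PyLamarr:

```python
import xml.etree.ElementTree as ET

# A pipeline as a Python list of dicts of strings: exactly the structure that
# Gaudi cannot pass as std::vector<std::map<std::string, std::string>>.
pipeline = [
    {"type": "Plugin", "library": "libTracking.so", "symbol": "track_eff"},
    {"type": "GenerativePlugin", "library": "libRich.so", "symbol": "rich_gan"},
]

# Serialize to a single XML string, which Gaudi handles fine as a std::string.
root = ET.Element("pipeline")
for step in pipeline:
    ET.SubElement(root, "transformer", attrib=step)
xml_string = ET.tostring(root, encoding="unicode")

# Round trip: the C++ side (ConfigurationParser) recovers the list of maps.
recovered = [dict(element.attrib) for element in ET.fromstring(xml_string)]
assert recovered == pipeline
```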
On future developments in this area
We have not explored the alternative of having a different GaudiAlgorithm per SQLamarr building block, because that would require a more aggressive modification of the GaussinoSimulation configuration, but it can be explored in the future to get rid of XML files.
Also, we currently re-configure the pipeline (including parsing the XML string) for each event. The overhead seems under control, but it is still wasteful. To avoid re-configuration, some clever thread-safe caching would be needed.
Performance tuning is not for this MR, though.
Random number generation and seeding
Several parametrizations defined in SQLamarr require random number generators.
A differently-seeded PRNG is associated with each DB connection, which in the current implementation means with each event.
By default, SQLamarr relies on the std::ranlux48 generator provided by the C++ STL.
Following the example of the Particle Gun generator, the random number generator is seeded with the Cantor pairing of the run and event numbers. Only the lower 32 bits are used for seeding, while the upper 32 bits are discarded.
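For reference, the seeding scheme can be written as follows (a sketch; the production code is C++):

```python
def cantor_pair(run_number: int, event_number: int) -> int:
    """Cantor pairing: maps each (run, event) pair to a unique integer."""
    s = run_number + event_number
    return s * (s + 1) // 2 + event_number


def prng_seed(run_number: int, event_number: int) -> int:
    """Keep only the lower 32 bits of the pairing, discarding the rest."""
    return cantor_pair(run_number, event_number) & 0xFFFFFFFF


# Distinct (run, event) pairs yield distinct pairings (before truncation),
# so every event gets an independent, reproducible seed.
assert cantor_pair(1, 2) != cantor_pair(2, 1)
```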
PyLamarr
PyLamarr is a pure-Python project designed to configure pipelines.
If SQLamarr is properly installed, PyLamarr can also execute the pipelines in stand-alone mode.
PyLamarr can also export the pipeline configuration to XML.
The documentation of PyLamarr is incomplete. The package is still evolving too rapidly for this to be a primary concern.
Remote resources
To avoid relying on cvmfs for distributing the parametrizations through data packages, PyLamarr defines parametrization files as RemoteResources, which are downloaded on demand and cached locally.
TODO: when exporting the pipeline, the path of the locally cached dependencies gets hardcoded in the XML configuration. While effective for local tests, this would break Lamarr when running in Gaussino in any context where PyLamarr was not run beforehand.
We need a better RemoteResource resolution mechanism that allows defining both the cvmfs and the https resource locations, falling back on https only if cvmfs is not available.
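A possible shape for such a resolver (a sketch of the desired behaviour, not the PyLamarr implementation; all names and paths are hypothetical):

```python
import os
import urllib.request


def resolve(cvmfs_path: str, https_url: str, cache_dir: str) -> str:
    """Prefer the cvmfs copy; fall back to an https download only on a miss."""
    if os.path.exists(cvmfs_path):
        return cvmfs_path
    os.makedirs(cache_dir, exist_ok=True)
    local_copy = os.path.join(cache_dir, os.path.basename(https_url))
    if not os.path.exists(local_copy):
        urllib.request.urlretrieve(https_url, local_copy)
    return local_copy
```

With this scheme the exported XML would carry the (cvmfs_path, https_url) pair instead of a hardcoded local cache path.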
LamarrCollector and LamarrTuple
To ease the debugging and validation of the Lamarr integration with Gaussino, the concepts of MCCollector and MCTuple were ported to Sim/Lamarr/src/Components/LamarrCollector.cpp and Sim/Lamarr/src/Components/LamarrTuple.{h,cpp}.
The LamarrCollector algorithm can be configured by passing a Tables property, which expects a string-to-string mapping where the key represents the name of an output TTree stored in the LamarrCollector TDirectory, and the value is a fully-featured SQLite query selecting columns.
The names of the columns obtained from the SELECT query associated with each table define the names of the branches.
Branch types are limited to int, double and text.
Missing values (represented in SQLite as NULL) are converted to an int error code configured with the ErrorCode property.
All TTrees feature a batch_id column representing a unique identifier obtained by Cantor-pairing the run number and the event number.
A batch is conceptually different from an event, in that it is the set of events on which the foreign keys representing relations between different tables are valid; however, to cope with Gaudi's hard decoupling of events, single-event batches are adopted.
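The NULL-to-error-code conversion can be previewed directly in SQL, which is also a convenient way to check a query before wiring it into LamarrCollector; the table name and sentinel value below are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MCParticles (px REAL)")
con.executemany("INSERT INTO MCParticles VALUES (?)", [(1.5,), (None,)])

# NULLs surface in Python as None ...
rows = con.execute("SELECT px FROM MCParticles").fetchall()
# ... and can be mapped to a sentinel, analogous to the ErrorCode property.
ERROR_CODE = -999
mapped = con.execute(
    "SELECT COALESCE(px, ?) FROM MCParticles", (ERROR_CODE,)
).fetchall()
print(mapped)  # [(1.5,), (-999,)]
```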
To enable running in multithreaded mode, we introduced a std::mutex in LamarrCollector. This might be interesting for @mimazure for MCCollector as well?
Example.
The following snippet configures LamarrCollector to generate an nTuple with two TTrees, named MCParticles and Events. The MCParticles TTree will feature four branches: px, py and pz, as obtained from the query, plus the batch_id branch added by default to each TTree.
Similarly, the Events TTree will feature three branches: run_number, event_number and batch_id.
Note the use of the AS keyword to decouple the naming schemes of the SQL table and of the TTree representations.
```python
from Configurables import LamarrCollector

lamarr_collector = LamarrCollector(
    Tables={
        'MCParticles': "SELECT px, py, pz FROM MCParticles",
        'Events': "SELECT run_number, evt_number AS event_number FROM DataSources",
    }
)
```
GaussinoSimulation Configurable and Simulation backend
We modified the ConfigurableUser of GaussinoSimulation(), introducing the Backend slot.
By default, Backend is set to GaussinoSimulation.backend.GEANT and configures a standard Simulation phase relying on Geant4 and the Gaussino Geometry configuration.
By setting GaussinoSimulation().Backend = GaussinoSimulation.backend.LAMARR, however, the Simulation phase disables Geant4 (as for Generation-only configurations) and the Geometry configuration, and ensures Lamarr is part of the execution sequence.
The configuration of "Lamarr instead of Geant" in the Simulation phase reflects the configuration of "ParticleGun instead of Pythia" in the generation phase.
In practice, we modified the __apply_configuration__ entry point to disable the geometry and call the _configure_lamarr() method if the backend is set to "LAMARR".
```python
def __apply_configuration__(self):
    if GaussinoSimulation.only_generation_phase:
        log.debug("-> Only the generation phase, skipping.")
    elif self.getProp("Backend") == self.backend.LAMARR:
        log.debug("-> Parametric simulation, skipping Geant config")
        print("Disabling Gaussino Geometry")
        GaussinoGeometry().only_generation_phase = True
        self._configure_lamarr()
    else:
        self._set_giga_service()
        self._set_giga_alg()
        self._set_physics()
        self._set_truth_actions()
        # ensure GaussinoGeometry is enabled
        GaussinoGeometry()
```
The _configure_lamarr method is summarized below.
```python
def _configure_lamarr(self):
    from Configurables import Lamarr, ApplicationMgr

    if "Lamarr" not in Lamarr.configurables:
        msg = (
            "The simulation backend is set to use Lamarr, but no "
            "Lamarr() configurable was registered! Make sure to include "
            "all the required tools!"
        )
        log.error(msg)
        raise AttributeError(msg)
    ApplicationMgr().TopAlg += [Lamarr()]
```
Output format
The names of the tables in the SQLite database representing the event model, and of their columns, are completely defined by the pipeline configuration. This results in a highly non-standard database schema that may make data converters from the SQLite format to the experiment Event Model very difficult to maintain.
We should consider including, as a last step in the pipeline, a stage remapping the tables obtained from the parametrizations into tables defined according to the EDM4hep schema. This would imply modifications at the configuration level (as of today, in the pipeline.xml file) but no changes to the C++ code, and would make our lives easier when moving to Gauss to implement data converters.
We do not aim at such a conversion in this MR, though.
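Such a remapping step would itself be plain SQL, e.g. a table created from the parametrization output as the last pipeline transformation. A sketch follows; all table and column names are hypothetical and do not reflect the actual EDM4hep schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Output of a (hypothetical) parametrization, with pipeline-defined names.
con.execute("CREATE TABLE tmp_particles (mom_x REAL, mom_y REAL, mom_z REAL)")
con.execute("INSERT INTO tmp_particles VALUES (0.1, 0.2, 1.0)")

# Final pipeline step: remap into a table with standardized column names.
con.execute("""
    CREATE TABLE MCParticles AS
    SELECT mom_x AS px, mom_y AS py, mom_z AS pz FROM tmp_particles
""")
cursor = con.execute("SELECT * FROM MCParticles")
cols = [description[0] for description in cursor.description]
print(cols)  # ['px', 'py', 'pz']
```

Since this is pure SQL, it lives entirely in the pipeline configuration, with no C++ changes, as argued above.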
Test runs (success is "it compiles and does not crash")
Successful tests so far:
- 100 events with ParticleGun and Lamarr in single-thread mode
- 100 events with Pythia in single-thread mode, Minimum Bias
- 100 000 events with Pythia8 and Lamarr in 32 threads (Pythia in thread-local mode, or GaussinoGeneration().ProductionTool="Pythia8ProductionMT"), Minimum Bias
- Pythia8 and EvtGen producing some b hadrons
- ParticleGun and EvtGen producing some b hadrons
Validation runs (success is "numerical results are not obviously wrong")
-
TODO
Unit tests
-
TODO
Updated documentation
We drafted a docs/examples/lamarr.md tutorial on running Lamarr. The plan is to expand it to include a tutorial on creating custom parametrizations.
Preview: https://gaussino.docs.cern.ch/landerli_lamarr/examples/lamarr.html
TL;DR
- Edit the GaussinoSimulation configuration system to enable switching from Geant to Lamarr as "Simulation Backend";
- Implement a first draft of Lamarr, interfacing Gaussino to SQLamarr;
- Discuss and possibly abandon the Git submodule mechanism to manage the dependency of Gaussino on SQLamarr;
- Develop a configuration system enabling exporting and importing pipelines from Python to Gaussino;
- Consider better solutions for the configuration system, not relying on XML;
- Extend the pipeline to include tables following the EDM4hep schema;
- Run a first batch of tests, including multithreading, with Pythia and PGuns;
- Run validation runs with generators producing heavy hadrons (and including EvtGen);
- Extend the resolution mechanism of PyLamarr RemoteResources to download resources from https only upon misses on cvmfs;
- Improve the documentation of PyLamarr;
- Draft a tutorial for using Lamarr-on-Gaussino;
- Draft a tutorial on how to create a custom (simple) parametrization;
- Add unit tests