GSoC19 Executing Events in Parallel
This merge request concludes my Google Summer of Code 2019 project: Implement Event-based Seeding and Multi-Threading
Fixes #107 (closed)
This work is based on the previous work in GSoC18 and completes the effort needed to allow Allpix-Squared to run simulation events in parallel.
This is a WIP until:
- Remove unnecessary locking for Reader/Writer modules
- Complete testing and benchmarking
- Complete the documentation and user manual
Please note that the LXPLUS CI build is disabled for now, since it uses an outdated version of the Geant4 dependencies which needs to be updated.
All benchmarks, the project proposal, and all additional notes and artifacts created during the project can be found in this external repository, while all slides and presentations for the weekly meetings are available in this Google Drive folder. The weekly project timeline can also be seen here.
Event-based Multithreading
The new multithreading approach executes the event loop in parallel instead of sequentially. Parallelism now happens at the event level instead of the module level, which allows for better performance and better scaling on multicore machines.
A detailed account of the major changes to the different modules is given below, along with the testing plan, benchmarks comparing the current and the new multithreading approaches, a performance analysis with the Intel VTune profiler, and future work.
Event Class
For this purpose, the lightweight class `Event` is defined to hold all data for an event. The `ModuleManager` runs the event loop using the `ThreadPool` to execute independent events in parallel. For each event, an `Event` object is constructed with a unique number that identifies that specific event.
To ensure simulation reproducibility, each object holds a reference to a random number generator (RNG) used exclusively by the modules executing this event via the `getRandomNumber()` or `getRandomEngine()` methods. The RNG is seeded with a unique seed generated specifically for each event by the core RNG owned by the `ModuleManager`. The seed is drawn from the core RNG and saved with the event object before it is submitted for parallel execution, which guarantees the exact same seed distribution regardless of the number of threads and the event execution order.
To optimize the memory footprint of the `Event` class, the RNG object is stored as static thread-local. This ensures the framework maintains only as many heavyweight RNG objects as there are threads, which is the minimum needed to execute tasks independently.
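The seeding scheme above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the actual Allpix-Squared classes): seeds are drawn from the core RNG in event-number order before any parallel execution, and each worker thread re-seeds its single thread-local engine when an event starts, so the random sequence per event is independent of thread count and scheduling.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch of the per-event seeding described above.
struct Event {
    uint64_t number;
    uint64_t seed; // drawn sequentially from the core RNG by the manager

    // One heavyweight RNG object per thread (static thread-local storage).
    static std::mt19937_64& engine() {
        static thread_local std::mt19937_64 rng;
        return rng;
    }
    // Re-seed the thread-local engine with this event's own seed
    // before the event's modules start drawing numbers.
    void beginRun() { engine().seed(seed); }
    uint64_t getRandomNumber() { return engine()(); }
};

// Stand-in for the ModuleManager: draws one seed per event, in event order,
// before the events are submitted for parallel execution.
std::vector<Event> seedEvents(uint64_t core_seed, uint64_t n) {
    std::mt19937_64 core(core_seed); // core RNG owned by the manager
    std::vector<Event> events;
    for(uint64_t i = 1; i <= n; ++i) {
        events.push_back(Event{i, core()});
    }
    return events;
}
```

Because the seeds are fixed up front, replaying an event after running others in between still reproduces the same random sequence.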
Module thread safety
In this new approach, all modules are expected to run multiple events at the same time. Therefore, the module's `run` method should be re-entrant, and the module author is expected to synchronize access to any shared member variables.
By default, the framework assumes a module does not support parallelization, to maintain backward compatibility. The module's author must therefore specify in the module's constructor that the module is ready to run events in parallel by calling `enable_parallelization()`. It is important that this happens in the constructor, so that the framework can decide up front whether the current configuration, i.e. all the modules, can be run in parallel or not.
If, for example, some modules in a given configuration do not support multithreading, the framework falls back to single-threaded execution and informs all modules about that decision. Modules should therefore use the `canParallelize()` method for multithreading-specific logic.
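The pattern can be sketched as follows. The names `enable_parallelization()` and the re-entrancy requirement come from the text above, but the base class and the digitizer module here are simplified stand-ins, not the framework's actual headers.

```cpp
#include <cassert>
#include <mutex>

// Simplified stand-in for the framework's module base class.
class Module {
public:
    virtual ~Module() = default;
    bool multithreadingEnabled() const { return parallelize_; }

protected:
    // Declares MT readiness; per the text, must be called in the constructor.
    void enable_parallelization() { parallelize_ = true; }

private:
    bool parallelize_{false}; // default: no parallelization (backward compat)
};

// Hypothetical example module that opts into parallel event execution.
class MyDigitizerModule : public Module {
public:
    MyDigitizerModule() {
        // Declared in the constructor so the framework can decide, before
        // the event loop starts, whether the whole configuration can run
        // in parallel.
        enable_parallelization();
    }

    void run(/* Event& event */) {
        // run() must be re-entrant: synchronize access to shared members.
        std::lock_guard<std::mutex> lock(stats_mutex_);
        ++processed_;
    }

    int processed() const { return processed_; }

private:
    std::mutex stats_mutex_;
    int processed_{0};
};
```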
Writer Modules
Since modules that write to an output file need a way to process events in ascending order of the event number, the `BufferedModule` is provided. It abstracts a buffering mechanism that holds out-of-order events and executes them later.
Such modules should inherit from `BufferedModule` and implement the same interface as `Module`. They can expect that each time the `run` method is called, the events arrive in the correct order and can be written to the file without further sorting or storing.
These modules are restricted from dispatching messages via the messenger, since other modules may have already finished processing the current event.
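The core buffering idea can be illustrated with a small sketch. This is a hypothetical, single-threaded simplification (the real `BufferedModule` would guard the buffer with a lock, as discussed in the performance analysis below): out-of-order events are parked in an ordered map, and the derived `run()` is only invoked once all lower event numbers have been handled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of the buffering mechanism; locking omitted.
class BufferedWriter {
public:
    // Called by worker threads in arbitrary event order.
    void runBuffered(uint64_t event_number) {
        buffer_[event_number] = event_number; // park the event
        // Flush every buffered event whose number is next in line.
        while(!buffer_.empty() && buffer_.begin()->first == next_) {
            run(buffer_.begin()->second); // derived run(), always in order
            buffer_.erase(buffer_.begin());
            ++next_;
        }
    }

    const std::vector<uint64_t>& written() const { return written_; }

private:
    // Stand-in for the derived module's run(): "writes" the event.
    void run(uint64_t event_number) { written_.push_back(event_number); }

    std::map<uint64_t, uint64_t> buffer_; // out-of-order events, sorted by number
    uint64_t next_{1};                    // next event number to write
    std::vector<uint64_t> written_;       // output file stand-in
};
```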
Messenger Class
The messenger class was split into two classes with different responsibilities. To maintain backward compatibility, the `Messenger` class is kept as a single-instance object that is passed to modules and used to bind message listeners. However, modules are now expected to fetch messages themselves, contrary to the previous mechanism of specifying a member variable that is updated when a message is dispatched. A `MessageNotFoundException` is thrown if a module tries to fetch messages that were never sent.
The public API of the `Messenger` has changed and now requires the `Event` object as a parameter for each of its calls, except when registering a message listener. The reason for this lies in the second class, `LocalMessenger`, which holds all logic and the actual storage for messages. Each event has an instance of this new class, which maintains module communication within that event.
The `Messenger` only keeps the information about message subscriptions, which is shared across all local messengers. Additionally, it now supports subscribing to multiple message types and offers the mechanisms required to check whether all required messages for a given module were dispatched.
All calls to `Messenger::dispatchMessage` and `Messenger::fetchMessage` are handled internally by the `LocalMessenger` responsible for the given `Event` object. The `Messenger` thus transparently works as a single point of communication between modules, while event-specific data is stored separately.
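The split can be sketched as below. This is a deliberately simplified, hypothetical model (string-keyed payloads instead of typed messages): the global `Messenger` forwards dispatch/fetch calls to the `LocalMessenger` of the given event, so message storage stays per-event while modules keep a single point of contact.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Per-event message storage (hypothetical simplification of LocalMessenger).
struct LocalMessenger {
    std::map<std::string, std::string> messages; // message type -> payload

    void dispatch(const std::string& type, const std::string& payload) {
        messages[type] = payload;
    }
    std::string fetch(const std::string& type) const {
        auto it = messages.find(type);
        if(it == messages.end()) {
            // Stand-in for MessageNotFoundException.
            throw std::runtime_error("MessageNotFoundException: " + type);
        }
        return it->second;
    }
};

// Each event owns its local messenger, so no cross-event locking is needed.
struct Event {
    LocalMessenger local;
};

// Single-instance facade: every call takes the Event and forwards to its
// LocalMessenger, keeping event-specific data separated transparently.
struct Messenger {
    void dispatchMessage(Event& event, const std::string& type, const std::string& payload) {
        event.local.dispatch(type, payload);
    }
    std::string fetchMessage(Event& event, const std::string& type) {
        return event.local.fetch(type);
    }
};
```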
ThreadPool
The `ThreadPool` now transparently executes the given tasks on the caller thread if it has no workers. This is the case when running Allpix-Squared without the multithreading option: the `ModuleManager` constructs the pool with 0 threads when multithreading is not requested. This change is needed to accommodate Geant4 requirements, as discussed in the Geant4 section.
Furthermore, the thread pool is initialized with a fixed queue size. Once the queue is full, calls submitting a new task will block. This is needed to submit the events in batches and avoid using more memory than necessary.
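Both behaviours can be illustrated with a minimal, single-threaded sketch (hypothetical API, not the framework's actual `ThreadPool`): with zero workers, `submit()` runs the task inline on the caller's thread; otherwise tasks go into a bounded queue, and a full queue rejects further submissions (the real pool would block the caller instead).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>

// Hypothetical sketch: inline execution with 0 workers, bounded queue otherwise.
class ThreadPool {
public:
    ThreadPool(unsigned workers, std::size_t max_queue)
        : workers_(workers), max_queue_(max_queue) {}

    // Returns false when the bounded queue is full; the real pool would
    // block the submitting thread until a slot frees up.
    bool submit(std::function<void()> task) {
        if(workers_ == 0) {
            task(); // no workers: run on the caller thread (MT disabled)
            return true;
        }
        if(queue_.size() >= max_queue_) {
            return false; // queue full: throttles memory usage
        }
        queue_.push(std::move(task));
        return true;
    }

    // Drain one queued task, standing in for a worker thread picking it up.
    void runOne() {
        if(!queue_.empty()) {
            queue_.front()();
            queue_.pop();
        }
    }

private:
    unsigned workers_;
    std::size_t max_queue_;
    std::queue<std::function<void()>> queue_;
};
```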
Geant4
The main focus of the GSoC19 project was to fix the problem of using Geant4 in parallel. More specifically, the provided run managers `G4RunManager` and `G4MTRunManager` do not fit the framework's use case: the first cannot be called from multiple threads, and the second provides internal parallelism by creating its own threads to execute the call to `BeamOn`. However, Allpix-Squared requires that all modules are called by its own threads without creating further threads, and the framework should not need knowledge of Geant4-specific details, since in theory a user can configure Allpix-Squared without Geant4 at all.
To overcome these issues, two customized run managers for the Geant4 library were developed:
- `RunManager` to replace `G4RunManager`
- `MTRunManager` to replace `G4MTRunManager`
The new run manager `MTRunManager` provides the following new APIs:
- `void InitializeForThread()`: must be called by the Allpix-Squared thread that is going to use the manager later on
- `void Run(event_number, number_of_particles)`: multithreaded equivalent of `BeamOn`; the event number is used to choose unique seeds to initialize the RNG specific to this thread. This must not be called by the main thread that initialized the manager!
- `void TerminateForThread()`: to be called by each thread to clean up thread-specific resources
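The calling pattern from the list above can be sketched with a stub. The method names follow the text, but the stub's bodies are placeholders, not Geant4 code: each Allpix-Squared worker thread initializes its thread-local state, runs its events, then cleans up, while the main thread that constructed the manager never calls `Run` itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stub standing in for the custom MTRunManager, to show the per-thread
// call sequence only; real implementations wrap Geant4 state.
struct MTRunManagerStub {
    std::atomic<int> events_run{0};

    void InitializeForThread() { /* build thread-local worker state */ }
    void Run(uint64_t /*event_number*/, int /*number_of_particles*/) {
        ++events_run; // stand-in for running the event on this thread
    }
    void TerminateForThread() { /* free thread-local resources */ }
};
```

A typical usage from the framework's worker threads would look like: each thread calls `InitializeForThread()` once, `Run(...)` for every event assigned to it, and `TerminateForThread()` before exiting.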
Under the hood, the `MTRunManager` (master) creates thread-local `WorkerRunManager`s (workers). These managers share the world geometry, but each has its own user hooks and actions. To ensure that simulations can be reproduced, the master manager maintains a random number generator (RNG) used to generate a list of random numbers, each associated with a specific event number and used to initialize the worker's RNG when it is invoked to run that specific event.
While the new `MTRunManager` makes it possible to call the `DepositionGeant4Module` in parallel and to remove the specific checks added earlier to run Geant4 modules, it was not sufficient to make the `VisualizationGeant4Module` work, since that module calls `BeamOn` behind the scenes in its `finalize()` method on the main thread, which is not what `MTRunManager` was customized for. To fix this, another custom manager, `RunManager`, was developed. It is a minimal customization of `G4RunManager` that uses the same RNG initialization technique as the `MTRunManager`, ensuring that the two can be used interchangeably.
ROOT Dependency
Since ROOT version 6.12, a new type of locking was introduced to replace the old mechanism used by `TRef` during copying and destruction. However, these changes affect the framework's multithreading approach when using more than 8 threads: in such cases the performance degrades, as all threads are locked whenever any `TRef` is being copied.
The issue was previously reported and some improvements were made to reduce the degradation, but the problem still exists in the latest version of ROOT.
A new report has been submitted to the ROOT forum against the latest ROOT version.
Testing
A new test case, `test_performance/test_03_multithreading.conf`, was added to gauge the framework's multithreading capabilities.
My test plan for this feature includes:
- Running all the CI tests
- Running all the CI tests with MT forced ON
- Running all benchmarks with and without MT, with different numbers of workers
- Comparing the TextWriter module output between different configurations
- Comparing the output of all buffered modules with MT enabled, to ensure identical output files
- Running the example configuration with Valgrind
Benchmarks
The benchmarks compare the current multithreading approach against the new multithreading feature in the framework. The new set of benchmarks, which mimics a typical Allpix setup, can be seen here, and the results can be accessed on CERN AFS at `/afs/cern.ch/user/m/momali/GSoC19/old_benchmarks` or seen here.
The benchmarks were executed on CERN LXPLUS using a 40-core machine.
Performance Analysis
To gauge the framework's new performance, the following analysis with the Intel VTune profiler was carried out using a 12-detector setup running on 8 cores:
- CPU Time: there are some gaps of inactivity due to the synchronization overhead of a writer module.
- CPU Wait Time: note the overhead of the writer `BufferedModule`. The worst case happens when a thread writing to the output file flushes the buffered events on that same thread. Other threads trying to write at the same time are then blocked, since they cannot store events in the shared buffer while it is being read.
Future Work
To further improve the framework's design and performance, the following additional work can be noted:
- Reschedule buffered events in the `BufferedModule`: currently, the `BufferedModule` saves out-of-order events and executes them later on the same thread. It would be better to re-submit such buffered events to the `ThreadPool`.
- Don't fall back to single-threaded execution: when one or more modules do not declare themselves multithreading-ready, the framework falls back to executing on a single thread. However, the other modules could still be executed in parallel, as in the current multithreading approach. This can be tricky, as in the case of `VisualizationGeant4`, which cannot execute in parallel and forces the other Geant4 modules not to execute in parallel either.