GSoC19 Executing Events in Parallel
This merge request concludes my Google Summer of Code 2019 project: Implement Event-based Seeding and Multi-Threading
Fixes #107 (closed)
This work is based on the previous work in GSoC18 and completes the effort needed to allow Allpix-Squared to run simulation events in parallel.
This is a WIP until:
- Remove unnecessary locking for Reader/Writer modules
- Complete testing and benchmarking
- Complete the documentation and user manual
Please note that the LXPLUS CI build is disabled for now, since it uses an outdated version of the Geant4 dependencies which needs to be updated.
All benchmarks, the project proposal, and all additional notes and artifacts created during the project can be found in this external repository, while all slides and presentations for the weekly meetings are available in this Google Drive folder. The weekly project timeline can also be seen here.
Event-based Multithreading
The new multithreading approach executes the event loop in parallel instead of sequentially. Parallelism now happens at the event level instead of the module level, which allows for better performance and better scaling on multicore machines.
A detailed account of the major changes to the different modules is given below, along with the testing plan, benchmarks comparing the current and the new multithreading approaches, a performance analysis with the Intel VTune profiler, and future work.
Event Class
For this purpose, the lightweight class `Event` is defined to hold all data for an event. The `ModuleManager` runs the event loop using the `ThreadPool` to execute independent events in parallel. For each event, an `Event` object is constructed with a unique number that identifies that specific event.
To ensure simulation reproducibility, each object holds a reference to a random number generator (RNG) used exclusively by the modules executing this event via the `getRandomNumber()` or `getRandomEngine()` methods. The RNG is seeded with a unique seed generated specifically for each event by the core RNG owned by the `ModuleManager`. The seed is drawn from the core RNG and saved with the event object before it is submitted for parallel execution, which guarantees the exact same seed distribution regardless of the number of threads and the event execution order.
To optimize the memory footprint of the `Event` class, the RNG object is stored as static thread-local. This ensures the framework maintains only as many heavyweight RNG objects as there are threads, which is the minimum needed to execute tasks independently.
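The seeding scheme above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the actual Allpix-Squared classes): seeds are drawn from the core RNG in event-number order before any parallel execution, and each worker thread re-seeds its single thread-local engine when an event starts, so the random sequence per event is independent of thread count and scheduling.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch of the per-event seeding described above.
struct Event {
    uint64_t number;
    uint64_t seed; // drawn sequentially from the core RNG by the manager

    // One heavyweight RNG object per thread (static thread-local storage).
    static std::mt19937_64& engine() {
        static thread_local std::mt19937_64 rng;
        return rng;
    }
    // Re-seed the thread-local engine with this event's own seed
    // before the event's modules start drawing numbers.
    void beginRun() { engine().seed(seed); }
    uint64_t getRandomNumber() { return engine()(); }
};

// Stand-in for the ModuleManager: draws one seed per event, in event order,
// before the events are submitted for parallel execution.
std::vector<Event> seedEvents(uint64_t core_seed, uint64_t n) {
    std::mt19937_64 core(core_seed); // core RNG owned by the manager
    std::vector<Event> events;
    for(uint64_t i = 1; i <= n; ++i) {
        events.push_back(Event{i, core()});
    }
    return events;
}
```

Because the seeds are fixed up front, replaying an event after running others in between still reproduces the same random sequence.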
Module thread safety
In this new approach, all modules are expected to run multiple events at the same time. Therefore, the module's `run` method should be re-entrant, and the module author is expected to synchronize access to any shared member variables.
By default, the framework assumes a module does not support parallelization, to maintain backward compatibility. The module's author must therefore specify in the module's constructor that the module is ready to run events in parallel by calling `enable_parallelization()`. It is important that this happens in the constructor, so that the framework can decide up front whether the current configuration, i.e. all the modules, can be run in parallel or not.
If, for example, some modules in a given configuration do not support multithreading, the framework falls back to single-threaded execution and informs all modules about that decision. Modules should therefore use the `canParallelize()` method for multithreading-specific logic.
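The pattern can be sketched as follows. The names `enable_parallelization()` and the re-entrancy requirement come from the text above, but the base class and the digitizer module here are simplified stand-ins, not the framework's actual headers.

```cpp
#include <cassert>
#include <mutex>

// Simplified stand-in for the framework's module base class.
class Module {
public:
    virtual ~Module() = default;
    bool multithreadingEnabled() const { return parallelize_; }

protected:
    // Declares MT readiness; per the text, must be called in the constructor.
    void enable_parallelization() { parallelize_ = true; }

private:
    bool parallelize_{false}; // default: no parallelization (backward compat)
};

// Hypothetical example module that opts into parallel event execution.
class MyDigitizerModule : public Module {
public:
    MyDigitizerModule() {
        // Declared in the constructor so the framework can decide, before
        // the event loop starts, whether the whole configuration can run
        // in parallel.
        enable_parallelization();
    }

    void run(/* Event& event */) {
        // run() must be re-entrant: synchronize access to shared members.
        std::lock_guard<std::mutex> lock(stats_mutex_);
        ++processed_;
    }

    int processed() const { return processed_; }

private:
    std::mutex stats_mutex_;
    int processed_{0};
};
```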
Writer Modules
Since modules that write to an output file need a way to process events in ascending order of the event number, the `BufferedModule` is provided. It abstracts a buffering mechanism that holds out-of-order events and executes them later.
Such modules should inherit from `BufferedModule` and implement the same interface as `Module`. They can expect that each time the `run` method is called, the events arrive in the correct order and can be written to the file without further sorting or storing.
These modules are restricted from dispatching messages via the messenger, since other modules may have already finished processing the current event.
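The core buffering idea can be illustrated with a small sketch. This is a hypothetical, single-threaded simplification (the real `BufferedModule` would guard the buffer with a lock, as discussed in the performance analysis below): out-of-order events are parked in an ordered map, and the derived `run()` is only invoked once all lower event numbers have been handled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of the buffering mechanism; locking omitted.
class BufferedWriter {
public:
    // Called by worker threads in arbitrary event order.
    void runBuffered(uint64_t event_number) {
        buffer_[event_number] = event_number; // park the event
        // Flush every buffered event whose number is next in line.
        while(!buffer_.empty() && buffer_.begin()->first == next_) {
            run(buffer_.begin()->second); // derived run(), always in order
            buffer_.erase(buffer_.begin());
            ++next_;
        }
    }

    const std::vector<uint64_t>& written() const { return written_; }

private:
    // Stand-in for the derived module's run(): "writes" the event.
    void run(uint64_t event_number) { written_.push_back(event_number); }

    std::map<uint64_t, uint64_t> buffer_; // out-of-order events, sorted by number
    uint64_t next_{1};                    // next event number to write
    std::vector<uint64_t> written_;       // output file stand-in
};
```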
Messenger Class
The messenger class was split into two classes with different responsibilities. To maintain backward compatibility, the `Messenger` class is kept as a single-instance object that is passed to modules and used to bind message listeners. However, modules are now expected to fetch messages themselves, contrary to the previous mechanism of specifying a member variable that is updated when a message is dispatched. A `MessageNotFoundException` is thrown if a module tries to fetch messages that were never sent.
The public API of the `Messenger` has changed and now requires the `Event` object as a parameter for each of its calls, except when registering a message listener. The reason for this lies in the second class, `LocalMessenger`, which holds all logic and the actual storage for messages. Each event has an instance of this new class, which maintains module communication within that event.
The `Messenger` only keeps the information about message subscriptions, which is shared across all local messengers. Additionally, it now supports subscribing to multiple message types and offers the mechanisms required to check whether all required messages for a given module were dispatched.
All calls to `Messenger::dispatchMessage` and `Messenger::fetchMessage` are handled internally by the `LocalMessenger` responsible for the given `Event` object. The `Messenger` thus transparently works as a single point of communication between modules, while event-specific data is stored separately.
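The split can be sketched as below. This is a deliberately simplified, hypothetical model (string-keyed payloads instead of typed messages): the global `Messenger` forwards dispatch/fetch calls to the `LocalMessenger` of the given event, so message storage stays per-event while modules keep a single point of contact.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Per-event message storage (hypothetical simplification of LocalMessenger).
struct LocalMessenger {
    std::map<std::string, std::string> messages; // message type -> payload

    void dispatch(const std::string& type, const std::string& payload) {
        messages[type] = payload;
    }
    std::string fetch(const std::string& type) const {
        auto it = messages.find(type);
        if(it == messages.end()) {
            // Stand-in for MessageNotFoundException.
            throw std::runtime_error("MessageNotFoundException: " + type);
        }
        return it->second;
    }
};

// Each event owns its local messenger, so no cross-event locking is needed.
struct Event {
    LocalMessenger local;
};

// Single-instance facade: every call takes the Event and forwards to its
// LocalMessenger, keeping event-specific data separated transparently.
struct Messenger {
    void dispatchMessage(Event& event, const std::string& type, const std::string& payload) {
        event.local.dispatch(type, payload);
    }
    std::string fetchMessage(Event& event, const std::string& type) {
        return event.local.fetch(type);
    }
};
```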
ThreadPool
The `ThreadPool` now transparently executes the given tasks on the caller thread if it has no workers. This is the case when running Allpix-Squared without the multithreading option: the `ModuleManager` constructs the pool with 0 threads when multithreading is not requested. This change is needed to accommodate Geant4 requirements, as discussed in the Geant4 section.
Furthermore, the thread pool is initialized with a fixed queue size. Once the queue is full, calls submitting a new task will block. This is needed to submit the events in batches and avoid using more memory than necessary.
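Both behaviours can be illustrated with a minimal, single-threaded sketch (hypothetical API, not the framework's actual `ThreadPool`): with zero workers, `submit()` runs the task inline on the caller's thread; otherwise tasks go into a bounded queue, and a full queue rejects further submissions (the real pool would block the caller instead).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>

// Hypothetical sketch: inline execution with 0 workers, bounded queue otherwise.
class ThreadPool {
public:
    ThreadPool(unsigned workers, std::size_t max_queue)
        : workers_(workers), max_queue_(max_queue) {}

    // Returns false when the bounded queue is full; the real pool would
    // block the submitting thread until a slot frees up.
    bool submit(std::function<void()> task) {
        if(workers_ == 0) {
            task(); // no workers: run on the caller thread (MT disabled)
            return true;
        }
        if(queue_.size() >= max_queue_) {
            return false; // queue full: throttles memory usage
        }
        queue_.push(std::move(task));
        return true;
    }

    // Drain one queued task, standing in for a worker thread picking it up.
    void runOne() {
        if(!queue_.empty()) {
            queue_.front()();
            queue_.pop();
        }
    }

private:
    unsigned workers_;
    std::size_t max_queue_;
    std::queue<std::function<void()>> queue_;
};
```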
Geant4
The main focus of the GSoC19 project was to fix the problem of using Geant4 in parallel. More specifically, the provided run managers `G4RunManager` and `G4MTRunManager` do not fit the framework's use case: the first cannot be called from multiple threads, and the second provides internal parallelism by creating its own threads to execute the call to `BeamOn`. However, Allpix-Squared requires that all modules are called by its own threads without creating further threads, and the framework should not need knowledge of Geant4-specific details, since in theory a user can configure Allpix-Squared without Geant4 at all.
To overcome these issues, two customized run managers for the Geant4 library were developed:
- `RunManager` to replace `G4RunManager`
- `MTRunManager` to replace `G4MTRunManager`
The new run manager `MTRunManager` provides the following new APIs:
- `void InitializeForThread()`: must be called by the Allpix-Squared thread that is going to use the manager later on
- `void Run(event_number, number_of_particles)`: multithreaded equivalent of `BeamOn`; the event number is used to choose unique seeds to initialize the RNG specific to this thread. This must not be called by the main thread that initialized the manager!
- `void TerminateForThread()`: to be called by each thread to clean up thread-specific resources
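The calling pattern from the list above can be sketched with a stub. The method names follow the text, but the stub's bodies are placeholders, not Geant4 code: each Allpix-Squared worker thread initializes its thread-local state, runs its events, then cleans up, while the main thread that constructed the manager never calls `Run` itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stub standing in for the custom MTRunManager, to show the per-thread
// call sequence only; real implementations wrap Geant4 state.
struct MTRunManagerStub {
    std::atomic<int> events_run{0};

    void InitializeForThread() { /* build thread-local worker state */ }
    void Run(uint64_t /*event_number*/, int /*number_of_particles*/) {
        ++events_run; // stand-in for running the event on this thread
    }
    void TerminateForThread() { /* free thread-local resources */ }
};
```

A typical usage from the framework's worker threads would look like: each thread calls `InitializeForThread()` once, `Run(...)` for every event assigned to it, and `TerminateForThread()` before exiting.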
Under the hood, the `MTRunManager` (master) creates thread-local `WorkerRunManager`s (workers). These managers share the world geometry, but each has its own user hooks and actions. To ensure that simulations can be reproduced, the master manager maintains a random number generator (RNG) used to generate a list of random numbers, each associated with a specific event number and used to initialize the worker's RNG when it is invoked to run that specific event.
While the new `MTRunManager` makes it possible to call the `DepositionGeant4Module` in parallel and to remove the specific checks added earlier to run Geant4 modules, it was not sufficient to make the `VisualizationGeant4Module` work, since that module calls `BeamOn` behind the scenes in its `finalize()` method on the main thread, which is not what `MTRunManager` was customized for. To fix this, another custom manager, `RunManager`, was developed. It is a minimal customization of `G4RunManager` that uses the same RNG initialization technique as the `MTRunManager`, ensuring that the two can be used interchangeably.
ROOT Dependency
Since ROOT version 6.12, a new type of locking was introduced to replace the old mechanism used by `TRef` during copying and destruction. However, these changes affect the framework's multithreading approach when using more than 8 threads: in such cases the performance degrades, as all threads are locked whenever any `TRef` is being copied.
The issue was previously reported and some improvements were made to reduce the degradation, but the problem still exists in the latest version of ROOT.
A new report has been submitted to the ROOT forum against the latest ROOT version.
Testing
A new test case, `test_performance/test_03_multithreading.conf`, was added to gauge the framework's multithreading capabilities.
My test plan for this feature includes:
- Running all the CI tests
- Running all the CI tests with MT forced ON
- Running all benchmarks with and without MT, with different numbers of workers
- Comparing the TextWriter module output between different configurations
- Comparing the output of all buffered modules with MT enabled, to ensure identical output files
- Running the example configuration with Valgrind
Benchmarks
The benchmarks compare the current multithreading approach against the new multithreading feature in the framework. The new set of benchmarks, which mimics a typical Allpix setup, can be seen here, and the results can be accessed on CERN AFS at `/afs/cern.ch/user/m/momali/GSoC19/old_benchmarks` or seen here.
The benchmarks were executed on CERN LXPLUS using a 40-core machine.
Performance Analysis
To gauge the framework's new performance, the following analysis with the Intel VTune profiler was carried out using a 12-detector setup running on 8 cores:
- CPU Time: there are some gaps of inactivity due to the synchronization overhead of a writer module.
- CPU Wait Time: note the overhead of the writer `BufferedModule`. The worst case happens when a thread writing to the output file flushes the buffered events on that same thread. Other threads trying to write at the same time are then blocked, since they cannot store events in the shared buffer while it is being read.
Future Work
To further improve the framework's design and performance, the following additional work can be noted:
- Reschedule buffered events in the `BufferedModule`: currently, the `BufferedModule` saves out-of-order events and executes them later on the same thread. It would be better to re-submit such buffered events to the `ThreadPool`.
- Don't fall back to single-threaded execution: when one or more modules do not declare themselves multithreading-ready, the framework falls back to executing on a single thread. However, the other modules could still be executed in parallel, as in the current multithreading approach. This can be tricky, as in the case of `VisualizationGeant4`, which cannot execute in parallel and forces the other Geant4 modules not to execute in parallel either.