Simon Spannagel requested to merge mt-overhaul into master Aug 19, 2018

(This is a copy of the original merge request filed against our mirrored repository on GitHub. The original merge request can be found here and has been prepared by Viktor Sonesten, our GSoC 2018 student. The description below is taken from his pull request.)

Overhaul multi-threading approach: execute events in parallel

This pull request constitutes my Google Summer of Code 2018 project: Improve multi-threading support for CERN’s Allpix-Squared project using dependency graphs.

This branch (mt-overhaul) overhauls the multi-threading approach of the framework to simulate events in parallel instead of parallelizing the execution of modules that are explicitly thread-safe. A detailed account of the changes to the multi-threading approach is available below. Benchmarks between the old and new approach, along with another branch of interest, are also available below.

Configurations used for benchmarking can be found here.

Contrary to the title of the project, this new approach does not utilize dependency graphs: building one from a simulation's events would take up too much memory (e.g. for a simulation with a million events) and would also clash with the general design of the framework. However, this project does improve the multi-threading support for Allpix² in terms of performance and resource utilization.

The GitLab CI tests pass for this branch and are available here.

Multi-threading is still considered an experimental feature in Allpix². Until all file writer modules have been confirmed to be working correctly, and until the performance of Allpix² properly scales when using more than eight workers, it will remain experimental.

When the above has been confirmed, this branch will likely be merged into the framework for a new major release, Allpix-Squared v2.0.

Through this project, the multi-threading performance of Allpix² has been almost completely optimized for worker/core counts up to 8 or, if the ROOT inheritance is removed from the base Allpix² object, for worker/core counts up to 32. The main target of Allpix² is systems with four CPU cores. Removal of the ROOT inheritance is done with this commit.

Multi-threading approach

Allpix² now supports experimental multi-threading by running events in parallel. For each configured event, an Event instance is constructed and submitted to the thread pool. As soon as an event is pushed to the thread pool, a free worker picks it up and executes it; when done, the worker picks up the next event. When all events have been executed, Allpix² begins terminating.
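The flow described above can be sketched roughly as follows. This is a minimal illustration and not the framework's actual thread pool: the Event struct, the ThreadPool class and the simulate() helper are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the framework's Event class.
struct Event {
    unsigned number;
    bool executed;
    void run() { executed = true; } // a real event would run all modules in sequence
};

// Minimal pool (assumption, not the Allpix² implementation): workers pop events
// from a shared queue until the queue is closed and drained.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n_workers) {
        for(unsigned i = 0; i < n_workers; ++i) {
            workers_.emplace_back([this] { work(); });
        }
    }

    void submit(Event* event) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(event);
        }
        cv_.notify_one();
    }

    // Signal that no more events will arrive, then wait for the workers.
    // Must be called before the pool is destroyed.
    void finish() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
        for(auto& worker : workers_) {
            worker.join();
        }
    }

private:
    void work() {
        while(true) {
            Event* event = nullptr;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return closed_ || !queue_.empty(); });
                if(queue_.empty()) {
                    return; // closed and nothing left to execute
                }
                event = queue_.front();
                queue_.pop();
            }
            event->run(); // executed outside the lock so events run in parallel
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Event*> queue_;
    bool closed_ = false;
    std::vector<std::thread> workers_;
};

// Construct one Event per configured event number and submit it to the pool.
std::vector<Event> simulate(unsigned n_events, unsigned n_workers) {
    std::vector<Event> events;
    events.reserve(n_events);
    for(unsigned n = 1; n <= n_events; ++n) {
        events.push_back(Event{n, false});
    }
    ThreadPool pool(n_workers);
    for(auto& event : events) {
        pool.submit(&event);
    }
    pool.finish();
    return events;
}
```

Calling simulate(100, 4) would execute 100 events on four workers; each worker takes the next queued event as soon as it finishes the previous one.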

Module thread safety

An event is constructed with a shared list of the loaded modules. Inside an event, and thus for each worker, these modules are run sequentially, modifying the Event instance given to them. In the context of an event, the only module function used is the Module::run(Event*)-method. To enable thread safety, this function has been made constant; the run()-method may not access or modify member fields of the module unless the module author has explicitly allowed this. If a module's run()-function does access a member field (which is then accessed by multiple workers) it must be appropriately protected with a mutex and should for performance reasons only be used to record module statistics.
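As an illustration of this rule, the sketch below shows a module whose run() method is const and whose only shared state is a mutex-protected statistics counter. It assumes that "explicitly allowed" maps to the C++ mutable keyword; CounterModule, the Event fields and run_on_workers() are hypothetical names, not framework API.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical event carrying some event-local data.
struct Event {
    unsigned number;
    double deposited_charge;
};

class CounterModule {
public:
    // const: members cannot be modified here unless they are declared mutable.
    void run(Event* event) const {
        event->deposited_charge += 1.0; // event-local state, no locking needed

        // Shared statistics: explicitly opted in via mutable, guarded by a mutex.
        std::lock_guard<std::mutex> lock(stats_mutex_);
        ++events_seen_;
    }

    unsigned events_seen() const {
        std::lock_guard<std::mutex> lock(stats_mutex_);
        return events_seen_;
    }

private:
    mutable std::mutex stats_mutex_;
    mutable unsigned events_seen_ = 0;
};

// Run the same module instance on several workers, each handling a slice of events.
unsigned run_on_workers(const CounterModule& module, std::vector<Event>& events, unsigned n_workers) {
    std::vector<std::thread> workers;
    for(unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            for(std::size_t i = w; i < events.size(); i += n_workers) {
                module.run(&events[i]);
            }
        });
    }
    for(auto& worker : workers) {
        worker.join();
    }
    return module.events_seen();
}
```

Keeping the locked region down to a single counter update reflects the performance advice above: the mutex is held only for bookkeeping, never for the actual work of the event.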

Readers and writers

Modules that read from and write to files are not subject to the rule presented in the previous section. Between two simulations using the same configuration, if the simulated events are to be written to file, the two simulations must produce the same output file. To achieve this, a module author will appropriately mark their module as a reader or writer via inheritance. During run-time, Allpix² checks whether the module is a reader, writer or neither; if it does access files, it will be run in order of increasing event number. E.g. a reader/writer of event 4 will be run before the same module(s) of event 5. If it isn't a specific event's turn to run this module, the thread will block and wait for its turn. While blocking, the thread does no work. A reader and a writer may run in parallel, because they do not access the same files.
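The ordering constraint can be sketched with a condition variable, as below. This is only one way such a gate could look, not necessarily how the branch implements it; OrderedSection and write_events() are hypothetical names, and the example uses one thread per event purely to keep it short.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Gate that admits events strictly in order of increasing event number.
class OrderedSection {
public:
    // Block until it is event_number's turn, run the action, then admit the next event.
    template <typename F> void run_in_order(unsigned event_number, F&& action) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return next_ == event_number; }); // thread does no work while waiting
        action();
        ++next_;
        lock.unlock();
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    unsigned next_ = 1; // event numbers are assumed to start at 1
};

// Let n_events "events" write their number to a stand-in output file.
std::vector<unsigned> write_events(unsigned n_events) {
    OrderedSection writer;
    std::vector<unsigned> file; // stand-in for the output file
    std::vector<std::thread> threads;
    // Start the threads in reverse so that later events tend to arrive first.
    for(unsigned n = n_events; n >= 1; --n) {
        threads.emplace_back([&, n] { writer.run_in_order(n, [&] { file.push_back(n); }); });
    }
    for(auto& thread : threads) {
        thread.join();
    }
    return file;
}
```

Regardless of the order in which the threads reach the gate, the stand-in file ends up containing the event numbers in increasing order, which is what makes two runs with the same configuration produce the same output.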

Seed distribution

To enable two matching simulations to produce the same output, the seed distribution must also match. Each Event instance now owns a random engine that can be accessed and modified in whatever manner a running module wishes. For a stable seed distribution, Allpix² holds a top-level random engine that seeds every event's engine sequentially, in order of increasing event number. This ensures that two matching simulations produce the same output, even when the number of workers does not match.
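A minimal sketch of this seeding scheme is shown below. The choice of std::mt19937_64 and the names Event, random_engine and create_events() are assumptions made for the example, not taken from the branch.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical event that owns its own random engine.
struct Event {
    unsigned number;
    std::mt19937_64 random_engine;
};

// Seed every event's engine from one top-level engine, in order of increasing
// event number. This happens sequentially, before any event is executed.
std::vector<Event> create_events(std::uint64_t core_seed, unsigned n_events) {
    std::mt19937_64 core_engine(core_seed);
    std::vector<Event> events;
    events.reserve(n_events);
    for(unsigned n = 1; n <= n_events; ++n) {
        events.push_back(Event{n, std::mt19937_64(core_engine())});
    }
    return events;
}
```

Because each event's seed depends only on the top-level seed and the event number, the values drawn inside an event are the same no matter which worker executes it or in which order the events finish.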

Messages

Modules communicate with each other via messages. Since events are independent of each other, messages from one module to another must be local to the event. Subscription to specific messages is done in the modules' constructors and is thus global, but this read-only delegate information is now shared with all event-local messengers. These are then responsible for dispatching and fetching messages within the event. Upon the termination of an event, all messages local to that event are deleted.

Geant4 and ROOT

Unfortunately, because of the dependencies on Geant4 and ROOT and their incorporation into event execution, Allpix² cannot simply spawn n workers and expect a new run-time of one nth of the original single-threaded run-time.

Geant4 is responsible for building the simulation geometry and depositing charges in the detectors. When working in a multi-threaded context, Geant4 expects to handle the parallel work internally; this clashes with the design of Allpix². Because of this, modules that run Geant4 code must not run on spawned threads, or the program will segfault. In Allpix², all Event instances are thus initialized by the master thread, which runs these modules before the event is submitted to the thread pool. To prevent workers from being idle, a buffer of initialized events is maintained by the module manager.

ROOT classes, specifically TObject and TRef, are used by all objects in Allpix² to relate to other objects. For legacy reasons, these classes do not scale well to a high number of running threads. It is therefore not currently recommended to use more than 8 workers during simulation, as this significantly degrades performance.

Removing the inheritance from ROOT for the Allpix² base object greatly increases performance, as can be seen in the benchmarks below.

Benchmarks

Below are benchmarks comparing the old multi-threading approach, the new multi-threading approach, the new approach without ROOT inheritance ("no root"), and the optimal run-time for an increasing number of workers/cores. The first point of each optimal plot is the same as the old approach's; subsequent points are calculated by dividing the single-threaded run-time by the current number of workers/cores.
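Written as a formula, with n the number of workers/cores and T_old(1) the measured run-time of the old approach at the first point (the symbol names are introduced here for illustration only), the optimal curve is:

```latex
T_{\text{optimal}}(n) = \frac{T_{\text{old}}(1)}{n}, \qquad n \geq 1
```

As a purely illustrative example with made-up numbers, a single-threaded run-time of 120 s would give an optimal run-time of 120 s / 8 = 15 s for 8 workers.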

The server on which these branches were benchmarked had 48 cores at its disposal, hence a maximum of 47 workers; the remaining core is used by the master thread.

The reason for the greater deviation of the "no root" plot from the optimal plot past 32 workers in the benchmarks above is presently unknown.

Future improvements

Investigate if ROOT penalty can be minimized

A thread was created on the ROOT forums regarding the scaling issues we had with ROOT. After a minimal example was created, mimicking the event construction and execution in Allpix² (while not doing any real work), a performance patch was written and merged into the master branch. Unfortunately, running some benchmarks with this patch eventually crashes Allpix² with a segmentation fault. Finding the cause and amending the minimal example to reproduce it should allow us to properly benchmark with a stable patch.

Remove ROOT inheritance

As can be seen in the benchmarks above, removing the inheritance from ROOT for the base Allpix² object allows us to reach an almost optimal run-time performance, effectively future-proofing the framework for systems with a high number of CPU cores. The final TObjects can then be constructed just before they are written to file.

Investigate and fix buffer implementation

The module manager maintains a buffer of initialized events (on which the Geant4 modules have already been run on the main thread), which are submitted to the thread pool in order so that no worker is left idle. This is done by checking the size of the thread pool's internal queue of events every 50 ms. To maintain acceptable memory usage throughout a simulation, the buffer should never hold more than 4 × the number of workers in events. However, asserting this predicate after each submission to the thread pool leads to crashes. A stricter check satisfies the predicate, but at the cost of some idle threads. Oddly, increasing the buffer size does not fix the issue.

Resolving this issue should allow the framework to have much better memory usage throughout any simulation, especially those where millions of events are simulated.

Make Geant4 modules able to run on spawned threads

A thread was created on the Geant4 mailing list regarding our problems with running Geant4 code on spawned threads. According to the replies, it may be possible to make Geant4 work as we want it to. If Geant4 modules can then run on spawned threads, we could get rid of the buffer the module manager maintains, improving the framework's memory usage and containing the full run of an event in its run()-method. It would also allow us to get rid of the master thread, which currently only initializes events.

Buffer file writes

At present, when an event reaches a writer module, its execution blocks until all previous events have been written to file. This leaves threads idle if previous events happen to take longer to simulate. If these writes are instead buffered (by deferring the execution of the writer modules for that specific event), threads should never end up idle, which should increase framework performance somewhat, at the cost of keeping more events in memory. A later event would then check whether any events have been buffered and, if so, write those previous events before writing itself to file, removing them from the buffer.
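One possible reading of this proposal is sketched below; it is not implemented in this branch. BufferedWriter and its members are hypothetical names, and a vector of strings stands in for the output file. An event that arrives out of order is parked in the buffer and its thread returns immediately; whichever event fills the gap flushes everything that has become contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <vector>

class BufferedWriter {
public:
    // Never waits for other events: out-of-order events are parked in the buffer.
    void write(unsigned event_number, const std::string& data) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffer_.emplace(event_number, data);

        // Flush every buffered event that is now contiguous with what is on file.
        auto it = buffer_.begin();
        while(it != buffer_.end() && it->first == next_) {
            file_.push_back(it->second);
            it = buffer_.erase(it);
            ++next_;
        }
    }

    std::vector<std::string> file() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return file_;
    }

    std::size_t buffered() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return buffer_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<unsigned, std::string> buffer_; // ordered by event number
    std::vector<std::string> file_;          // stand-in for the output file
    unsigned next_ = 1;                      // event numbers are assumed to start at 1
};
```

For example, if events 3 and 2 reach the writer before event 1, both are held in the buffer; when event 1 arrives, all three are written in increasing order and the buffer is emptied. The memory cost mentioned above corresponds to the size of this buffer.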

Edited Aug 19, 2018 by Simon Spannagel


WIP: GSoC 2018: Overhaul multi-threading approach: execute events in parallel