
Draft: Implement asynchronous I/O in HltEventLoopMgr

Rafal Bielski requested to merge rbielski/athena:hlt-evloop-async-io into master

Change the implementation of the main event loop in HLT from a single thread that alternates synchronously between input handling and output handling to independent asynchronous I/O threads.

The threading / synchronisation boilerplate code is put into a separate header file and is now shared by three threads in the HltEventLoopMgr: the timeout monitoring thread (almost no change to its implementation) and the new input and output handling threads. The main thread of the application only starts the three loop threads and sleeps until the end of the event loop.
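As an illustration of the kind of boilerplate that can be shared between such loop threads, here is a minimal sketch using only the C++ standard library. The class name `LoopThreadSketch` and its `wake()` / `stop()` methods are hypothetical and are not taken from the header added in this MR; the sketch only shows the general pattern of a thread that sleeps on a condition variable and runs a callback when woken.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not the class from this MR): owns a thread that
// runs `body` once per wake() call until stop() is requested.
class LoopThreadSketch {
public:
  explicit LoopThreadSketch(std::function<void()> body)
    : m_body(std::move(body)), m_thread([this] { run(); }) {}

  ~LoopThreadSketch() { stop(); }

  // Request one iteration of the loop body.
  void wake() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_pending = true;
    }
    m_cond.notify_one();
  }

  // Ask the loop to finish and wait for the thread to exit.
  // A wake-up that has not been served yet is dropped.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_stop = true;
    }
    m_cond.notify_one();
    if (m_thread.joinable()) m_thread.join();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock(m_mutex);
    while (true) {
      // Sleep until there is work to do or the loop should end.
      m_cond.wait(lock, [this] { return m_pending || m_stop; });
      if (m_stop) return;
      m_pending = false;
      lock.unlock();  // do not hold the lock while running the body
      m_body();
      lock.lock();
    }
  }

  std::function<void()> m_body;
  std::mutex m_mutex;
  std::condition_variable m_cond;
  bool m_pending{false};
  bool m_stop{false};
  std::thread m_thread;  // declared last so it starts after the other members
};
```

With a helper of this shape, the main thread would only construct one object per loop (timeout, input, output) and then block until the event loop ends.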

As before this MR, the input and output parts of the HltEventLoopMgr (previously on the main thread, now in separate threads) still do very little work themselves. They only define, prepare and schedule a TBB task, which is then executed by TBB worker threads. This keeps the CPU cost of the HltEventLoopMgr threads near zero, with all the work done by the pool of TBB worker threads, which is more natural and predictable in performance profiling. The only functional change in this MR is that the TBB tasks for I/O handling are now launched asynchronously and don't block each other. Previously, we could not fill a free Gaudi Scheduler slot while waiting for a finished event, and similarly could not pop a finished event from the Gaudi Scheduler while waiting for a new event from the input source.
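The decoupling described above can be sketched with standard-library primitives. This is a simplified model under stated assumptions, not the code from this MR: `std::async` stands in for TBB tasks, a plain counter of free slots stands in for the Gaudi Scheduler, and the function name `runAsyncLoopSketch` is hypothetical. The point is only that the input loop waits for a free slot and the output loop waits for a finished event, each on its own thread, so neither wait blocks the other.

```cpp
#include <condition_variable>
#include <cstddef>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical model: processes nEvents using at most nSlots in flight
// and returns the event numbers in the order they were popped.
std::vector<int> runAsyncLoopSketch(int nEvents, std::size_t nSlots) {
  std::mutex mutex;
  std::condition_variable slotFreed, eventFinished;
  std::size_t freeSlots = nSlots;   // stand-in for free scheduler slots
  std::queue<int> finished;         // stand-in for finished events
  std::vector<int> output;
  std::vector<std::future<void>> workers;

  // Input loop: blocks only while there is no free slot.
  std::thread inputThread([&] {
    for (int event = 0; event < nEvents; ++event) {
      {
        std::unique_lock<std::mutex> lock(mutex);
        slotFreed.wait(lock, [&] { return freeSlots > 0; });
        --freeSlots;
      }
      // The "processing" runs on a worker, not on the input thread.
      workers.push_back(std::async(std::launch::async, [&, event] {
        {
          std::lock_guard<std::mutex> lock(mutex);
          finished.push(event);
        }
        eventFinished.notify_one();
      }));
    }
  });

  // Output loop: blocks only while there is no finished event.
  std::thread outputThread([&] {
    for (int n = 0; n < nEvents; ++n) {
      std::unique_lock<std::mutex> lock(mutex);
      eventFinished.wait(lock, [&] { return !finished.empty(); });
      output.push_back(finished.front());
      finished.pop();
      ++freeSlots;  // the slot can be refilled by the input loop
      lock.unlock();
      slotFreed.notify_one();
    }
  });

  inputThread.join();
  outputThread.join();
  for (auto& worker : workers) worker.get();  // ensure all workers are done
  return output;
}
```

For example, `runAsyncLoopSketch(5, 2)` returns all five event numbers, in an order that depends on thread scheduling.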

The new implementation required a fairly significant change in error handling. Since input and output handling are now asynchronous, "draining all Scheduler slots" mid-run and then continuing to fill them again is no longer straightforward. Implementing such behaviour would require the output thread to "pause" the input thread and then, after clearing all slots, "resume" it. Based on the experience of 2022 P1 operations, I can say the drain-all-slots procedure wasn't really needed in the first place. If things go wrong on the framework side, it is fine to exit the event loop with a failure and restart the entire process from scratch. This will now happen for a wider range of errors (most of which have never been encountered in practice) which previously attempted the "drain all slots and continue" approach.
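The "exit with a failure instead of draining and continuing" policy can be sketched as a shared state that either loop thread may flip. All names here (`LoopStateSketch`, `runLoopsSketch`, the `failAtOutputIteration` parameter) are hypothetical and only illustrate the idea; the real error paths in HltEventLoopMgr are not reproduced.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared state: any loop thread can end the whole event loop.
struct LoopStateSketch {
  std::atomic<bool> running{true};
  std::atomic<bool> failed{false};
  void fail() { failed = true; running = false; }  // framework error
  void finish() { running = false; }               // normal end of loop
};

// Returns true on a clean end of the loop, false if a failure stopped it.
// A negative failAtOutputIteration means no error is injected.
bool runLoopsSketch(int failAtOutputIteration) {
  LoopStateSketch state;

  // Input loop: keeps going until someone stops the event loop.
  std::thread inputThread([&] {
    while (state.running) std::this_thread::yield();
  });

  // Output loop: on error it does not pause or resume the input loop,
  // it simply marks the whole event loop as failed.
  std::thread outputThread([&] {
    for (int i = 0; state.running; ++i) {
      if (i == failAtOutputIteration) state.fail();
      else if (i >= 10) state.finish();
    }
  });

  inputThread.join();
  outputThread.join();
  return !state.failed;
}
```

The caller would treat a `false` result as the signal to restart the process from scratch, which avoids any pause/resume handshake between the two threads.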

Jira: ATR-26285

FYI @fwinkl, @wiedenma, @mark

Simplified call diagrams (skipping the start/end of loop conditions):
before this MR:

after this MR:

Edited by Rafal Bielski
