
Ideas towards asynchronous IO (was ACTSFW-75)

Original author Hadrien Benjamin Grasland @hgraslan.

This is the kind of topic that is brought up from time to time, the latest iteration being the comments of ACTSFW-61. So even if it is not a high-priority feature, I think it has come to deserve its own JIRA ticket.

Currently, we run blocking IO operations directly on event processing threads. This means that any IO pause propagates into the event loop and slows everything down, making our multi-threaded scalability look worse than it actually is.

This is made worse by the fact that modern IO systems try to amortize access latencies and optimize data compression by submitting big batches of IO work at once. As a result, most of the time submitting an IO request merely buffers it and returns almost immediately, but from time to time it triggers a batch and takes a long while. This leads to full CPU stalls (all event loop threads blocked) and confusing behaviour (scalability issues that only appear for sufficiently large numbers of events).

Disabling IO in performance tests is cumbersome, and not always a sensible option. For readout, it forces us to randomly generate our input data, which may have intrinsic CPU overhead and give us less realistic input. For both writes and reads, it makes us ignore persistence overhead, and thus potentially focus on the wrong thing (there is little point in optimizing the computations of an IO-bound program).

A better option would be to decouple IO work from the event loop so that temporary IO pauses do not usually result in computations being slowed down. This is where asynchronous IO comes in.

At the heart of avoiding IO pauses lies the concept of buffering. Input code should try to get a bit ahead of what is requested by the event loop, so that whenever it blocks and stops producing data, the event loop still has "old" data to crunch. Conversely, output code should be able to accumulate a couple of event loop requests while a write is occurring, instead of blocking everything.

A minimal implementation would be to have each IO entity (Reader or Writer) live in its own thread and only interact with the event loop via a bounded queue. Readers proactively fetch data from their IO resources and push it into their associated event loop input queue until that queue is full; the event loop, for its part, pushes data into its output queues (which feed the Writers) as soon as possible, as long as those aren't full.

Making the queues bounded provides rate limiting, which ensures that RAM usage won't explode if one component is way faster than another.
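
To make this concrete, here is a minimal C++ sketch of such a bounded queue; the `BoundedQueue` name and its interface are invented for this example, not taken from existing framework code. `push()` blocks while the queue is full (throttling a producer that runs ahead of its consumer) and `pop()` blocks while it is empty (stalling a consumer only once the buffer has genuinely run dry).

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical bounded blocking queue. The capacity bound is what
// provides the rate limiting described above.
template <typename T>
class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t capacity) : m_capacity(capacity) {}

  void push(T value) {
    std::unique_lock<std::mutex> lock(m_mutex);
    // Throttle a producer that has run too far ahead of its consumer
    m_notFull.wait(lock, [this] { return m_queue.size() < m_capacity; });
    m_queue.push(std::move(value));
    m_notEmpty.notify_one();
  }

  T pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    // Only blocks once the buffer has actually run dry
    m_notEmpty.wait(lock, [this] { return !m_queue.empty(); });
    T value = std::move(m_queue.front());
    m_queue.pop();
    m_notFull.notify_one();
    return value;
  }

private:
  std::size_t m_capacity;
  std::queue<T> m_queue;
  std::mutex m_mutex;
  std::condition_variable m_notFull;
  std::condition_variable m_notEmpty;
};
```

A Reader thread would then spend its life calling `push()` on the event loop's input queue, while Writer threads call `pop()` on their output queues; both the buffering and the rate limiting fall out of the queue semantics.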

Questions that must still be answered in order to implement this concept:

  • Data storage abstraction and synchronization. So far, we have gotten away with a single unified thread-private whiteboard. With dedicated IO threads, we need to either synchronize whiteboard accesses between Readers or add new storage abstractions such as reader-private queues (which could be hidden behind the whiteboard abstraction).
  • Data lifetime management. In this design, the event store whiteboard (or an extended form thereof) must conceptually be created before the associated "computational" event loop iteration begins (so that Readers have a place to prefetch data into), and continue to live after the end of the event loop iteration (so that Writers can access the output).
  • Error handling. When something goes wrong in a Reader or Writer, the event loop must have a way to learn about it, so that it aborts reasonably promptly instead of sleeping forever on an empty input queue or a full output queue (see the sketch after this list).
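
As a sketch of how the last two points might be handled together (all names here are hypothetical, not an existing framework API), the bounded queue above can be extended with a close() operation: a failing or finished Reader/Writer closes its queue, which wakes up every thread blocked in push() or pop() and makes further operations report failure instead of blocking forever. And if the queued elements are `std::shared_ptr<EventStore>` handles, one store per event, the lifetime question largely takes care of itself: the store is created by the Reader before the computational iteration begins, is kept alive while the event loop holds a reference, and is destroyed only after the last Writer drops its copy.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>

// Hypothetical extension of the bounded queue: close() wakes up all
// blocked threads, so a failing Reader or Writer cannot leave the event
// loop sleeping forever on an empty input queue or a full output queue.
template <typename T>
class ClosableBoundedQueue {
public:
  explicit ClosableBoundedQueue(std::size_t capacity)
      : m_capacity(capacity) {}

  // Called by a failing (or simply finished) producer/consumer.
  void close() {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_closed = true;
    m_notFull.notify_all();
    m_notEmpty.notify_all();
  }

  // Returns false if the queue was closed instead of accepting the value.
  bool push(T value) {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_notFull.wait(lock,
                   [this] { return m_closed || m_queue.size() < m_capacity; });
    if (m_closed) return false;
    m_queue.push(std::move(value));
    m_notEmpty.notify_one();
    return true;
  }

  // Returns std::nullopt once the queue has been closed and drained.
  std::optional<T> pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_notEmpty.wait(lock, [this] { return m_closed || !m_queue.empty(); });
    if (m_queue.empty()) return std::nullopt;  // closed and drained
    T value = std::move(m_queue.front());
    m_queue.pop();
    m_notFull.notify_one();
    return std::move(value);
  }

private:
  std::size_t m_capacity;
  bool m_closed = false;
  std::queue<T> m_queue;
  std::mutex m_mutex;
  std::condition_variable m_notFull;
  std::condition_variable m_notEmpty;
};
```

Whether the closed state should additionally carry an error code or exception pointer, so that the event loop can report why it aborted, is left open here.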