Optimise Monitoring (!1483) · Merge requests · LHCb / Allen

This MR changes the way we do monitoring at several levels:

Use a single algorithm instance shared between all streams. This required very little change, avoids duplicating accumulators, and gets us closer to Gaudi.
Do inter-stream aggregation on the GPU, through atomic operations. This reduces the amount of data that needs to be copied back to the host, and the computation that has to be done on the CPU to merge accumulators.
Accumulators are kept on the device across multiple sequence runs and are reset periodically (once every second), using double-buffering so that streams are not stalled. All accumulators can be copied and reset with a single memcpy/memset, reducing pressure on the work queues.
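
The double-buffering scheme described above can be modelled on the host as follows. This is only a minimal sketch and not the code in this MR: the class `DoubleBufferedAccumulators` and its methods are hypothetical names, `std::atomic` stands in for device atomics, and the per-element copy-and-clear loop stands in for the single memcpy/memset on the device.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: two accumulator buffers, one active at a time.
class DoubleBufferedAccumulators {
public:
  explicit DoubleBufferedAccumulators(std::size_t n) : m_buf0(n), m_buf1(n) {}

  // Called concurrently by every stream: aggregate directly into the shared
  // active buffer with an atomic add, so no per-stream copy has to be merged.
  void add(std::size_t index, unsigned value)
  {
    assert(index < m_buf0.size());
    buffer(m_active.load(std::memory_order_acquire))[index].fetch_add(value, std::memory_order_relaxed);
  }

  // Called periodically (for example once per second): flip the active buffer
  // so that writers continue without waiting, then copy out and clear the
  // retired buffer. A late write that still lands in the retired buffer is
  // not lost; it is reported the next time that buffer is collected.
  std::vector<unsigned> swap_and_collect()
  {
    const int retired = m_active.fetch_xor(1, std::memory_order_acq_rel);
    auto& buf = buffer(retired);
    std::vector<unsigned> snapshot(buf.size());
    for (std::size_t i = 0; i < buf.size(); ++i) {
      snapshot[i] = buf[i].exchange(0u, std::memory_order_relaxed);
    }
    return snapshot;
  }

private:
  std::vector<std::atomic<unsigned>>& buffer(int which) { return which == 0 ? m_buf0 : m_buf1; }

  std::vector<std::atomic<unsigned>> m_buf0;
  std::vector<std::atomic<unsigned>> m_buf1;
  std::atomic<int> m_active {0};
};
```

In this model, writers never wait for the collection step: they only ever touch whichever buffer is currently active, while the periodic reader works on the other one.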

Gaudi-like user interface for histograms and counters, to reduce boilerplate.
Device interface to histograms and counters, to avoid user mistakes and hide device-specific optimisations.
Opportunistic warp aggregation, to reduce the number of atomics in global memory.
LogHistograms support (with gaudi/Gaudi!1564 (merged)): variable bin-size histograms are displayed correctly in ROOT, and a mapping function is used instead of a support array + binary search.
Most of the monitoring now also runs in STANDALONE mode.
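
To illustrate the opportunistic warp aggregation item above, here is a CPU-side model of the idea, not the device code from this MR. It assumes a warp of 32 lanes and a hypothetical function name `warp_aggregated_fill`; the inner loop plays the role that a warp match/vote intrinsic would play on the device (for instance `__match_any_sync` in CUDA). Lanes that happen to target the same bin elect the lowest lane as leader, and only the leader issues one atomic add for the whole group.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t warp_size = 32; // assumption: 32 lanes per warp

// Fills `histogram` with one entry per lane and returns the number of atomic
// operations that were issued (at most warp_size, fewer when lanes coincide).
std::size_t warp_aggregated_fill(
  const std::array<std::size_t, warp_size>& lane_bins,
  std::vector<std::atomic<unsigned>>& histogram)
{
  std::size_t n_atomics = 0;
  for (std::size_t lane = 0; lane < warp_size; ++lane) {
    const std::size_t bin = lane_bins[lane];
    assert(bin < histogram.size());

    // Find the peers of this lane: all lanes targeting the same bin.
    unsigned count = 0;
    std::size_t leader = warp_size;
    for (std::size_t other = 0; other < warp_size; ++other) {
      if (lane_bins[other] == bin) {
        if (leader == warp_size) leader = other; // lowest matching lane leads
        ++count;
      }
    }

    // Only the leader touches memory, adding the whole group's contribution.
    if (lane == leader) {
      histogram[bin].fetch_add(count, std::memory_order_relaxed);
      ++n_atomics;
    }
  }
  return n_atomics;
}
```

The aggregation is opportunistic in the sense that it only helps when lanes coincide: if all 32 lanes hit different bins, the model still issues 32 atomics.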

Counters that were incremented multiple times (all threads adding the same value to the same counter)
Unsafe (non-atomic) incrementing of counters on the GPU
Copy-paste typos in histogram range checks
GPU counters not printed at the end of the application

Soft dependency on gaudi/Gaudi!1564 (merged): this MR can be merged independently, but the Gaudi MR is needed for LogHistograms to be displayed properly.
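
For reference, the kind of mapping function that can replace a support array + binary search for logarithmic binning might look like the sketch below. The name `log_bin`, its signature and the convention of returning -1 for out-of-range values are assumptions made for illustration; they are not taken from this MR or from Gaudi. The sketch assumes bins whose edges are equally spaced in log(x) between `x_min` and `x_max`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: maps x to a bin index in [0, n_bins) for bins that are
// equally spaced in log(x) over [x_min, x_max). Returns -1 when x is out of
// range. The index is computed directly, with no edge array to search.
long log_bin(double x, double x_min, double x_max, std::size_t n_bins)
{
  assert(x_min > 0.0 && x_max > x_min && n_bins > 0);
  if (!(x >= x_min && x < x_max)) return -1; // also rejects NaN

  // Fraction of the logarithmic range covered by x, in [0, 1).
  const double t = std::log(x / x_min) / std::log(x_max / x_min);
  const std::size_t bin = static_cast<std::size_t>(t * static_cast<double>(n_bins));

  // Guard against rounding pushing a value just below x_max into bin n_bins.
  return static_cast<long>(bin < n_bins ? bin : n_bins - 1);
}
```

With `x_min = 1`, `x_max = 1000` and three bins, each bin covers one decade, so values between 10 and 100 fall in the middle bin.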
