Optimise Monitoring
This MR changes the way we do monitoring at many levels:
Framework changes:
- Use a single algorithm instance shared between all streams. Very little change turned out to be necessary; it avoids duplicating accumulators and brings us closer to Gaudi.
- Do inter-stream aggregation on the GPU through atomic operations. This reduces the amount of data that needs to be copied back to the host and the computation needed on the CPU to merge accumulators.
- Accumulators are kept on the device across multiple sequence runs and reset periodically (once every second), using double-buffering so that streams do not stall. All accumulators can be copied and reset with a single memcpy/memset, reducing pressure on the work queues.
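To illustrate the double-buffering idea, here is a minimal host-side sketch in plain C++. It is not the code in this MR: `DoubleBufferedCounters`, `add` and `flush` are hypothetical names, `std::atomic` stands in for device atomics, and the per-element `exchange(0)` stands in for the single memcpy/memset done on the device. Synchronisation with writers that are still in flight on the old buffer is deliberately left out.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; names and layout are assumptions.
class DoubleBufferedCounters {
  using Buffer = std::vector<std::atomic<std::uint64_t>>;

public:
  explicit DoubleBufferedCounters(std::size_t n) : m_buffers{Buffer(n), Buffer(n)} {}

  // Called by any stream: atomically accumulate into the active buffer.
  void add(std::size_t i, std::uint64_t v) {
    auto& buf = m_buffers[m_active.load()];
    assert(i < buf.size());
    buf[i].fetch_add(v, std::memory_order_relaxed);
  }

  // Called periodically: flip the active buffer so writers are not stalled,
  // then copy out and zero the buffer that was being filled.
  // A real implementation must also wait for writers that picked up the old
  // buffer just before the flip; that step is omitted here.
  std::vector<std::uint64_t> flush() {
    const int filled = m_active.fetch_xor(1);
    std::vector<std::uint64_t> snapshot;
    snapshot.reserve(m_buffers[filled].size());
    for (auto& c : m_buffers[filled]) snapshot.push_back(c.exchange(0));
    return snapshot;
  }

private:
  std::array<Buffer, 2> m_buffers;
  std::atomic<int> m_active{0};
};
```

Because all streams add into the same buffer, the snapshot returned by `flush` is already the merged result, so no extra per-stream merge is needed on the host.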
Features:
- Gaudi-like user interface for histograms and counters. Reduces boilerplate.
- Device interface to histograms and counters. Avoids user mistakes and hides device-specific optimisations.
- Opportunistic warp aggregation, to reduce the number of atomics in global memory.
- LogHistogram support (with gaudi/Gaudi!1564 (merged)). Correctly displays histograms with variable bin sizes in ROOT. Uses a mapping function instead of a support array + binary search.
- Most of the monitoring now also runs in STANDALONE mode.
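As a rough illustration of the "mapping function instead of a support array + binary search" point, the sketch below computes a logarithmic bin index in closed form and compares it with a lookup in an array of bin edges. The function names, the signature and the underflow/overflow convention (`-1` and `nbins`) are assumptions made for this example, not the interface introduced by this MR.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Closed-form mapping for nbins logarithmically spaced bins on [xmin, xmax).
// Assumes 0 < xmin < xmax. Returns -1 for underflow and nbins for overflow.
inline int logBin(double x, double xmin, double xmax, int nbins) {
  if (!(x >= xmin)) return -1;  // also rejects NaN and non-positive values
  if (x >= xmax) return nbins;
  const double scale = nbins / std::log(xmax / xmin);
  const int bin = static_cast<int>(std::log(x / xmin) * scale);
  return std::min(bin, nbins - 1);  // guard against rounding at the top edge
}

// Reference version: binary search in an explicit array of bin edges
// (edges.size() == nbins + 1), same underflow/overflow convention.
inline int searchBin(double x, const std::vector<double>& edges) {
  if (!(x >= edges.front())) return -1;
  const auto it = std::upper_bound(edges.begin(), edges.end(), x);
  return static_cast<int>(it - edges.begin()) - 1;
}
```

The closed form needs no edge array in memory and no data-dependent branching, which is presumably why it suits device code; values that fall exactly on a bin edge may differ between the two versions because of floating-point rounding.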
Bug Fixes:
- Counters that were incremented multiple times (all threads adding the same value to the same counter)
- Unsafe (non-atomic) increments of counters on the GPU
- Copy-paste typos in histogram range checks
- GPU counters not printed at the end of the application
FYI @sponce @gligorov @cagapopo @rmatev @raaij @dovombru @kaaricha
Closes #424 (closed)
Soft dependency on gaudi/Gaudi!1564 (merged): this MR can be merged independently, but the Gaudi MR is needed for LogHistograms to be displayed properly.
Edited by Arthur Marius Hennequin