Measure the performance of calibration scans
A calibration scan consists of 3 parts:
- configure the FEs
- readout the calibration data (i.e. produce the `FrontEndData`)
- plot and analyze it
Ideally:
- the readout part of the calibration (the `HWController` push to `DataProcessor`, which pushes the `FrontEndData` to the calibration analysis) runs as fast as the triggering (e.g. 500 triggers / 10 kHz = 0.05 s)
- and the triggering is done at the HW-limit frequency for the full-occupancy data packets
Then we can push the readout to its HW limit, i.e. the HW limit of the triggering, and the overall calibration will be limited by the analysis part.
Factors:
- Felixcore or felix-star & rdma? (matters only if the network is indeed a bottleneck now)
- Number of triggers per iteration (should just scale, we do not send too many triggers for the calibrations)
- Trigger frequency (what is the HW limit for full occupancy packets? Do we need FW triggers to reach it?)
- Run from 1 YARR & from many – in case the network & the HW handler are a bottleneck (and to speed up the plotting)
- Run with and without the analysis, only saving the data to the disk
- Also, try to save the output data to a memory-mounted disk.
- Then, at some point, we could also try the hit counters.
So, we need to start from our standard setup: Felixcore, 500 SW triggers at 10 kHz, no memory-mounted disk, with all of the analysis, and run from 1 YARR. Look at the profile and identify the current bottleneck. Then, according to what's needed, try multiple YARRs, try a memory-mounted disk, etc. If everything is perfect, we will just increase the trigger frequency.
Together with the profile, we need to measure these times:
- The time to configure – Yarr already measures that, right?
- HWController’s handler (push) – this one must be within the triggering limit, otherwise it is the bottleneck
- StdDataLoop iteration & time between iterations
- within it: DataProcessors parsing (pop)
- Also: analysis time & time to save to the disk
And more metrics:
- cache hits
- CPU occupancy?
- memory occupancy?
Commands to use
perf
flamegraph profile
sudo perf record -F 99 -g -- <scanConsole command>
# or with /bin/time
sudo perf record -F 99 -g -- /bin/time <scanConsole command>
# it may work without sudo!
# I am not sure if it will save all stack frames then (the ones from inside the Linux kernel too?)
# produces a perf.data file
# -F sets the sampling frequency to 99 samples per second -- increase it if more statistics are needed
# the flamegraph from the perf.data file:
sudo perf script | stackcollapse-perf.pl > out.perf-folded
cat out.perf-folded | flamegraph.pl > perf-kernel.svg
It needs the `stackcollapse-perf.pl` and `flamegraph.pl` Perl scripts from Brendan Gregg's FlameGraph repository.
Cache hits, CPU, memory
The cache hits, CPU usage, etc. cannot be obtained from perf simultaneously with the `perf record` call-stack profile, so they will have to be measured in separate runs:
# CPU and cache hits:
perf stat <scanConsole command>
# memory usage:
# TODO
Additional time counters inside YARR
For the HWController, `NetioHandler` passes a lambda as the handler. Its scope will not allow for a time counter. Make a dedicated `NetioHandler` method for the handler?
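One possible shape for such a dedicated handler method (a sketch only: `Message`, `processData`, and the member names are assumptions for illustration, not the actual netio/YARR interface):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real netio message type.
struct Message { std::vector<uint8_t> data; };

class NetioHandler {
public:
    // Dedicated handler method: unlike the lambda, its class scope
    // lets us accumulate a time counter across calls.
    void handle(const Message& msg) {
        const auto start = Clock::now();
        processData(msg);  // whatever the current lambda body does
        m_handlerTime += std::chrono::duration_cast<std::chrono::microseconds>(
            Clock::now() - start);
    }

    std::chrono::microseconds handlerTime() const { return m_handlerTime; }

private:
    using Clock = std::chrono::steady_clock;
    void processData(const Message&) { /* push to DataProcessor ... */ }
    std::chrono::microseconds m_handlerTime{0};
};
```

The lambda registered with netio can then simply forward: `[this](const Message& m){ handle(m); }`.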
In `FelixRxCore::on_data`, it's straightforward.
In `StdDataLoop`, we need the times `exec2 - exec1` and `exec1 - exec2` for the iteration time and the time between iterations:
// src/libYarr/include/StdDataLoop.h
+#include <chrono>
+using Clock = std::chrono::steady_clock;
class StdDataLoop: public LoopActionBase, public StdDataAction {
...
+
+ // additional timings for calibrations performance
+ std::chrono::time_point<Clock> exec1_time;
+ std::chrono::time_point<Clock> exec2_time;
+ std::chrono::microseconds time_of_iteration{0}; // brace-initialize with 0 (parentheses are not valid for in-class member initialization)
+ std::chrono::microseconds time_between_iterations{0};
+ bool started_iterations = false;
+
+ ~StdDataLoop(); // declared here so the reporting destructor can be defined in the .cpp
};
// src/libYarr/StdDataLoop.cpp
+StdDataLoop::~StdDataLoop() {
+ SPDLOG_LOGGER_INFO(sdllog, "Time of iterations {} [us]", time_of_iteration.count());
+ SPDLOG_LOGGER_INFO(sdllog, "Time between iterations {} [us]", time_between_iterations.count());
+}
void StdDataLoop::execPart1() {
+ exec1_time = Clock::now();
+ if (started_iterations) {
+ time_between_iterations +=
+ std::chrono::duration_cast<std::chrono::microseconds>(exec1_time - exec2_time);
+ }
+ else started_iterations = true;
+
...
}
void StdDataLoop::execPart2() {
...
+
+ exec2_time = Clock::now();
+ time_of_iteration +=
+ std::chrono::duration_cast<std::chrono::microseconds>(exec2_time - exec1_time);
}
And in the processing: add up the time inside the `while` loop of `StarDataProcessor::process_core`.