Measure the performance of calibration scans
A calibration scan consists of 3 parts:
- configure the FEs
- readout the calibration data (i.e. produce the `FrontEndData`)
- plot and analyze it
Ideally:
- the readout part of the calibration (the `HWController` push to `DataProcessor`, which pushes the `FrontEndData` to the calibration analysis) runs as fast as the triggering (e.g. 500 triggers / 10 kHz = 0.05 s)
- and the triggering is done at the HW-limit frequency for the full-occupancy data packets
Then we can push the readout to its HW limit, i.e. the HW limit of the triggering, and the overall calibration will be limited by the analysis part.
Factors:
- Felixcore or felix-star & rdma? (matters only if the network is indeed a bottleneck now)
- Number of triggers per iteration (should just scale, we do not send too many triggers for the calibrations)
- Trigger frequency (what is the HW limit for full occupancy packets? Do we need FW triggers to reach it?)
- Run from 1 YARR & from many – in case the network & the HW handler are a bottleneck (and to speed up the plotting)
- Run with and without the analysis, only saving the data to the disk
- Also, try to save the output data to a memory-mounted disk.
- Then, at some point, we could also try the hit counters.
So, we need to start from our standard setup: Felixcore, 500 SW triggers at 10 kHz, no memory-mounted disk, with all of the analysis, and run from 1 YARR. Look at the profile and identify the current bottleneck. Then, according to what's needed, try multiple YARRs, try a memory-mounted disk, etc. If everything is perfect, we will just increase the trigger frequency.
Together with the profile, we need to measure these times:
- The time to configure – Yarr already measures that, right?
- HWController’s handler (push) – this one must be within the triggering limit, otherwise it is the bottleneck
- StdDataLoop iteration & time between iterations
- within it: DataProcessors parsing (pop)
- Also: analysis time & time to save to the disk
And more metrics:
- cache hits
- CPU occupancy?
- memory occupancy?
Commands to use
perf
flamegraph profile
sudo perf record -F 99 -g -- <scanConsole command>
# or with /bin/time
sudo perf record -F 99 -g -- /bin/time <scanConsole command>
# it may work without sudo!
# I am not sure if it will save all stack frames then (the ones from inside the Linux kernel too?)
# produces a perf.data file
# -F sets the sampling frequency to 99 samples per second -- increase it if more statistics are needed
# the flamegraph from the perf.data file:
sudo perf script | stackcollapse-perf.pl > out.perf-folded
cat out.perf-folded | flamegraph.pl > perf-kernel.svg
It needs the `stackcollapse-perf.pl` and `flamegraph.pl` Perl scripts from Brendan Gregg's FlameGraph repository.
Cache hits, CPU, memory
The cache hits, CPU usage, etc. cannot be obtained from perf simultaneously with the `perf record` call-stack profile, so they will have to be measured in separate runs:
# CPU and cache hits:
perf stat <scanConsole command>
# memory usage:
# TODO
Additional time counters inside YARR
For the HWController, `NetioHandler` passes a lambda as the handler. Its scope will not allow for a time counter. Make a dedicated `NetioHandler` method for the handler?
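One possible shape for such a dedicated handler method (a sketch only: `Message`, `processData`, and the member names are assumptions for illustration, not the actual netio/YARR interface):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real netio message type.
struct Message { std::vector<uint8_t> data; };

class NetioHandler {
public:
    // Dedicated handler method: unlike the lambda, its class scope
    // lets us accumulate a time counter across calls.
    void handle(const Message& msg) {
        const auto start = Clock::now();
        processData(msg);  // whatever the current lambda body does
        m_handlerTime += std::chrono::duration_cast<std::chrono::microseconds>(
            Clock::now() - start);
    }

    std::chrono::microseconds handlerTime() const { return m_handlerTime; }

private:
    using Clock = std::chrono::steady_clock;
    void processData(const Message&) { /* push to DataProcessor ... */ }
    std::chrono::microseconds m_handlerTime{0};
};
```

The lambda registered with netio can then simply forward: `[this](const Message& m){ handle(m); }`.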
In `FelixRxCore::on_data`, it's straightforward.
In `StdDataLoop`, we need the times `exec2 - exec1` and `exec1 - exec2` for the iteration time and the time between iterations:
// src/libYarr/include/StdDataLoop.h
+#include <chrono>
+using Clock = std::chrono::steady_clock;
class StdDataLoop: public LoopActionBase, public StdDataAction {
...
+
+ // additional timings for calibrations performance
+ std::chrono::time_point<Clock> exec1_time;
+ std::chrono::time_point<Clock> exec2_time;
+ std::chrono::microseconds time_of_iteration{0}; // brace-initialize with 0 (parentheses are not valid for in-class member initialization)
+ std::chrono::microseconds time_between_iterations{0};
+ bool started_iterations = false;
+
+ ~StdDataLoop(); // declared here so the reporting destructor can be defined in the .cpp
};
// src/libYarr/StdDataLoop.cpp
+StdDataLoop::~StdDataLoop() {
+ SPDLOG_LOGGER_INFO(sdllog, "Time of iterations {} [us]", time_of_iteration.count());
+ SPDLOG_LOGGER_INFO(sdllog, "Time between iterations {} [us]", time_between_iterations.count());
+}
void StdDataLoop::execPart1() {
+ exec1_time = Clock::now();
+ if (started_iterations) {
+ time_between_iterations +=
+ std::chrono::duration_cast<std::chrono::microseconds>(exec1_time - exec2_time);
+ }
+ else started_iterations = true;
+
...
}
void StdDataLoop::execPart2() {
...
+
+ exec2_time = Clock::now();
+ time_of_iteration +=
+ std::chrono::duration_cast<std::chrono::microseconds>(exec2_time - exec1_time);
}
And in the processing: add up the time inside the `while` loop of `StarDataProcessor::process_core`.