YARR issues
https://gitlab.cern.ch/YARR/YARR/-/issues
2024-03-29T01:09:30+01:00

https://gitlab.cern.ch/YARR/YARR/-/issues/244
Fix buffering of command packets via netio-watermark option instead of aggregation from YARR
2024-03-29T01:09:30+01:00
Angira Rastogi
Needs to be understood...
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/236
Make the sleep time in isCmdEmpty() of FelixTxCore configurable
2024-03-29T01:07:30+01:00
Angira Rastogi
For FELIX test setups which do not have other hidden latencies in the DAQ chain, should we make the `sleep time` in isCmdEmpty() [here](https://gitlab.cern.ch/YARR/YARR/-/blob/devel/src/libFelixClient/FelixTxCore.cpp#L186) a configurable parameter from the controller config? Analogous to the flushWaitTime parameter [here](https://gitlab.cern.ch/YARR/YARR/-/blob/devel/configs/controller/felix_client_pixels.json#L29) for FelixRxCore.
This would enable faster configuration for multi-chip or multi-module tests until we can remove the manual wait time for flushing the buffers entirely. That becomes possible once the flush method of send_data() in felix-star is fully blocking, with a user callback to check the status.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/243
Blocking flush method with a user callback to remove ad hoc sleep time in isCmdEmpty() of FelixTxCore
2024-03-29T01:05:47+01:00
Angira Rastogi
Need to be followed up with TDAQ...
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/242
Install Mellanox network card on the host PC and test ethernet interface for YARR scans
2024-03-29T01:02:06+01:00
Angira Rastogi
Aim to test running two FELIX cards installed on the same host PC, where one of them runs with the ethernet interface while the other one uses a Mellanox card for RDMA.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/241
Run YARR calibration scans with felix-star on ALMA9 machine
2024-03-29T00:58:22+01:00
Angira Rastogi
To test:
1) Running felix-star server on centos7 (SW: 5.0.2, driver: 4.15, FW: current release) + YARR scans on ALMA9 => via different host PCs
2) Running felix-star server on ALMA9 (SW: 5.0.3, driver: 4.15, FW: current release) + YARR scans on ALMA9 => on the same PC
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/240
FW-trigger generation doesn't work from any downlinks except 0
2024-03-29T00:51:45+01:00
Angira Rastogi
See - [FLXUSERS-682](https://its.cern.ch/jira/browse/FLXUSERS-682)
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/239
Figure out the lane mapping for down elinks from different DPs for zaza board v0 when running with felix-star
2024-03-29T00:47:39+01:00
Angira Rastogi
Not able to configure the FE chip while running YARR calibration scans from display ports other than 2 (downlink=0) and 3 (downlink=2).
Lane mapping provided from Zaza seems inconsistent with enabled links for felix-toflx process.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/238
More flexible trigger pattern generator
2024-03-28T11:17:29+01:00
Bruce Joseph Gallop
NB I think this is a FrontEnd specific thing, but there might be some restricted list of common things that can be done.
This does also imply that the current 32 words in TxCore are not sufficient.
The most common patterns in ITSDAQ are things like:
* cal pulse + delay + trigger (already present)
* BCR + delay + cal pulse + delay + trigger
* cal pulse + delay + trigger + delay + trigger (ie the calibration pulse is timed for one trigger and not the other)
* n * (trigger + delay) (n from 2 to at least 8, but maybe larger)
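
Each of the fixed patterns above can be written as a sequence of (command, delay) steps. A minimal sketch of the `n * (trigger + delay)` case, assuming a hypothetical `Step` type and hypothetical helpers `repeatedTriggers` and `patternLength` (none of these exist in YARR or ITSDAQ, and the assumption that one command occupies one bunch crossing is for illustration only):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical pattern step: a named command followed by idle bunch crossings.
struct Step {
    std::string command;  // illustrative label only, e.g. "trigger", "cal", "BCR"
    uint32_t delayBC;     // idle bunch crossings after the command
};

// Build n * (trigger + delay).
std::vector<Step> repeatedTriggers(unsigned n, uint32_t delayBC) {
    std::vector<Step> pattern;
    pattern.reserve(n);
    for (unsigned i = 0; i < n; ++i) {
        pattern.push_back({"trigger", delayBC});
    }
    return pattern;
}

// Total pattern length in bunch crossings, assuming (for illustration)
// that every command occupies exactly one bunch crossing.
uint64_t patternLength(const std::vector<Step>& pattern) {
    uint64_t total = 0;
    for (const Step& s : pattern) {
        total += 1u + s.delayBC;
    }
    return total;
}
```

A length estimate like this could be used to check early whether a requested pattern fits into the available pattern buffer.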
For more diagnostics, ITSDAQ has a fully customizable (from json) trigger pattern generator.
As an example (not a useful pattern) from itsdaq:
```
"command": [
"idle 5",
"l0 1111 2",
"l0 0011 4",
"l0 1001 +14",
"l0 1000 +1",
"l0 1000 +1 BCR",
"fast 2 2",
"fast 3 2",
"fast 6 2",
"fast 10 2",
"reg abc read 6",
"reg abc read 7",
"reg abc read 8",
"reg abc read 0x46",
"idle 3",
"reg abc write 2 0x01001067",
"reg abc read 2",
"reg abc write 2 0x03c01067",
"reg abc read 2",
"reg abc write 2 0x02001067",
"reg abc read 2",
"idle 2"]
```
@otoldaie , @ztao

https://gitlab.cern.ch/YARR/YARR/-/issues/205
Measure the performance of calibration scans
2024-03-27T20:49:43+01:00
Alex Toldaiev
A calibration scan consists of 3 parts:
* configure the FEs
* readout the calibration data (i.e. produce the `FrontEndData`)
* plot and analyze it
Ideally:
* the readout part of the calibration (the `HWController` push to `DataProcessor`, which pushes the `FrontEndData` to the calibration analysis) runs as fast as the triggering (e.g. 500 triggers / 10kHz = 0.05s).
* And the triggering is done at the HW limit frequency for the full occupancy data packets.
Then, we can push the readout to its HW limit, i.e. the HW limit of the triggering. And in the overall calibration, the limit will be the analysis part.
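
The target above is simple arithmetic: the readout of one iteration should take no longer than the time needed to send its triggers. A minimal sketch with hypothetical helper names (`triggerWindowSeconds` and `readoutKeepsUp` are not YARR functions):

```cpp
#include <cassert>
#include <cmath>

// Time needed to send nTriggers at a fixed trigger frequency, in seconds.
double triggerWindowSeconds(unsigned nTriggers, double frequencyHz) {
    return nTriggers / frequencyHz;
}

// True if the measured readout time fits within the triggering time,
// i.e. the readout is not the bottleneck of the iteration.
bool readoutKeepsUp(double readoutSeconds, unsigned nTriggers, double frequencyHz) {
    return readoutSeconds <= triggerWindowSeconds(nTriggers, frequencyHz);
}
```

For the standard setup of 500 triggers at 10 kHz, the window is 0.05 s per iteration.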
Factors:
* Felixcore or felix-star & rdma? (matters only if the network is indeed a bottleneck now)
* Number of triggers per iteration (should just scale, we do not send too many triggers for the calibrations)
* Trigger frequency (what is the HW limit for full occupancy packets? Do we need FW triggers to reach it?)
* Run from 1 YARR & many – if the network & HWhandler make a bottleneck (and to speed up the plotting)
* Run with and without the analysis, only saving the data to the disk
* Also, try to save the output data to a memory-mounted disk.
* Then, at some point, we could also try the hit counters.
So, we need to start from our standard setup: Felixcore, 500 SW triggers at 10kHz, no memory-mounted disk, with all of the analysis, running from 1 YARR. Look at the profile and identify the current bottleneck. Then, according to what’s needed, try multiple YARRs, try a memory-mounted disk, etc. If everything is perfect, we will just increase the trigger frequency.
Together with the profile, we need to measure these times:
* The time to configure – Yarr already measures that, right?
* HWController’s handler (push) – this one must be within the triggering limit, otherwise it is the bottleneck
* StdDataLoop iteration & time between iterations
+ within it: DataProcessors parsing (pop)
* Also: analysis time & time to save to the disk
And more metrics:
* cache hits
* CPU occupancy?
* memory occupancy?
# Commands to use
## `perf` flamegraph profile
```
sudo perf record -F 99 -g -- <scanConsole command>
# or with /bin/time
sudo perf record -F 99 -g -- /bin/time <scanConsole command>
# it may work without sudo!
# I am not sure if it will save all stack frames then (the ones from inside linux too?)
# produces a perf.data file
# -F sets the sampling frequency (here 99 samples per second) -- increase it if more statistics are needed
# the flamegraph from the perf.data file:
sudo perf script | stackcollapse-perf.pl > out.perf-folded
cat out.perf-folded | flamegraph.pl > perf-kernel.svg
```
It needs `stackcollapse-perf.pl` and `flamegraph.pl` Perl [scripts](https://github.com/brendangregg/FlameGraph).
## Cache hits, CPU, memory
The cache hits, CPU, etc [cannot be obtained from perf](https://stackoverflow.com/questions/62550369/run-perf-stat-on-the-output-of-perf-record?rq=3) simultaneously with the `record` of the call stack profile. So, they will have to be run separately:
```
# CPU and cache hits:
perf stat <scanConsole>
# memory usage:
# TODO
```
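
For the memory-usage TODO, one option that needs no external tool is to query the peak resident set size from inside the process, for example at the end of the scan. A minimal sketch, assuming Linux (where `ru_maxrss` is reported in kilobytes); `peakRssKb` is a hypothetical helper, not part of YARR:

```cpp
#include <cassert>
#include <sys/resource.h>

// Peak resident set size of the calling process in kB.
// On Linux, getrusage() reports ru_maxrss in kilobytes.
// Returns -1 if the query fails.
long peakRssKb() {
    struct rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_maxrss;
}
```

This only gives the peak value, not the evolution over time, so it complements rather than replaces external monitoring.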
## Additional time counters inside YARR
For HWController, [`NetioHandler`](https://gitlab.cern.ch/YARR/YARR/-/blob/master/src/libNetioHW/NetioHandler.cpp#L65) passes a lambda as the handler. Its scope will not allow for a time counter. Make a dedicated `NetioHandler` method for the handler?
In [`FelixRxCore::on_data`](https://gitlab.cern.ch/YARR/YARR/-/blob/master/src/libFelixClient/FelixRxCore.cpp#L131), it's straightforward.
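
One possible alternative to a dedicated handler method is a small RAII timer that adds its lifetime to a counter owned by a longer-lived object; a lambda can then capture that counter by reference. A minimal sketch, assuming the counter can live in the object that owns the handler; `ScopedTimer` and `exampleHandlerTiming` are hypothetical names, and the sleeping lambda only stands in for the real handler body:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Adds the lifetime of this object, in microseconds, to an external counter.
class ScopedTimer {
public:
    explicit ScopedTimer(std::atomic<int64_t>& total_us)
        : m_total(total_us), m_start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        auto elapsed = std::chrono::steady_clock::now() - m_start;
        m_total += std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::atomic<int64_t>& m_total;
    std::chrono::steady_clock::time_point m_start;
};

// Example: a lambda (standing in for the data handler) timed through a
// counter captured by reference. Returns the accumulated time in microseconds.
int64_t exampleHandlerTiming() {
    std::atomic<int64_t> handler_us{0};
    auto handler = [&handler_us]() {
        ScopedTimer timer(handler_us);
        std::this_thread::sleep_for(std::chrono::milliseconds(2));  // placeholder for real work
    };
    handler();
    handler();
    return handler_us.load();
}
```

The atomic counter keeps the accumulation safe if the handler is called from several threads; the owning object can log the total in its destructor.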
In [`StdDataLoop`](https://gitlab.cern.ch/YARR/YARR/-/blob/master/src/libYarr/StdDataLoop.cpp#L36), we need the times `exec2 - exec1` and `exec1 - exec2` for the iteration time and between-iterations:
```
// src/libYarr/include/StdDataLoop.h
+#include <chrono>
+using Clock = std::chrono::steady_clock;
class StdDataLoop: public LoopActionBase, public StdDataAction {
...
+
+ // additional timings for calibrations performance
+ std::chrono::time_point<Clock> exec1_time;
+ std::chrono::time_point<Clock> exec2_time;
+ std::chrono::microseconds time_of_iteration{0}; // brace-init: in-class member initializers cannot use parentheses
+ std::chrono::microseconds time_between_iterations{0};
+ bool started_iterations = false;
};
// src/libYarr/StdDataLoop.cpp
+StdDataLoop::~StdDataLoop() {
+ SPDLOG_LOGGER_INFO(sdllog, "Time of iterations {} [us]", time_of_iteration.count());
+ SPDLOG_LOGGER_INFO(sdllog, "Time between iterations {} [us]", time_between_iterations.count());
+}
void StdDataLoop::execPart1() {
+ exec1_time = Clock::now();
+ if (started_iterations) {
+ time_between_iterations +=
+ std::chrono::duration_cast<std::chrono::microseconds>(exec1_time - exec2_time);
+ }
+ else started_iterations = true;
+
...
}
void StdDataLoop::execPart2() {
...
+
+ exec2_time = Clock::now();
+ time_of_iteration +=
+ std::chrono::duration_cast<std::chrono::microseconds>(exec2_time - exec1_time);
}
```
And in the [processing](https://gitlab.cern.ch/YARR/YARR/-/blob/master/src/libStar/StarDataProcessor.cpp#L87): add up the time inside the `while` loop of `StarDataProcessor::process_core`.

https://gitlab.cern.ch/YARR/YARR/-/issues/237
Fix the connection timeout of the socket for checkChannel()
2024-03-26T23:23:17+01:00
Angira Rastogi
We need to determine how long the connection from felix-client to the netio socket actually lasts. Right now, we have a placeholder time of 5 secs [here](https://gitlab.cern.ch/YARR/YARR/-/blob/devel/src/libFelixClient/FelixTxCore.cpp#L64). After this, checkChannel() can issue another send_data() call to re-establish the connection.
However, something to consider later is whether such a check is needed at all once YARR is integrated into the TDAQ state-machine-like environment.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/235
Running eye diagram scans for FELIX-based readout
2024-03-26T22:01:36+01:00
Angira Rastogi
Aim to create a utility script for users which scans over the important chip register settings to find the optimal configuration for reading out data from the FE. To be designed based on the soft error counters of the FELIX firmware.
Angira Rastogi, Laura Clare Nosler

https://gitlab.cern.ch/YARR/YARR/-/issues/234
Configure FELIX registers for firmware-based trigger generation during the scan with felix-star
2024-03-26T21:56:33+01:00
Angira Rastogi
The goal is to remove the need to run an extra configuration script before the actual scan, which sets the FELIX registers with the calibration injection and trigger sequence for each scan and FE type.
This step can be automated through the scanConsole command, based on the controller type, the front-end type from the connectivity file and the scan config.
Angira Rastogi, Laura Clare Nosler

https://gitlab.cern.ch/YARR/YARR/-/issues/233
Reading chip registers with FELIX
2024-03-26T21:45:33+01:00
Angira Rastogi
This is a very important step towards running full module electrical QC tests at LLS sites with a FELIX setup.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/232
Test data transmission with ITkPix triplet module using channel-bonding FELIX firmware for 4-lane readout
2024-03-26T21:41:35+01:00
Angira Rastogi
New FELIX firmware exists which supports multi-lane readout. Need to test running YARR calibrations with this new firmware.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/231
Running data merging with ITkPix quad module V2
2024-03-26T21:37:52+01:00
Angira Rastogi
Currently, we are unable to get "DECODING LINK ALIGNMENT" of the V2 module with the FELIX test stand. Based on discussions with Sasha, this issue has been seen before at other sites while running with a V1.1 module as well. It was traced back to the R1 & R2 resistors on the data adapter card, which are not needed.
Alternatively, one needs to understand the optoboard equalizer settings.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/229
Adding warning in FelixRxCore for incorrect polarity of idles from chip
2024-03-26T21:18:24+01:00
Angira Rastogi
This check can be a useful early debugging step for connectivity issues in the setup which will lead to failing YARR calibration scans.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/230
Creating default chip configuration for scanConsole compatible with FELIX setup
2024-03-26T21:17:16+01:00
Angira Rastogi
From discussion with various LLS users and others, it seems like a good idea to have a pre-adjusted default single chip configuration which will work out-of-the-box when running YARR scans with FELIX. This way we do not rely on users, with limited understanding of the various chip global registers, to tune the config file correctly.
Angira Rastogi

https://gitlab.cern.ch/YARR/YARR/-/issues/226
rd53b/std_noisescan.json blows up in memory, with all pixels masked
2024-03-23T00:52:41+01:00
Giordon Holtsberg Stark
When setting a noise scan to 48 hours (e.g. continuous) we are unable to get our scan to run longer than 30 minutes to an hour.
![Screenshot_2024-03-07_at_12.55.46_PM](/uploads/5d8dd5987344a58e62688655fad1fc6c/Screenshot_2024-03-07_at_12.55.46_PM.png)
Looking at `/var/log/messages` on the machine, we have
```
Mar 7 12:50:20 localhost kernel: Out of memory: Kill process 19057 (scanConsole) score 954 or sacrifice child
Mar 7 12:50:20 localhost kernel: Killed process 19057 (scanConsole), UID 1000, total-vm:64005232kB, anon-rss:30789092kB, file-rss:0kB, shmem-rss:0kB
```
and here's a link to grafana for the system stats for the ~1 hour period where this was run: https://itkpix-srv.ucsc.edu/grafana/d/W7Wf8xNVk/system-stats-node-exporter?orgId=1&from=1709841039633&to=1709844980580
![Screenshot_2024-03-07_at_1.06.30_PM](/uploads/e7a69477e15f709b2e5ab5144994545e/Screenshot_2024-03-07_at_1.06.30_PM.png)
and the noise scan in our localDB is here: https://itkpix-srv.ucsc.edu/localdb/component?id=64cb645b2037e1004253b928&collection=component&test=electrical&runId=65ea3449296d938f848a709c
Timon Heim, Luc Tomas Le Pottier

https://gitlab.cern.ch/YARR/YARR/-/issues/228
ITkPixV2 decoder fails to decode a single-hit QCore near the end of the stream
2024-03-20T10:48:27+01:00
Ondrej Kovanda
If the last QCore in the stream contains only one hit, and ends close to the end of the last block, that hit is not decoded. The problem is most likely here: https://gitlab.cern.ch/YARR/YARR/-/blob/devel/src/libItkpixv2/Itkpixv2DataProcessor.cpp?ref_type=heads#L319
It seems that, as it tries to retrieve the 16 bits to determine the hit map, it also consumes the only ToT present (4 bits), hits the end of the stream before all 16 bits are retrieved, and then quietly stops decoding. This is not observed in higher-occupancy events with more ToT bits, or in RD53b, where this is protected by the orphans.

https://gitlab.cern.ch/YARR/YARR/-/issues/227
Optimising Global Threshold Tuning
2024-03-08T20:42:36+01:00
Liam Foster
Implement variable step sizes in global threshold tuning as a function of occupancy as part of new two-step tuning procedure
Pass occupancy through to Itkpixv2GlobalFeedback, where it will be used to determine the step size, similarly to how the sign is handled now
Liam Foster
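
As an illustration of the idea only, here is a minimal sketch of a step size that scales with the distance of the measured occupancy from its target. `thresholdStep`, the linear scaling and the sign convention are all assumptions made for this example; this is not the `Itkpixv2GlobalFeedback` interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Signed step: large when the occupancy is far from the target, small when close.
// occupancy and target are fractions in [0, 1]; maxStep is the coarse step size.
// Assumed sign convention: too many hits -> positive step (raise the threshold).
int thresholdStep(double occupancy, double target, int maxStep) {
    double diff = occupancy - target;
    if (diff == 0.0) {
        return 0;
    }
    // Scale linearly with |diff| relative to the largest possible deviation.
    double scale = std::fabs(diff) / std::max(target, 1.0 - target);
    int magnitude = std::max(1, static_cast<int>(std::lround(scale * maxStep)));
    return diff > 0 ? magnitude : -magnitude;
}
```

With a target of 0.5 and `maxStep` of 16, an occupancy of 1.0 gives a full step of 16, while an occupancy just above the target gives the minimum step of 1.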