I ran a few preliminary scalability measurements on 1-8-14 HCCs, on a temporary branch. The broadcast elink with `send_data_nb`
seems to perform well. The test results also seem to make sense and look pretty good. I will post something soon on the measurements issue.
However, there were a couple of cases when the trigger thread was a bit too slow for some reason. E.g. it took more than 1s in some iterations with 50K triggers at 50KHz:
```
[FelixTxCore] Finished trigger in 1590258 us
[FelixTxCore] Finished trigger in 1366441 us
[FelixTxCore] Finished trigger in 1098479 us
[FelixTxCore] Finished trigger in 1114197 us
[FelixTxCore] Finished trigger in 1073770 us
```
In other cases, 50K triggers at 50KHz were issued in 1s with no problems.
Ok, it looks like broadcasting with `send_data_nb()` in `FelixTxCore::trigger` may be the best way to go. It sent up to 60K triggers and received all the replies back. The replies were empty packets, due to the high BVT setting. But the main thing is that the triggering is indeed fast enough, and the triggers did not get corrupted in the TX direction.
Strangely, when using the blocking call `felixclient::send_data` to send the triggers, the broadcast elink seems to be slower than the command elinks. And it is somewhat slower when the broadcast goes to multiple elinks. Not sure why that is so, but it seems to be reproducible.
If it can send 20KHz, we can emulate an average 40KHz trigger rate in a hacky way by just sending 2 L0As in 1 trigger word. That should get it close enough to the 50KHz limit.
Indeed, this branch may not be needed if the standard triggers can reach 20KHz. I will test it more on the broadcast elink, with more HCCs and at full occupancy.
I originally aimed to reach about 50KHz with SW triggers, as that seems like a good margin for full-occupancy packets:

```
max_packet_size = 2 * (10*64 + 2) [B]   // 10 ABCs, each sends 64 16-bit clusters, plus 1 header and 1 footer
max_rate ~= 1GBps / (14 * max_packet_size) ~= 56KHz
```
Yes, I ran a few quick tests, and it runs much faster than last year. Basically, it should be possible to redo the Felixcore scalability measurements with the broadcast elink up to a 20KHz rate, because it looks like felixclient sends the triggers fast enough up to 23-24KHz and only fails to keep up with the rate above that. Last year, on CentOS 7 with the old release etc., it seemed like 1-2KHz was the limit, even with only 1 HCC.
The tests ran on 1 TX elink & 8 RX, with a high BVT threshold so that the RX packets are empty, at trigger frequencies of 10-20-30-40-50 KHz. The number of triggers was 10K-50K correspondingly, so that the triggering is supposed to take 1 second.
I ran it on the "More robust StdDataLoop" MR, because `StdDataLoop` prints the time when it notices the trigger thread is done. But to be sure, I also added the timing in `FelixTxCore::doTriggerCnt`:
```cpp
void FelixTxCore::doTriggerCnt() {
  clk::time_point time_start = clk::now();
  ...
  auto timeElapsed =
    std::chrono::duration_cast<std::chrono::microseconds>(clk::now() - time_start);
  ftlog->warn("Finished trigger in {} us", timeElapsed.count());
}
```
Both timings, from `StdDataLoop` and `FelixTxCore`, are basically the same. Also, the first iteration takes longer than the others, so I look only at the last iteration. The triggering time grows like this:
| rate | time for 1 TX (8 RX) | time for 2 TX (14 RX) | broadcast to 1 TX | broadcast to 2 TX | broadcast to 2 TX with send_data_nb |
|---|---|---|---|---|---|
| 10 KHz | 1s | 1s | 1s | 1s | 1s |
| 11 KHz | | 1s | | | |
| 12 KHz | | 1.06s | | | |
| 15 KHz | | 1.14s | 1s | 1.01-1.1s | 1s |
| 18 KHz | | 1.36s | | | |
| 20 KHz | 1s | 1.67s | 1s | 1.11-1.26s | 1s |
| 22 KHz | 1s | 1.7s | 1s | 1.23-1.37s | |
| 23 KHz | 1.02s | | 1.05s | 1.3-1.37s | |
| 24 KHz | 1.11s | | 1.14s | 1.35-1.6s | |
| 25 KHz | 1.3s | 2.3s | 1.05s | 1.44-1.45s | 1s |
| 30 KHz | 1.3-1.5s | | 1.24s | 1.75-1.82s | 1s |
| 35 KHz | 1.5-3s | | | 2s | 1s |
| 40 KHz | 1.75s | | 1.7-1.8s | 2.3-2.5s | 1s |
| 50 KHz | 2.11s | | | | 1s |
| 60 KHz | | | | | 1s |
(For some reason, at 30KHz, the first iteration took 1.3s, which was less than the last one.)
Yes, 20MB is daunting indeed. I thought about mentioning it to TDAQ as an argument that we need a special word in the FELIX LCB encoder that would generate multiple IDLEs according to some count, e.g. 256 at a time. Then 1K triggers at 1KHz would take about 80KB. It could also have some granularity limit: generate 10 IDLEs per unit of the count, i.e. from 10 to 2560 IDLEs at a time.
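For reference, the arithmetic behind those numbers (my assumptions: one 16-bit LCB frame per 4 BCs, i.e. 100 ns, stored as 2 bytes per word, and the 256-IDLE count value from above):

```
frames between triggers at 1KHz = 1 ms / 100 ns = 10000
plain word sequence:  1000 triggers * 10000 frames * 2 B            ~= 20 MB
with IDLE-count word: 1000 triggers * (10000/256 + 1) words * 2 B   ~= 80 KB
```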
Indeed, in the case of multiple channels, it would have to send the whole sequence to each one. I think it is really viable only on the broadcast elink.
In connection with this, I also wanted to ask if it is possible to optimize `FelixTxCore::setTrigWord` to just store the pointer to the trigger word, instead of caching it in a private buffer:
```cpp
void FelixTxCore::setTrigWord(uint32_t *words, uint32_t size) {
  m_trigWords.clear();
  for (uint32_t i=0; i<size; i++) {
    m_trigWords.push_back(words[i]);
  }
}
```
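For illustration, something along these lines is what I have in mind (the member names `m_trigWordsPtr` and `m_trigWordsSize` are made up), assuming the caller keeps the buffer alive while the triggers are being sent:

```cpp
// Sketch only: keep a view of the caller's buffer instead of copying it.
void FelixTxCore::setTrigWord(uint32_t *words, uint32_t size) {
  m_trigWordsPtr  = words;  // hypothetical member: uint32_t* m_trigWordsPtr{nullptr};
  m_trigWordsSize = size;   // hypothetical member: uint32_t  m_trigWordsSize{0};
}
```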
Since `g_tx->setTrigWord` gets called on every `StarTriggerLoop::init`, it can be expensive.
An option `stackTriggers` in `StarTriggerLoop`. If it is set to true and the trigger mode is count (i.e. `trig_count` > 0), the loop sets up its trigger word to include all the triggers, separated by IDLEs, with or without the injections, and it sets the HWController to issue the word only once.
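Roughly, the idea is something like the following sketch (illustration only, assuming the usual `TxCore` calls; `LCB_IDLE`, `idlesPerTrig`, `m_trigWord`, and `m_trigCnt` are placeholder names):

```cpp
// Build one long word sequence containing all triggers and their spacing, then issue it once.
void StarTriggerLoop::setupStackedTriggers() {
  const uint32_t LCB_IDLE     = 0x0;   // placeholder for the real IDLE frame
  const uint32_t idlesPerTrig = 100;   // placeholder: spacing that matches the trigger frequency

  std::vector<uint32_t> words;
  words.reserve(m_trigCnt * (idlesPerTrig + 1));
  for (unsigned i = 0; i < m_trigCnt; ++i) {
    words.push_back(m_trigWord);                        // one L0A frame (with or without the injection)
    words.insert(words.end(), idlesPerTrig, LCB_IDLE);  // IDLEs until the next trigger
  }
  g_tx->setTrigWord(words.data(), static_cast<uint32_t>(words.size()));
  g_tx->setTrigCnt(1);  // tell the HWController to issue the whole sequence only once
}
```

(With the current copying `setTrigWord` the local vector is fine; with the pointer-only variant discussed above, the buffer would have to outlive the trigger loop.)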
The goal is to be able to rerun the scalability measurements that were done on Felixcore last year. The problem last year was that on felix-star it would not push the trigger rate over 1-2KHz, even on 1 HCC.
Also corrected a random inconsistency in tabs vs. spaces: `StarTriggerLoop` uses tabs and `FelixTxCore` uses spaces.
modified: src/libStar/StarTriggerLoop.cpp
modified: src/libStar/include/StarTriggerLoop.h
modified: src/libFelixClient/FelixTxCore.cpp
3 minor improvements:
`maxConsecutivePushes`. If set to > 0, it limits the number of pushes to the data processors before looking at their feedback. I.e. `maxConsecutiveRxReads` roughly limits the size of the `RawDataContainer`s, and `maxConsecutivePushes` limits the number of containers that are pushed before `StdDataLoop` checks the processing counts. The goal is to handle large streams of data or autogenerated data, like HPRs in Strips.
HPRs are not currently separated from data in the FELIX FW. We turn them ON for debugging, but then they clog `StdDataLoop` if you debug many HCCs and ABCs. Admittedly, the debugging could also be done with `DataGatherer`. But `StdDataLoop` is handy, because it specifies the number of triggers you expect.
Filtering HPRs in FW will take some thinking time, because Strips have many RX elinks and duplicating them just for HPRs and register reads seems like a waste. It would be nice to aggregate HPRs into a few FW elinks in some meaningful way, e.g. to timestamp them etc. All of that needs thinking. Meanwhile, let's just try to handle them in software to some reasonable extent.
Also notice that the parameter value 0 means no limit for both `maxConsecutivePushes` and `maxConsecutiveRxReads`. So, if they are both set to 0 and `maxIterationTime` is set to 0, `StdDataLoop::execPart2` behaves exactly like before the processing feedback was added: it reads the `RxCore` until there is no more data, pushes everything it got to processing, and returns.
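For illustration, the interplay of the two limits is roughly the following (a sketch with made-up helper names, not the actual `StdDataLoop::execPart2` code):

```cpp
// Sketch: a value of 0 means "no limit" for either parameter.
unsigned nReads = 0, nPushes = 0;
auto cont = makeEmptyContainer();      // hypothetical helper
while (moreDataExpected()) {           // hypothetical helper
  if (auto data = readFromRx()) {      // hypothetical wrapper around the RxCore read
    cont->add(std::move(data));
    ++nReads;
  }
  // maxConsecutiveRxReads roughly bounds how much data ends up in one container
  if (m_maxConsecutiveRxReads > 0 && nReads >= m_maxConsecutiveRxReads) {
    pushToProcessors(std::move(cont)); // hypothetical helper
    cont = makeEmptyContainer();
    nReads = 0;
    ++nPushes;
  }
  // maxConsecutivePushes bounds how many containers go out before checking the feedback
  if (m_maxConsecutivePushes > 0 && nPushes >= m_maxConsecutivePushes) {
    checkProcessingCounts();           // hypothetical helper
    nPushes = 0;
  }
}
```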
Then, `StdDataLoop` logs at `DEBUG` level when it notices that the trigger thread is done. It is a nice metric to log, and it is needed to measure how fast felixclient can send triggers. Currently, the Strips `StarTriggerLoop` code sends the triggers one at a time to each TX elink, and felixclient cannot reach above a 1-2KHz rate.
Finally, it logs the empty cycle statistics at `TRACE` level instead of `DEBUG`, to keep the log cleaner. It may also be reasonable to just skip logging empty cycles altogether.
modified: src/libYarr/StdDataLoop.cpp
modified: src/libYarr/include/StdDataLoop.h
A `size_t m_maxMessageSize` in `FelixRxCore`: if it is > 0, `on_data` compares it with the received message size. It helps in debugging problematic cases, like the parsing errors in Strips.
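For illustration, the check amounts to something like this (a sketch: the logger name, message text, and the abbreviated argument list are placeholders):

```cpp
// Sketch of the size check inside the on_data callback (arguments abbreviated).
void FelixRxCore::on_data(/* ..., */ size_t size /* , ... */) {
  if (m_maxMessageSize > 0 && size > m_maxMessageSize) {
    logger->warn("Received a {}-byte message, above maxMessageSize = {}", size, m_maxMessageSize);
  }
  // ... normal handling of the message follows
}
```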
modified: src/libFelixClient/FelixRxCore.cpp
modified: src/libFelixClient/include/FelixRxCore.h