bugfix and improvement for Devel_rd53a_felix_netio_multichip_rebase (!311) · Merge requests · YARR / YARR

Zijun Xu requested to merge devel_rd53a_felixNetio_multichip_rebase_zixu_Working into devel_rd53a_felixNetio_multichip_rebase Mar 15, 2021

bugfix and improvement for direct FELIX DAQ

==========================================================================

Issue 1) felixcore had to be restarted for every single YARR scan:

solution:

a) update to the latest felixcore version, felix-04-01-01-stand-alone (the rc version has bugs);

b) unsubscribe the Rx sockets, !311 (0159358c)

Now the YARR digital scan can be run hundreds of times without restarting felixcore. A restart is only needed occasionally, after a crash.

==========================================================================

Issue 2) the Sync FE was not working

solution:

a) GlobalPulse is needed for chip configuration; !311 (fc8d980d)

b) the netio Tx used a buffered socket to send normal Cmds and a low_latency socket to send Triggers. I changed it to use only the low_latency socket.

The problem with the buffered socket is that we sometimes need to keep a time gap between Cmds, but the buffering can absorb that gap. So, to keep things simple, Tx now uses only the low_latency socket. !311 (9665d456)

Now the Sync FE works fine for the digital scan. (There may still be issues with analog or other scan types, but that is a separate issue, or an intrinsic "feature" of the Sync FE.)
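The gap-absorption effect can be illustrated with a toy model. This is only a sketch under stated assumptions: `TimedCmd`, `lowLatencyArrivalTicks`, `bufferedArrivalTicks` and `minGap` are hypothetical names made up for illustration, time is counted in arbitrary ticks, and nothing here uses or reflects the real netio API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only (hypothetical, not netio): a command word and the tick
// at which the caller hands it to the sender.
struct TimedCmd {
    std::uint32_t word;
    int sendTick;
};

// Low-latency model: each command leaves at the tick it was submitted,
// so the caller's spacing is preserved.
std::vector<int> lowLatencyArrivalTicks(const std::vector<TimedCmd>& cmds) {
    std::vector<int> ticks;
    for (const auto& c : cmds) {
        ticks.push_back(c.sendTick);
    }
    return ticks;
}

// Buffered model: commands are collected and all leave together at
// flushTick, so they arrive back to back.
std::vector<int> bufferedArrivalTicks(const std::vector<TimedCmd>& cmds,
                                      int flushTick) {
    return std::vector<int>(cmds.size(), flushTick);
}

// Smallest spacing between consecutive arrivals (0 if fewer than two).
int minGap(const std::vector<int>& ticks) {
    if (ticks.size() < 2) return 0;
    int gap = ticks[1] - ticks[0];
    for (std::size_t k = 2; k < ticks.size(); ++k) {
        gap = std::min(gap, ticks[k] - ticks[k - 1]);
    }
    return gap;
}
```

In this model, two Cmds submitted 100 ticks apart keep their 100-tick spacing on the low-latency path, while the buffered path delivers them with zero spacing.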

==========================================================================

Issue 3) some pixels (in a special pattern) lose 1 hit

This was observed and reported by Egor a long time ago.

This issue happened about once in every 10 digital scans, on average.

We guessed that the missing hits were caused by a data packet being lost somewhere in the chain FELIX FW -> FELIX SW -> YARR SW. Marco is pretty sure that no packets are lost between FELIX FW and FELIX SW, and the YARR SW is always a suspect. After testing with netio_cat, I could confirm that the YARR SW and netio_cat always receive the same packets.

After a long chase, I now believe the problem occurs when the Trigger Cmd is sent.

see here: https://gitlab.cern.ch/YARR/YARR/-/blob/devel_FelixNetIO_StarChip/src/libRd53a/Rd53aTriggerLoop.cpp#L47-67

Each trigger() call sends the 16 words.

word[15, 14] are hit injection, https://gitlab.cern.ch/YARR/YARR/-/blob/devel_FelixNetIO_StarChip/src/libRd53a/Rd53aTriggerLoop.cpp#L48

word[14-(delay/8)-i] is the Rd53a Trig Cmd, https://gitlab.cern.ch/YARR/YARR/-/blob/devel_FelixNetIO_StarChip/src/libRd53a/Rd53aTriggerLoop.cpp#L60

By default, the delay value is 56. Since 14 - 56/8 = 7, the Trig Cmds are in word[6, 7].

All 16 words must be sent to the rd53a in sequence and continuously, without any pause. Only then do we receive all the expected event headers and hits.

But if, for some reason, the felix SW/FW splits the hit injection words and the trigger words into 2 packets, there will be a time gap between them. In this case we still receive the expected number of event headers, because every Trig Cmd is still valid. But because the latency between the hit injection and the Trig Cmd is too large, we never see any hits from the rd53a.
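The index arithmetic above can be checked with a small sketch. `triggerWordIndex` and `triggerWordIndices` are hypothetical helpers written for illustration, not functions from YARR, and the assumption that `i` takes the values 0 and 1 is inferred from the two trigger words mentioned in the text.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper, not part of YARR: index of the i-th trigger word in
// the 16-word buffer, following word[14 - (delay/8) - i] from the text.
constexpr int triggerWordIndex(int delay, int i) {
    return 14 - (delay / 8) - i;
}

// Both trigger word indices, assuming i = 0 and i = 1 (inferred from the
// two trigger words quoted in the description).
constexpr std::array<int, 2> triggerWordIndices(int delay) {
    return {triggerWordIndex(delay, 0), triggerWordIndex(delay, 1)};
}
```

With `delay = 56` this gives indices 7 and 6, matching word[6, 7]; the hit injection stays fixed in word[14] and word[15] regardless of the delay.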

So I reduced the delay value from 56 to 24, so that the Trig words, now word[10, 11], are closer to the hit injection word[14, 15] than before. With this change I have not seen the missing-hits issue in more than 500 digital scans (reminder: it used to happen about once in every 10 digital scans, on average). So it looks like the word split no longer happens.
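One way to see why a smaller delay reduces the exposure is to count the word boundaries at which a split would separate the trigger words from the hit injection words. This is only a sketch: the helper names are hypothetical, and it assumes a split can fall on any word boundary, which has not been confirmed for the felix FW/SW.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; not part of YARR or felix.

// Lowest-index hit injection word (injection occupies word[14] and word[15]).
constexpr int kFirstInjectionWord = 14;

// Highest-index trigger word, i.e. word[14 - (delay/8) - i] with i = 0.
constexpr int highestTriggerWord(int delay) {
    return 14 - (delay / 8);
}

// Number of word boundaries between the highest trigger word and the first
// injection word; a split on any of them separates trigger from injection.
constexpr int splitPointsBetween(int delay) {
    return kFirstInjectionWord - highestTriggerWord(delay);
}

// Number of words from the lowest trigger word (i = 1) up to word[15].
constexpr int spanWords(int delay) {
    return 15 - (highestTriggerWord(delay) - 1) + 1;
}
```

Under these assumptions, `delay = 56` leaves 7 boundaries where a split would separate the two groups (a 10-word span), while `delay = 24` leaves 3 (a 6-word span). The split is made less likely to land between them, not impossible.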

So, this is a workaround: reduce the delay value, !311 (77eb7908)

But we need a real solution: we need to check the felix FW/SW to confirm the word-split issue, and find a way to protect a blob of words from being split into smaller objects.

Edited Mar 15, 2021 by Zijun Xu
