
Implement a transparent & periodic OptoHybrid FPGA SEU recovery procedure

Laurent Petre requested to merge feature/oh-fpga-seu-reprogramming into main

Description

Because electronics are exposed to radiation, Single Event Upsets (SEUs) can be observed. This is particularly true for FPGAs, which store their configuration in SRAM. An SEU can have no impact, lead to a short-duration functional error (if corrected), or lead to a long-duration functional error (if not corrected). To mitigate the issue, the OptoHybrid FPGA configuration memory is continuously scrubbed for SEU-induced errors, and errors are corrected whenever possible. However, some errors cannot be corrected (e.g. two errors in the same configuration frame). In operations at P5, it is regularly observed that the FPGA temperature can no longer be read out, which leads to the chamber being powered off for safety reasons.

Building on the recent high-granularity local resets firmware change, this merge request aims at re-configuring from scratch any OH FPGA for which uncorrectable SEUs are detected. The configuration sequence is first updated to work without a TTC HardReset, since less impactful firmware features are now sufficient to configure the front-end (and since all automatic actions upon TTC HR are now removed).

The second commit implements the full re-configuration of an isolated OH FPGA, without any side effect on an ongoing run: the DAQ path is left untouched, with fully valid event building, and the other OHs/chambers are not impacted, including their counters. To avoid sending garbage to the EMTF, the TX trigger links are disabled before the operation and re-enabled once it succeeds. Note that nothing can be done on the OTMB side, since the data is transmitted directly from the OH to the OTMB and no signaling endpoint exists on the CSC side. The functionality is made available through a new "Expert Page" in the AMCManager application.
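The ordering described above (disable the TX trigger links, re-configure, re-enable only on success) can be sketched as follows. This is only an illustration in Python, not the code of this merge request: `FakeOptoHybrid`, `recover_oh_fpga`, and every method name are hypothetical, and splitting the operation into `program_fpga` and `configure_frontend` is an assumption made for readability.

```python
class FakeOptoHybrid:
    """Hypothetical stand-in for one OptoHybrid; NOT the real software API."""

    def __init__(self, fail_programming=False):
        self.trigger_tx_enabled = True
        self.fail_programming = fail_programming
        self.log = []  # records the order of the operations

    def disable_trigger_tx(self):
        self.trigger_tx_enabled = False
        self.log.append("disable_trigger_tx")

    def enable_trigger_tx(self):
        self.trigger_tx_enabled = True
        self.log.append("enable_trigger_tx")

    def program_fpga(self):
        self.log.append("program_fpga")
        if self.fail_programming:
            raise RuntimeError("FPGA programming failed")

    def configure_frontend(self):
        self.log.append("configure_frontend")


def recover_oh_fpga(oh):
    """Re-configure a single OH; return True on success, False otherwise.

    The TX trigger links are disabled first so that no garbage is sent
    downstream, and they are re-enabled only if every step succeeded.
    """
    oh.disable_trigger_tx()
    try:
        oh.program_fpga()
        oh.configure_frontend()
    except RuntimeError:
        # Assumption: on failure the links are left disabled.
        return False
    oh.enable_trigger_tx()
    return True
```

Calling `recover_oh_fpga(FakeOptoHybrid())` walks through the four steps in order, while `recover_oh_fpga(FakeOptoHybrid(fail_programming=True))` stops after the failed programming step and leaves the trigger links disabled.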

Finally, the automatic VFAT masking is enhanced into an "automasker", which includes recovery features. At the moment, only the OH FPGA recovery upon an uncorrectable SEU error is implemented, as a proof of concept. It remains disabled by default. More recovery features can (and should and will) be implemented. That would require building an actual full-featured automasker class/system that keeps track of the number of errors detected, the number of recoveries attempted, etc. for bookkeeping purposes. Note that the introduction of the automasker required small backward-incompatible adjustments to the xDAQ XML configuration parameters of the AMCManager application.
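As a rough idea of what such bookkeeping could look like, here is a minimal Python sketch. It is not part of this merge request: the `Automasker` and `RecoveryStats` names, the `recover` callback, and the `max_attempts` limit are all hypothetical, and the limit in particular is an assumption that the description above does not mention.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStats:
    """Per-OH counters kept for bookkeeping purposes (hypothetical)."""
    errors_detected: int = 0
    recoveries_attempted: int = 0
    recoveries_succeeded: int = 0


class Automasker:
    """Hypothetical automasker skeleton; recovery is disabled by default."""

    def __init__(self, recover, enabled=False, max_attempts=3):
        self.recover = recover            # callable(oh_id) -> bool
        self.enabled = enabled
        self.max_attempts = max_attempts  # assumption: cap on retries per OH
        self.stats = {}                   # oh_id -> RecoveryStats

    def on_uncorrectable_seu(self, oh_id):
        """Count the error and attempt a recovery if allowed."""
        stats = self.stats.setdefault(oh_id, RecoveryStats())
        stats.errors_detected += 1
        if not self.enabled or stats.recoveries_attempted >= self.max_attempts:
            return False
        stats.recoveries_attempted += 1
        succeeded = bool(self.recover(oh_id))
        if succeeded:
            stats.recoveries_succeeded += 1
        return succeeded
```

Keeping the counters per OH means errors are still recorded while recovery is disabled, so the statistics remain meaningful even before the feature is switched on.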

Related Issue

No GitLab issue... 😕 But multiple e-logs, including the "master" one: http://cmsonline.cern.ch/cms-elog/1211428

How Has This Been Tested?

Thoroughly tested on the GE1/1 integration setup in building 904 @ CERN. Deployed at P5 on the 7th of June 2024 and has been working like a charm since then. The successful completion of the OH FPGA re-configuration, as well as the absence of side effects, has been tested both in manual mode and in automatic mode (with the injection of correctable and uncorrectable SEU errors).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
