Frontends can send error banks instead of, or in addition to, normal sub-detector raw banks. Events containing error banks are not processed by the regular HLT1 sequence, but are instead flagged with a dedicated routing bit. Error banks should be monitored in a histogram.
The MFP header contains a block of bank types. These need to be passed to the sequence (for example in the `BanksAndOffsets` object). A filter then needs to be applied before the GEC, such that only events without error banks are processed downstream. Bank type integers above a certain value denote error banks.
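A minimal sketch of that check, assuming a hypothetical threshold `first_error_type` above which all bank type integers denote error banks; the names are illustrative, not part of the Allen API:

```cpp
#include <cstdint>
#include <vector>

// Returns true if any bank type in the MFP header block denotes an
// error bank, i.e. lies at or above the (assumed) error threshold.
bool has_error_bank(const std::vector<std::uint8_t>& bank_types,
                    std::uint8_t first_error_type)
{
  for (const auto t : bank_types) {
    if (t >= first_error_type) return true;  // error types sit above the normal range
  }
  return false;
}
```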
- Add a method to `IInputProvider` that returns a `vector<unsigned char>` which is 1 for each event in the batch that has an error bank. This should be set by checking all bank types, not just the ones that will be needed on the device (see the sketch after this list).
- Add an algorithm that copies the "at least one error bank" vector to the device.
- Add a line that fires if there is at least one error bank.
- Update the global event cut to use an event list instead of initializing it.
- Always initialize the event list with the trivial initialization algorithm.
- Add an error-event filter algorithm.
- The global event cut, all reconstruction algorithms and all lines other than the error-event line should require the error-event filter.
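A rough sketch of the first item, under the assumption that the mask is computed on the host per batch; the method name `error_bank_mask`, the helper and the data layout are all illustrative, not the actual Allen interface:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical extension of IInputProvider (the method name is a guess).
struct IInputProvider {
  virtual ~IInputProvider() = default;

  // Returns a vector with one entry per event in the batch: 1 if the
  // event contains at least one error bank, 0 otherwise.
  virtual std::vector<unsigned char> error_bank_mask(std::size_t first_event,
                                                     std::size_t n_events) const = 0;
};

// How an implementation might fill the mask: scan the bank types of
// *all* raw banks of each event, not only those shipped to the device.
std::vector<unsigned char> make_error_bank_mask(
  const std::vector<std::vector<std::uint8_t>>& bank_types_per_event,
  std::uint8_t first_error_type)
{
  std::vector<unsigned char> mask(bank_types_per_event.size(), 0);
  for (std::size_t i = 0; i < mask.size(); ++i) {
    for (const auto t : bank_types_per_event[i]) {
      if (t >= first_error_type) {
        mask[i] = 1;  // at least one error bank in this event
        break;
      }
    }
  }
  return mask;
}
```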
Motivation for the above:
- Events with error banks should be detected for all subdetectors, not just those reconstructed in HLT1; otherwise error banks cannot be properly monitored by the shifters and subdetector monitoring tasks.
- Events with at least one error bank in any subdetector should not be reconstructed; i.e. if there is an error bank in any subdetector, the data of that event is assumed to be bad.
- The global event cut should become composable with other filters.
One note for the future is that some kind of rate scaling should probably be implemented to avoid overloading the output network/DAQ if a subdetector suddenly starts sending only error banks. From experience in Run 2 this is quite likely to happen at some point.
We don't have rate scalers yet. @graven, do you have any idea how that could be implemented, taking into account that the MEP packing factor is expected to be about 30k and events are processed in batches of about 1k?
Knowing the filling scheme, some data from ODIN (downscales, physics triggers, other triggers, etc.) and the number of active Allen processes could perhaps be used to infer the fraction of data that is received by all HLT1 processes. I don't know whether that would be sufficiently reliable, or whether a reference rate (perhaps a special event type from ODIN?) would work better.
I think it is worth discussing whether events containing an error bank should really not be processed at all. They could just be missing data from a small sub-portion of the detector. When discussing with @gvouters, @mfontana and @ascarabo, the idea came up to still process data with error banks, but flag them with a trigger line. This works under the assumption that the error banks still contain raw data in the same format as normal raw banks for those detector parts without errors.
I think we should factorize the problem into two independent parts:

1. Events with error banks should be accepted by a dedicated line, and we should discuss how to make sure that this won't flood the system with accepted events: either by rate-limiting, by some trivial scaling which would e.g. require some finite amount of time to pass before the next one is accepted (*), or by setting a maximum number of such triggers per batch (**).
2. Whether events with error banks can also be processed as usual actually depends on the details of the subsystem readout, so it has to be figured out case by case.
(*) Like a rate limiter, this implies that the decisions made are not entirely deterministic, which can be traced back to the fact that a single event doesn't define a 'rate'; you need multiple events to define a rate.
(**) Note that a batch can actually be used to define a rate, so this is probably the easiest way to get a reasonable approximation ;-)
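A minimal sketch of the per-batch cap in (**), assuming the error-line decisions for one batch are available as a host-side vector; the function name and the cap parameter are illustrative:

```cpp
#include <vector>

// Keep at most max_per_batch positive error-line decisions per batch.
// Since a batch (~1k events out of a ~30k-event MEP) covers a roughly
// fixed slice of the data stream, this approximates a rate limit.
void cap_error_decisions(std::vector<unsigned char>& decisions,
                         unsigned max_per_batch)
{
  unsigned accepted = 0;
  for (auto& d : decisions) {
    if (d && ++accepted > max_per_batch) d = 0;  // veto the excess
  }
}
```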
I would advocate a safety-first approach: we do not try to reconstruct events with error banks online during the commissioning period. Once we get some experience with the system and have had some time to look at them offline we can, depending on their frequency, discuss whether it is appropriate to dedicate developer time to addressing (2) in @graven's comment. But I think only (1) should be addressed for now, to save energy for more pressing priorities.
I agree with that. Since nothing has really been discussed for the error banks, I'm not even sure what information and data format you would get for some error banks during the beam test. But we should definitely initiate a discussion about what to do for each of them. Several errors are possible, with different impacts on the data...
And since this might lead to some requests on the firmware side, it would be great to do that before the end of the year.
I agree with both @graven and @gligorov. From the HLT1 point of view, events with error banks are not physics or calibration data and should therefore not be reconstructed.
Events with error banks are critical to the monitoring of all sub-detectors and should therefore be made available to the various sub-detector-specific monitoring tasks. That's what the dedicated line is for. A routing bit will be assigned to facilitate routing of these events by the DAQ. It may also help with commissioning if they are written to a separate set of files for debugging purposes, but that is outside the scope of HLT1.
But then we need to define what an error bank is.
If one link is failing on one TELL40, the bank is currently tagged as ERROR on that TELL40. But this means you will lose all the physics data of the other links for just one failing link? This is not efficient.
I'm operating under the assumption that the fraction of events containing error banks, for whatever reason, will be at the sub-permille level, i.e. we don't care about losing that data for physics/calibration. If this assumption turns out to be wrong, we can come up with something better then.
At the same time, if more than a permille of events contain genuine errors alongside good physics/calibration data, I think there is a bigger problem to solve.
For commissioning purposes the error line as proposed will make sure events can be monitored and stored for debugging.
If the decision is to discard error events, then on the firmware side I'll ask to reduce the size of the bank to just an error-information field, to reduce the output bandwidth.
This will actually make firmware developers' lives easier.
Error events will not be discarded, but they will not be considered by physics/calibration triggers. Instead they will be tagged and then accepted by the Hlt1ErrorEvent trigger line, which allows them to be forwarded to sub-detector monitoring tasks and storage.
Some safety measures will be put in place to avoid overloading the system if too many error banks are being sent, e.g. exceeding the bandwidth to the monitoring farm or the capacity of the error-bank monitoring jobs; these are still to be decided and tuned. Depending on what we encounter, the safety measures may have to be subdetector-specific, such that one subdetector sending a lot of error banks does not mask error banks from another subdetector in the monitoring farm.
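Purely as an illustration, a subdetector-specific variant of the per-batch cap could look as follows; the subdetector count and the cap value are assumptions, not agreed numbers:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t n_subdetectors = 16;  // hypothetical count

// Per-batch throttle: one subdetector flooding error banks cannot
// use up the error-event budget of the others.
struct ErrorBankThrottle {
  std::array<unsigned, n_subdetectors> counts {};  // accepted so far in this batch
  unsigned max_per_subdetector = 10;               // tuning parameter

  // True if an error event from this subdetector may still be accepted.
  bool accept(std::size_t subdetector_id)
  {
    if (counts[subdetector_id] >= max_per_subdetector) return false;
    ++counts[subdetector_id];
    return true;
  }

  void new_batch() { counts.fill(0); }  // reset at each batch boundary
};
```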
@gvouters the issue is not so much whether one link is failing on one TELL40, but rather whether this is an intermittent issue or a known dead link. A known dead link which cannot be fixed would of course need to be masked. But for intermittent issues we have generally taken the view, iirc, that we prefer to take perfect data only.