Muon Decoding Errors and assert break

changed the description

@samarian can you please have a look ASAP if this will also affect the Allen muon decoding?

I do see the same by running on data yes, which is expected considering that the Allen decoding is mostly ported from MuonRawInUpgradeToHits. I'll try to dig a bit, but let me also cc @bsciasci e @satta

changed the description

Interesting thanks…

Side comment but we should start to update/clone the decoding tests we have in LHCb etc to run tests on real pit data…

It's a very important comment, but until now we've simply been too short-handed to be able to get on this. We need to prioritise it though.

it wasn't a complaint, more a comment to self..

First thing is to collect some MDF files from the pit, and make them available in EOS etc. Right now that might mean different ones for different sub-systems, as not everyone is in global yet... Or do we hve some runs where everyone (other than RICH sadly...) is available ?

lets move discussions on decoding tests to #243

To be correct, it is not a segfault per se, it is an assert that fails, so somewhere there must be a logic error in the code. Meaning, it runs without a crash in the opt build.

I see the same thing on MC (when running on new samples), so this is not data specific.

okay, I modified the title to be clear. Apart from the assert, there is error about Muon bank is too short seen in data as posted.

The errors in the opt builds should be ignored for now. Asserts are only tested in debug builds, in the opt builds they are skipped, so clearly what you see in the opt builds is just a consequence of whatever reason the assert is failing, and its this the debugging should focus on.

I think there is a mix up: It is quite likely these are two different issues, as on MC I see the condition for the assert failing all the time (I just implemented a check myself, running on the opt build), and the 100 events run through successfully.

Obviously there is still something wrong, but I doubt what fires the assert causes the crash after 140001 events.

can we run the same check on data? It can well be that something is different between data and simulation so the assert only fails in data very rarely

Well, it was not great science I did I just put in an if statement with a std::cout << "error" << std::endl;.

It is this line: https://gitlab.cern.ch/lhcb/LHCb/-/blob/master/Det/MuonDet/src/Lib/MuonTilePositionUpgrade.cpp#L109

I also gave a look to the muon bank too short crash in data and the reason seems to be that the event 144316 is the first containing a bank with raw_bank -> size() = 0. This happens for one sourceID in many events:

EventSelector                       SUCCESS Reading Event record 144316. Record number within stream 1: 144316
BANK SIZE 0
BANK SOURCE ID 28686
BANK TYPE 13
EventSelector                       SUCCESS Reading Event record 144317. Record number within stream 1: 144317
EventSelector                       SUCCESS Reading Event record 144318. Record number within stream 1: 144318
BANK SIZE 0
BANK SOURCE ID 28686
BANK TYPE 13
EventSelector                       SUCCESS Reading Event record 144319. Record number within stream 1: 144319

and hits this control. By changing:

-- if ( (unsigned int)raw_bank->size() < 1 ) { OOPS( EC::RawToHits::ErrorCode::BANK_TOO_SHORT ); }
++ if ( (unsigned int)raw_bank->size() < 1 ) continue;

I can successfully run over all events in the file, but I let muon experts (cc also @dbrundu ) to comment if this is desirable

I added Michel's hack to print out more info with the real data in the debug mode, so it seems to me that this line should be blamed for the assert break.

 resize m_posPad size with channels: 23040
 assert .. GridX:  0 station: 0 region: 0 posPad[0] size 23040
 assert .. GridY:  0 station: 0 region: 0 posPad[0] size 23040
 assert .. GridX:  1 station: 0 region: 1 posPad[1] size 18432
 assert .. GridY:  1 station: 0 region: 1 posPad[1] size 18432

There are also cases, with unrealistic values, which I was wondering if it is because the m_podPad size is not initialized properly.

assert .. GridX:  12 station: 3 region: 0 posPad[12] size 12300831782225310151
assert .. GridY:  12 station: 3 region: 0 posPad[12] size 12300831782225310151

@samarian what you described seems suspiciously compatible with a hardware problem we are investigating: sometimes some data links suddenly send a high number of idle frames and cause a link de-synchronization. All subsequent events are tagged with a DAQ error type 0x5A by the TELL40, until a new synchronization is performed. I don't know if this could explain the null size of the raw bank. Anyway, the change you propose is fine, but maybe it would be better to generate a warning. Is it possible?

A seg. Fault is of course never acceptable, so whatever the decoding needs to do to handle these errors gracefullly, this should be done.

A warning/error is a good idea, but bear in mind what might seem a good idea when running a single job locally, might become an unmanageable stream of messages at the pit when running the full hlt, so be careful with such things.

as @jonrob says any warnings/errors should come with a strictly limited number of printouts before they are muted.

@ascarabo can you verify that your error banks lines correctly fire for these events?

yes if I run my sequence on it this happens:

I wonder if the DAQErrorBanks could really explain the null size of the error banks, given that the bank type is reported as 13? DAQErrorBanks should be of bank type 89 or larger if I'm not mistaken.

Here are some changes along the lines of what @samarian suggested: !3736 (merged)

changed title from Muon Decoding segfault to Muon Decoding Errors and assert break

changed the description

mentioned in issue #243

mentioned in merge request Allen!895 (merged)

Concerning the assert that stops the running on real data, we can clearly remove the assert and something different, but this is something that should never happens.. this is why there is this assert. In Run1-2 there was a similar checks and it never fails. Before changing it I think we need to carefully understand what is happening. I checked a bit and there are oredr of 10k events with this error. In the first half of then it happens only in one pci that it extends to other 2 pci which is quite strange. In our tell40 we have an header which is always written, In such corrupted banks also the header is missing. We need to understand if it can be due to HW upstream the tell40, to the Tell40 or to the EB

@satta I see your point but even if this "should never happen" an assert is not ideal from an operational point of view. So I would remove the assert immediately and then indeed try to understand what happens so you fix the cause and not only the symptom.

@jonrob @graven do you agree with my comment about asserts in general?

Asserts have a use, I use them in the RICH, but you have to bear in mind two things with them

They only apply in debug builds. In opt builds them are ignored.
If you hit them you immediately terminate the application.

For these reasons they are not the thing to use if you want to validation something in real data. even if you think its a 'this should never happen' as seen here these do sometime happen.

So yes, in this case something else would probably be appropriate.

I have never understood the value of Assert unless you are actively debugging a piece of code. As already said, you never see it when running in production, you just skip over it. So better to print an Error or similar, even if then the default behaviour is to ignore it.

As I said, they have their use. I use them in the RICH for the very purpose you mention, i.e. I use the debug builds myself when developing and have scattered the code with a bunch of checks I want to have when I run locally myself, but I do not want to limit the performance in the released optimized builds and nor do I need them to be checked there. Using asserts is perfect for this as I don't have to constantly comment them in/out when committing. I can just leave them in and know the opt builds ignore them.

But yes, you have to be careful how you use them. They shouldn't be used to validate some condition you expect 'to never happen' in the data processing loop, for instance.

@satta @samarian @jonrob could we please proceed with a pragmatic fix here ASAP? It is now extremely important to have stable versions of all subdetector decoding algorithms so we can really begin global reconstruction studies with data.

I am on holidays with poor network connection. I can work on it starting from the 17th

@gligorov fixing the assert was a trivial issue, if thats all the problem was I would have done it myself a while back. The underlying reason why the assert was being hit needs to be investigated, and that needs a muon system expert.

can I suggest that we pragmatically fix the assert first at this point, and implement @samarian proposed fix from above?

If @satta agrees and then someone (@samarian ?) opens a MR with the changes sure.

For me remove the assert Is ok. Then I will implement sonething at least to spot the fact that the data are corrupted.

Absolutely, a warning/error message with some max number of printouts before it goes silent is perfect from a datataking point of view here. Then what do you think about the proposed fix to the second issue which was observed?

Actually the assert was fixed at !3726 (diffs) (it actual condition was not correct and there was no underlying issue).

On the usefulness of asserts, I can just add that for commissioning we've been running (almost?) all monitoring tasks from -dbg builds and that has been really helpful (for when we need to attach a debugger). When we hit an assert, it is annoying but just going over means we might be ignoring some underlying issue.

If you feel asserts are useful by all means keep them, but shouldn't they come with an additional WARNING/ERROR message for the opt builds?

we can probably convert almost all asserts to exceptions without much loss in performance. (or easier to do, but slightly less useful (no stack trace online) => enable asserts in opt builds)

Sometimes the assert wraps some slightly less trivial computation though, like in https://gitlab.cern.ch/lhcb/LHCb/-/blob/f9964c0a98f69620370b5229646d78b94ceb6e7e/FT/FTDAQ/src/FTRawBankDecoder.cpp#L649 where there's a loop over a container.

@gligorov No. If I am using an assert it is specifically because I want there to be absolutely nothing in the opt builds. it would also require a lot of ugly preprocessor checks to do something different depending on if NDEBUG is set or not.

So no, its either

The author knows what they are doing and specifically only wants this cross check in debug builds, in which case the trivial

assert( some_condition );

is the way to go. or...

The author wants a runtime check in all builds, in which case they use

if ( !some_condition ) { throw ....; }

its one or the other, not a horrible hybrid.

OK I see, thank you. I let you sort out what is appropriate for this case.

In this case an assert is fine, as as @rmatev pointed out the issue was the condition specification itself had a bug.

assigned to @satta

mentioned in merge request !3727 (closed)

mentioned in issue Moore#443 (closed)

mentioned in merge request !3736 (merged)

mentioned in commit Allen@aa4d7a51

mentioned in merge request Allen!984 (merged)

mentioned in commit Allen@0d5c14a3

mentioned in commit Allen@c351f9dc

Muon Decoding Errors and assert break

Child items 0

Activity