While trying to run reconstruction monitoring with Muon decoding included in the pit (background data: /hlt2/objects/LHCb/0000240740/Run_0000240740_20220803-114650-930_MAEB04.mdf), we saw the following break with x86_64_v2-centos7-gcc11-dbg build. This is also seen when testing with latest MC samples.
Running with x86_64_v2-centos7-gcc11-opt, after 140001 events another error appears:
MuonRawInUpgradeToHits ERROR virtual MuonHitContainer LHCb::MuonUpgrade::DAQ::RawToHits::operator()(const EventContext&, const LHCb::RawEvent&, const DeMuonDetector&, const LHCb::MuonUpgrade::ComputeTilePosition&) const : Muon bank is too short
MuonRawInUpgradeToHits ERROR Maximum number of errors ( 'ErrorMax':1) reached.
HLTControlFlowMgr FATAL Event failed in Node MuonMonitorAlg/MonitorMuonPID
MuonRawInUpgradeToHits ERROR virtual MuonHitContainer LHCb::MuonUpgrade::DAQ::RawToHits::operator()(const EventContext&, const LHCb::RawEvent&, const DeMuonDetector&, const LHCb::MuonUpgrade::ComputeTilePosition&) const : Muon bank is too short
MuonRawInUpgradeToHits ERROR Maximum number of errors ( 'ErrorMax':1) reached.
I do see the same by running on data, yes, which is expected considering that the Allen decoding is mostly ported from MuonRawInUpgradeToHits. I'll try to dig a bit, but let me also cc @bsciasci and @satta
First thing is to collect some MDF files from the pit, and make them available in EOS etc. Right now that might mean different ones for different sub-systems, as not everyone is in global yet... Or do we have some runs where everyone (other than RICH, sadly...) is available?
To be precise, it is not a segfault per se; it is an assert that fails, so somewhere there must be a logic error in the code.
Meaning, it runs without a crash in the opt build.
I see the same thing on MC (when running on new samples), so this is not data specific.
The errors in the opt builds should be ignored for now. Asserts are only tested in debug builds; in the opt builds they are skipped. So clearly what you see in the opt builds is just a consequence of whatever is causing the assert to fail, and it is this that the debugging should focus on.
I think there is a mix-up: it is quite likely these are two different issues, as on MC I see the condition for the assert failing all the time (I just implemented a check myself, running on the opt build), and yet the 100 events run through successfully.
Obviously there is still something wrong, but I doubt that whatever fires the assert also causes the crash after 140001 events.
I also had a look at the "Muon bank is too short" crash in data, and the reason seems to be that event 144316 is the first containing a bank with raw_bank->size() = 0. This happens for one sourceID in many events:
EventSelector SUCCESS Reading Event record 144316. Record number within stream 1: 144316
BANK SIZE 0
BANK SOURCE ID 28686
BANK TYPE 13
EventSelector SUCCESS Reading Event record 144317. Record number within stream 1: 144317
EventSelector SUCCESS Reading Event record 144318. Record number within stream 1: 144318
BANK SIZE 0
BANK SOURCE ID 28686
BANK TYPE 13
EventSelector SUCCESS Reading Event record 144319. Record number within stream 1: 144319
I added Michel's hack to print out more info when running on the real data in debug mode, so it seems to me that this line is to blame for the assert failure.
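To make the failure mode above concrete: a zero-size bank has no room even for the header, so any decoder that assumes a minimum length will trip. Below is a minimal, standalone sketch (not the actual RawToHits code; `RawBank`, `isDecodable` and the 4-byte minimum are illustrative assumptions) of guarding against such banks instead of asserting:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical minimal view of a raw bank: just a payload buffer.
struct RawBank {
  std::vector<std::uint8_t> payload;
  std::size_t size() const { return payload.size(); }
};

// Illustrative minimum: the thread notes the TELL40 always writes a
// header, so anything shorter than one header word cannot be decoded.
constexpr std::size_t kMinBankSize = 4;

// Returns true if the bank is safe to decode; empty or truncated banks
// are reported and skipped rather than crashing the application.
bool isDecodable(const RawBank& bank) {
  if (bank.size() < kMinBankSize) {
    std::fprintf(stderr, "Muon bank too short (%zu bytes), skipping\n",
                 bank.size());
    return false;
  }
  return true;
}
```

The key design point is that a corrupted bank degrades to a skipped bank plus a message, never to a terminated process.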
@samarian what you described seems suspiciously compatible with a hardware problem we are investigating: sometimes some data links suddenly send a high number of idle frames and cause a link de-synchronization. All subsequent events are tagged with a DAQ error type 0x5A by the TELL40, until a new synchronization is performed. I don't know if this could explain the null size of the raw bank. Anyway, the change you propose is fine, but maybe it would be better to generate a warning. Is it possible?
A segfault is of course never acceptable, so whatever the decoding needs to do to handle these errors gracefully, this should be done.
A warning/error is a good idea, but bear in mind that what might seem a good idea when running a single job locally might become an unmanageable stream of messages at the pit when running the full HLT, so be careful with such things.
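The flood concern above is usually handled by capping the number of printed messages. Gaudi's message service has such limits built in, but as a standalone, hedged sketch (class and method names are invented for illustration), the pattern looks like this:

```cpp
#include <atomic>
#include <cstdio>

// Sketch of a rate-limited warner: print at most `maxPrints` messages,
// then emit one suppression notice and go silent for the rest of the job.
class LimitedWarner {
  std::atomic<unsigned> m_count{0};
  const unsigned        m_maxPrints;

public:
  explicit LimitedWarner(unsigned maxPrints = 10) : m_maxPrints(maxPrints) {}

  // Returns true if the message was actually printed.
  bool warn(const char* msg) {
    const unsigned n = ++m_count;  // atomic: safe in a multithreaded HLT job
    if (n <= m_maxPrints) {
      std::fprintf(stderr, "WARNING: %s\n", msg);
      return true;
    }
    if (n == m_maxPrints + 1)
      std::fprintf(stderr, "WARNING: further such messages suppressed\n");
    return false;
  }
};
```

After the cap, the condition is still counted, so a summary can be printed at finalize without spamming the online logs.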
I wonder if the DAQErrorBanks could really explain the null size of the error banks, given that the bank type is reported as 13? DAQErrorBanks should be of bank type 89 or larger if I'm not mistaken.
Concerning the assert that stops the running on real data: we can clearly remove the assert and do something different, but this is something that should never happen... which is why the assert is there. In Run 1-2 there was a similar check and it never failed. Before changing it, I think we need to carefully understand what is happening. I checked a bit, and there are of the order of 10k events with this error. In the first half of them it happens only in one PCIe board; it then extends to 2 other boards, which is quite strange. In our TELL40 we have a header which is always written; in these corrupted banks the header is also missing. We need to understand whether it can be due to hardware upstream of the TELL40, to the TELL40 itself, or to the event builder.
@satta I see your point, but even if this "should never happen", an assert is not ideal from an operational point of view. So I would remove the assert immediately and then indeed try to understand what happens, so that you fix the cause and not only the symptom.
@jonrob @graven do you agree with my comment about asserts in general?
Asserts have a use, and I use them in the RICH, but you have to bear in mind two things with them:
They only apply in debug builds. In opt builds they are ignored.
If you hit them you immediately terminate the application.
For these reasons they are not the thing to use if you want to validate something in real data, even if you think it's a "this should never happen" case; as seen here, these do sometimes happen.
So yes, in this case something else would probably be appropriate.
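The two properties listed above follow directly from how `assert` is specified: the macro expands to nothing whenever `NDEBUG` is defined (which is the case in the opt builds), and a failing assert calls `abort()`. A minimal illustration (the function is invented for the example):

```cpp
// assert() only evaluates its condition when NDEBUG is NOT defined,
// i.e. in debug (dbg) builds; in opt builds the check vanishes entirely.
#include <cassert>

int decodeWordCount(int bankSizeBytes) {
  // Debug build: aborts the whole process if the size is corrupted.
  // Opt build: this line compiles to nothing and bad data sails through.
  assert(bankSizeBytes % 4 == 0 && "bank size must be a multiple of 4");
  return bankSizeBytes / 4;
}
```

This is exactly why the errors seen in the opt build are downstream symptoms: the opt build never stops at the check, it just keeps decoding.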
I have never understood the value of assert unless you are actively debugging a piece of code. As already said, you never see it when running in production; you just skip over it. So it is better to print an Error or similar, even if the default behaviour is then to ignore it.
As I said, they have their use. I use them in the RICH for the very purpose you mention, i.e. I use the debug builds myself when developing and have scattered the code with a bunch of checks I want to have when I run locally myself, but I do not want to limit the performance in the released optimized builds and nor do I need them to be checked there. Using asserts is perfect for this as I don't have to constantly comment them in/out when committing. I can just leave them in and know the opt builds ignore them.
But yes, you have to be careful how you use them. They shouldn't be used to validate some condition you expect 'to never happen' in the data processing loop, for instance.
@satta @samarian @jonrob could we please proceed with a pragmatic fix here ASAP? It is now extremely important to have stable versions of all subdetector decoding algorithms so we can really begin global reconstruction studies with data.
@gligorov fixing the assert was a trivial issue; if that's all the problem was, I would have done it myself a while back. The underlying reason why the assert was being hit needs to be investigated, and that needs a muon system expert.
Absolutely, a warning/error message with some maximum number of printouts before it goes silent is perfect from a data-taking point of view here. Then what do you think about the proposed fix to the second issue which was observed?
Actually the assert was fixed in !3726 (diffs): its actual condition was not correct, and there was no underlying issue.
On the usefulness of asserts, I can just add that for commissioning we've been running (almost?) all monitoring tasks from -dbg builds, and that has been really helpful (for when we need to attach a debugger). When we hit an assert it is annoying, but just skipping over it means we might be ignoring some underlying issue.
We can probably convert almost all asserts to exceptions without much loss in performance (or, easier to do but slightly less useful since there is no stack trace online, enable asserts in opt builds).
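The assert-to-exception conversion suggested above is mechanical. A hedged sketch (the `require` helper is invented for illustration, not an existing LHCb utility): the check runs in every build, and a framework that catches exceptions can fail just the offending event instead of terminating the process.

```cpp
#include <sstream>
#include <stdexcept>

// Replacement for assert(cond): the check is active in ALL builds,
// and a failure becomes a catchable exception rather than an abort().
inline void require(bool condition, const char* what) {
  if (!condition) {
    std::ostringstream msg;
    msg << "decoding check failed: " << what;
    throw std::runtime_error(msg.str());
  }
}
```

The trade-off named in the comment above holds: an exception costs a branch on the hot path but loses the immediate stack trace an assert gives you under a debugger.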
@gligorov No. If I am using an assert it is specifically because I want there to be absolutely nothing in the opt builds. It would also require a lot of ugly preprocessor checks to do something different depending on whether NDEBUG is set or not.
So no, it's either:
The author knows what they are doing and specifically only wants this cross check in debug builds, in which case the trivial
assert( some_condition );
is the way to go. Or...
The author wants a runtime check in all builds, in which case they use