Unpack_Hlt2__Event_HLT2_Rec_Trac... WARNING Attempted out-of-range access to packed LHCbIDs. This is bad, do not ignore !
TUnixSystem::DispatchSignals FATAL segmentation violation
Unpack_Hlt2__Event_HLT2_Rec_Trac... FATAL Standard std::exception is caught
Unpack_Hlt2__Event_HLT2_Rec_Trac... ERROR vector::reserve
HLTControlFlowMgr FATAL Event failed in Node TrackUnpacker/Unpack_Hlt2__Event_HLT2_Rec_Track_Velo
Need a way to isolate these online and investigate at the pit.
Run a test unpacking of the raw banks during Hlt2 and stream off events to an error stream?
The issues largely seem to be with track unpacking, which can lead to seg faults.
Can we turn these into a handled single event error?
'The issues largely seem to be with track unpacking' - Just to be clear, most issues are not of this kind. There is, though, a class of errors listed at lhcb-dpa/prod-requests#86 (comment 8631813) which result in a seg. fault and which come from the track unpacking (or more precisely LHCb::FTLiteCluster::setLiteCluster). Whatever the reason for this, the seg. fault should be squashed so that at worst these become another example of 'single event failures' which can be handled and skipped if need be.
So, I have found the source of the seg faults in the track unpacking relating to the packed clusters. Basically, when support for these was added, no sanity protection on the indices was added with it, and in this case the corruption has caused them to be invalid. See
The event in question then takes a very long time to process, with the memory usage eventually hitting 40G (which in itself probably renders the file unprocessable offline). Finally the task dies with
#10 LHCb::TrackPacker::convertFTLiteCluster (this=<optimized out>, packed_cluster=..., track=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:205
#11 0x00007fa3e184d9c3 in operator() (__closure=__closureentry=0x7fa3ce223100, cluster=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:362
#12 0x00007fa3e184da20 in std::for_each<__gnu_cxx::__normal_iterator<const LHCb::PackedFTLiteCluster*, std::vector<LHCb::PackedFTLiteCluster> >, LHCb::TrackPacker::unpack(const PackedData&, Data&, const PackedDataVector&, DataVector&) const::<lambda(const LHCb::PackedFTLiteCluster&)> >(__gnu_cxx::__normal_iterator<LHCb::PackedFTLiteCluster const*, std::vector<LHCb::PackedFTLiteCluster, std::allocator<LHCb::PackedFTLiteCluster> > >, __gnu_cxx::__normal_iterator<LHCb::PackedFTLiteCluster const*, std::vector<LHCb::PackedFTLiteCluster, std::allocator<LHCb::PackedFTLiteCluster> > >, struct {...}) (__first=<error reading variable: Cannot access memory at address 0x7fa1de5a3000>, __last=__lastentry=..., __f=...) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/include/c++/13.1.0/bits/stl_algo.h:3833
#13 0x00007fa3e184e23b in LHCb::TrackPacker::unpack (this=thisentry=0x7fa3ce223350, ptrack=..., track=..., ptracks=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:359
#14 0x00007fa3e184e79b in LHCb::TrackPacker::unpack (this=thisentry=0x7fa3ce223350, ptracks=..., tracks=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:463
#15 0x00007fa3e184e7fd in LHCb::unpack (parent=<optimized out>, in=..., out=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:727
#16 0x00007fa3d5b29e7f in LHCb::Hlt::PackedData::restoreObject<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > (buffer=..., header=..., loader=...) at ../Event/EventPacker/src/component/BufferUnpackerBaseAlg.h:305
#17 0x00007fa3d5bc54f5 in LHCb::Hlt::PackedData::restoreObject<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > (buffer=..., loader=...) at ../Event/EventPacker/src/component/BufferUnpackerBaseAlg.h:319
#18 0x00007fa3d5bc5d3c in DataPacking::Buffer::Unpack<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > >::operator() (this=thisentry=0xfe7b480, buffers=...) at ../Event/EventPacker/src/component/BufferUnpackerBaseAlg.h:422
OK, found the issue. The problem was actually in the sanity check already there for the LHCbIDs, which I reused, and which did not handle the case where the first index is < 0 (yes, negative, which can happen if what you are reading is garbage...).
Don't worry, our messages crossed. The issue was due to the corruption having first indices with huge negative values, and the protection did not catch this (it just required first <= last). This resulted in an insane number of clusters being 'decoded'. Like 1.8 billion...
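For illustration, a minimal sketch of the kind of range check involved (the field and helper names here are hypothetical, not the actual EventPacker code): the point is to compare signed values and to validate both ends of the range against the size of the packed cluster container before reserving or looping.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical packed-cluster payload: [first, last) indices into a flat
// vector of packed FT clusters stored alongside the track.
struct PackedClusterRange {
  std::int64_t first{0};
  std::int64_t last{0};
};

// Reject negative indices, inverted ranges and ranges that run past the end
// of the container, instead of blindly reserving (last - first) entries.
inline void checkClusterRange( const PackedClusterRange& r, std::size_t nPackedClusters ) {
  if ( r.first < 0 || r.last < r.first || static_cast<std::size_t>( r.last ) > nPackedClusters ) {
    // In the real framework this would become a clean per-event error/abort;
    // throwing here is just for the sketch.
    throw std::runtime_error( "Corrupt packed FT cluster indices" );
  }
}
```

A check of this shape catches both the huge negative first index and the inverted/oversized range that led to the 1.8-billion-entry reserve.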
See my message below. The only thing I can say is that there is some corruption in the data, as far as the first/last indices for the packed clusters are concerned. These have insane values, hence the need for the protection in the MR. Where the corruption first starts still needs to be found...
Just scanning the reasons there, I don't think the issue is only with the tracks, so my bet is the corruption is deeper and manifests in various ways. But until it's found, it's anyone's guess.
Could we look at just the RecSummary in these events, which should contain counters of the numbers of tracks and hits? Maybe this will give us a clue as to whether we are hitting some limit...
You could try, but I would be surprised if that was the cause of all the issues in the above link. Also, even if the reco hit some limit, the packing should never create garbage banks, so either there is a problem in the packing, or in the way the events are processed when some limit is hit, or the corruption is happening even later than that somehow…
Honestly, the only way I can see us making real progress here is for someone to reproduce the corruption as it occurs in HLT2, running offline. Until then, just looking at the aftermath, in terms of what happens when the data is read back in, is really limited in what it can tell us.
Would it help if we had any of these events in turcal files, i.e. ones with the tracker raw banks? I'm wondering if we have any case where it's just the DstData that is 'a bit corrupted', such that one could run HLT2 over it again.
@mveghel @ldufour Given that we have identified the track clusters as one place where corruption is present in the persistent data, I think we should look into how these packed data are formed in the first place, to see if there is some issue there or not. We have to start somewhere to address the root cause of the problems here...
Thanks, the more eyes here the better. I would suggest running with !4730 (merged), as that should clear up some issues (in so far as it will capture the error and cause a clean event abort).
What would be great is if someone could try to rerun HLT2 on some offline data that has one of these issues, and see if we can reproduce whatever happened there. Note the data above is for the full stream, so it might not have the required banks, but lhcb-dpa/prod-requests#88 (closed) lists some similar issues for turcal data.
Note that a monitoring algorithm could be written which, immediately after the packing, would verify the internal consistency of the packed data by unpacking everything -- that would allow one to generate error messages earlier in the chain, i.e. it would distinguish between bugs in the packing itself vs corruption generated 'after' the packing, during the 'transport' of the DstData bank.
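A minimal sketch of that round-trip check (everything here is hypothetical and simplified; the real packers live in EventPacker and are not reproduced): pack the freshly reconstructed objects, immediately unpack the result into a scratch copy, and compare against the originals before the DstData bank ever leaves the process.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a reconstructed object; only the hit list is compared here.
struct Track { std::vector<unsigned int> lhcbIDs; };

// Round-trip consistency check, meant to run immediately after packing.
// 'packFn' and 'unpackFn' stand for the real pack/unpack calls.  If the
// unpacked copy does not match the input, the packing itself is broken;
// if it matches here but the data read back offline is garbage, the
// corruption happened later, during transport of the DstData bank.
template <typename PackFn, typename UnpackFn>
bool packingIsSelfConsistent( const std::vector<Track>& tracks, PackFn packFn, UnpackFn unpackFn ) {
  const auto packed   = packFn( tracks );
  const auto restored = unpackFn( packed );
  if ( restored.size() != tracks.size() ) return false;
  for ( std::size_t i = 0; i < tracks.size(); ++i ) {
    if ( restored[i].lhcbIDs != tracks[i].lhcbIDs ) return false;
  }
  return true;
}
```

In practice such a check would live in a separate monitoring algorithm, so it could be enabled only when chasing this problem, given the extra CPU cost.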
One could start from UnpackDstDataBank, which unpacks 'everything' contained in the DstData bank. Unfortunately it won't work 'out of the box', as-is, as it will end up trying to unpack everything into TES locations (most of) which are already populated. But surely a variation could be created which 'shunts' all the TES containers to-be-unpacked into a separate branch of the TES, i.e. some configured prefix is added to all the non-anonymous TES locations, so as to avoid clashing with the input. This could be done by tweaking the decoder -- i.e. the bit of code that uses the ANNSvc to map ids <-> TES locations -- used by the loader instance returned by loader_for, which is called from here by UnpackDstDataBank, and which is conveniently distinct from the 'individual' decoding algorithms which call it from here instead. It will create some overhead, but may help in tracking down the problem.
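To make the 'shunting' idea concrete, a small sketch of the TES-location remapping it would need (the function and prefix names are made up for illustration; the real mapping goes through the ANNSvc-based decoder mentioned above):

```cpp
#include <string>

// Remap a decoded TES location into a separate debug branch so that the
// re-unpacked containers do not clash with the already-populated input
// locations.  Anonymous (empty) locations are left untouched.
std::string shuntToDebugBranch( const std::string& location,
                                const std::string& prefix = "/Event/DebugUnpack" ) {
  if ( location.empty() ) return location;  // anonymous location: leave as-is
  // Locations typically start with "/Event/"; insert the prefix after it.
  const std::string eventRoot = "/Event/";
  if ( location.rfind( eventRoot, 0 ) == 0 ) {
    return prefix + "/" + location.substr( eventRoot.size() );
  }
  return prefix + "/" + location;
}
```

Applied to every non-anonymous location returned by the id <-> TES mapping, this would let the UnpackDstDataBank variant re-unpack the whole bank alongside the original containers, at the cost of the extra memory and CPU already mentioned.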