Unpack_Hlt2__Event_HLT2_Rec_Trac... WARNING Attempted out-of-range access to packed LHCbIDs. This is bad, do not ignore !
TUnixSystem::DispatchSignals FATAL segmentation violation
Unpack_Hlt2__Event_HLT2_Rec_Trac... FATAL Standard std::exception is caught
Unpack_Hlt2__Event_HLT2_Rec_Trac... ERROR vector::reserve
HLTControlFlowMgr FATAL Event failed in Node TrackUnpacker/Unpack_Hlt2__Event_HLT2_Rec_Track_Velo
Need a way to isolate these online and investigate at the pit.
Run a test unpacking of the raw banks during Hlt2 and stream off events to an error stream?
The issues largely seem to be with track unpacking, which can lead to seg faults.
Can we turn these into a handled single event error?
'The issues largely seem to be with track unpacking' - Just to be clear, most issues are not of this kind. There is, though, a class of errors listed at lhcb-dpa/prod-requests#86 (comment 8631813) which result in a seg. fault and which come from the track unpacking (or more precisely LHCb::FTLiteCluster::setLiteCluster). Whatever the reason for this, the seg. fault should be squashed so that at worst these become another example of 'single event failures' which can be handled and skipped if need be.
So, I have found the source of the seg faults in the track unpacking relating to the packed clusters. Basically, when support for these was added, no sanity protection on the indices was added with it, and in this case the corruption has caused them to be invalid. See
The event in question then takes a very long time to process, with the memory usage eventually hitting 40G (which in itself probably renders the file unprocessable offline). Finally the task dies with
#10 LHCb::TrackPacker::convertFTLiteCluster (this=<optimized out>, packed_cluster=..., track=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:205
#11 0x00007fa3e184d9c3 in operator() (__closure=__closureentry=0x7fa3ce223100, cluster=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:362
#12 0x00007fa3e184da20 in std::for_each<__gnu_cxx::__normal_iterator<const LHCb::PackedFTLiteCluster*, std::vector<LHCb::PackedFTLiteCluster> >, LHCb::TrackPacker::unpack(const PackedData&, Data&, const PackedDataVector&, DataVector&) const::<lambda(const LHCb::PackedFTLiteCluster&)> >(__gnu_cxx::__normal_iterator<LHCb::PackedFTLiteCluster const*, std::vector<LHCb::PackedFTLiteCluster, std::allocator<LHCb::PackedFTLiteCluster> > >, __gnu_cxx::__normal_iterator<LHCb::PackedFTLiteCluster const*, std::vector<LHCb::PackedFTLiteCluster, std::allocator<LHCb::PackedFTLiteCluster> > >, struct {...}) (__first=<error reading variable: Cannot access memory at address 0x7fa1de5a3000>, __last=__lastentry=..., __f=...) at /cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/13.1.0-b3d18/x86_64-el9/include/c++/13.1.0/bits/stl_algo.h:3833
#13 0x00007fa3e184e23b in LHCb::TrackPacker::unpack (this=thisentry=0x7fa3ce223350, ptrack=..., track=..., ptracks=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:359
#14 0x00007fa3e184e79b in LHCb::TrackPacker::unpack (this=thisentry=0x7fa3ce223350, ptracks=..., tracks=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:463
#15 0x00007fa3e184e7fd in LHCb::unpack (parent=<optimized out>, in=..., out=...) at ../Event/EventPacker/src/lib/PackedTrack.cpp:727
#16 0x00007fa3d5b29e7f in LHCb::Hlt::PackedData::restoreObject<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > (buffer=..., header=..., loader=...) at ../Event/EventPacker/src/component/BufferUnpackerBaseAlg.h:305
#17 0x00007fa3d5bc54f5 in LHCb::Hlt::PackedData::restoreObject<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > > (buffer=..., loader=...) at ../Event/EventPacker/src/component/BufferUnpackerBaseAlg.h:319
#18 0x00007fa3d5bc5d3c in DataPacking::Buffer::Unpack<KeyedContainer<LHCb::Event::v1::Track, Containers::KeyedObjectManager<Containers::hashmap> > >::operator() (this=thisentry=0xfe7b480, buffers=...) at ../Event/EventPacker/src/component/BufferUnpackerBaseAlg.h:422
OK, found the issue. The problem was actually in the sanity check already there for the LHCbIDs, which I reused, and which did not handle the case where the first index is < 0 (yes, negative, which can happen if what you are reading is garbage...).
Don't worry, our messages crossed. The issue was due to the corruption having first indices with huge negative values, and the protection did not catch this (it just required first <= last). This resulted in an insane number of clusters being 'decoded'. Like 1.8 billion...
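For illustration, a minimal sketch of the kind of range check involved (the field and helper names here are hypothetical, not the actual EventPacker code): the point is to compare signed values and to validate both ends of the range against the size of the packed cluster container before reserving or looping.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical packed-cluster payload: [first, last) indices into a flat
// vector of packed FT clusters stored alongside the track.
struct PackedClusterRange {
  std::int64_t first{0};
  std::int64_t last{0};
};

// Reject negative indices, inverted ranges and ranges that run past the end
// of the container, instead of blindly reserving (last - first) entries.
inline void checkClusterRange( const PackedClusterRange& r, std::size_t nPackedClusters ) {
  if ( r.first < 0 || r.last < r.first || static_cast<std::size_t>( r.last ) > nPackedClusters ) {
    // In the real framework this would become a clean per-event error/abort;
    // throwing here is just for the sketch.
    throw std::runtime_error( "Corrupt packed FT cluster indices" );
  }
}
```

A check of this shape catches both the huge negative first index and the inverted/oversized range that led to the 1.8-billion-entry reserve.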
See my message below. The only thing I can say is that there is some corruption in the data, as far as the first/last indices for the packed clusters are concerned. These have insane values, hence the need for the protection in the MR. Where the corruption first starts still needs to be found...
Just scanning the reasons there, I don't think the issue is only with the tracks, so my bet is the corruption is deeper and manifests in various ways. But until it's found, it's anyone's guess.
Could we look at just the RecSummary in these events, which should contain counters of the numbers of tracks and hits? Maybe this will give us a clue as to whether we are hitting some limit...
You could try, but I would be surprised if that was the cause of all the issues in the above link. Also, even if the reco hit some limit, the packing should never create garbage banks, so either there is a problem in the packing, or in the way the events are processed when some limit is hit, or the corruption is happening even later than that somehow…
Honestly, the only way I can see us making real progress here is for someone to reproduce the corruption as it occurs in HLT2, running offline. Until then, just looking at the aftermath, in terms of what happens when the data is read back in, is really limited in what it can tell us.
Would it help if we had any of these events in turcal files, i.e. ones with the tracker raw banks? I'm wondering if we have any case where it's just the DstData that is 'a bit corrupted', such that one could run HLT2 over it again.
@mveghel @ldufour Given that we have identified the track clusters as one place where corruption is present in the persistent data, I think we should look into how these packed data are formed in the first place, to see if there is some issue there or not. We have to start somewhere to address the root cause of the problems here...
Thanks, the more eyes here the better. I would suggest running with !4730 (merged), as that should clear up some issues (in so far as it will capture the error and cause a clean event abort).
What would be great is if someone could try to rerun HLT2 on some offline data that has one of these issues, and see if we can reproduce whatever happened there. Note the data above is for the full stream, so it might not have the required banks, but lhcb-dpa/prod-requests#88 (closed) lists some similar issues for turcal data.
Note that a monitoring algorithm could be written which, immediately after the packing, would verify the internal consistency of the packed data by unpacking everything -- that would allow one to generate error messages earlier in the chain, i.e. it would distinguish between bugs in the packing itself vs corruption generated 'after' the packing, during the 'transport' of the DstData bank.
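A minimal sketch of that round-trip check (everything here is hypothetical and simplified; the real packers live in EventPacker and are not reproduced): pack the freshly reconstructed objects, immediately unpack the result into a scratch copy, and compare against the originals before the DstData bank ever leaves the process.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a reconstructed object; only the hit list is compared here.
struct Track { std::vector<unsigned int> lhcbIDs; };

// Round-trip consistency check, meant to run immediately after packing.
// 'packFn' and 'unpackFn' stand for the real pack/unpack calls.  If the
// unpacked copy does not match the input, the packing itself is broken;
// if it matches here but the data read back offline is garbage, the
// corruption happened later, during transport of the DstData bank.
template <typename PackFn, typename UnpackFn>
bool packingIsSelfConsistent( const std::vector<Track>& tracks, PackFn packFn, UnpackFn unpackFn ) {
  const auto packed   = packFn( tracks );
  const auto restored = unpackFn( packed );
  if ( restored.size() != tracks.size() ) return false;
  for ( std::size_t i = 0; i < tracks.size(); ++i ) {
    if ( restored[i].lhcbIDs != tracks[i].lhcbIDs ) return false;
  }
  return true;
}
```

In practice such a check would live in a separate monitoring algorithm, so it could be enabled only when chasing this problem, given the extra CPU cost.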
One could start from UnpackDstDataBank, which unpacks 'everything' contained in the DstData bank. Unfortunately it won't work 'out of the box', as-is, as it will end up trying to unpack everything into TES locations (most of) which are already populated. But surely a variation could be created which 'shunts' all the TES containers to-be-unpacked into a separate branch of the TES, i.e. some configured prefix is added to all the non-anonymous TES locations, so as to avoid clashing with the input. This could be done by tweaking the decoder -- i.e. the bit of code that uses the ANNSvc to map ids <-> TES locations -- used by the loader instance returned by loader_for, which is called from here by UnpackDstDataBank, and which is conveniently distinct from the 'individual' decoding algorithms which call it from here instead. It will create some overhead, but may help in tracking down the problem.
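To make the 'shunting' idea concrete, a small sketch of the TES-location remapping it would need (the function and prefix names are made up for illustration; the real mapping goes through the ANNSvc-based decoder mentioned above):

```cpp
#include <string>

// Remap a decoded TES location into a separate debug branch so that the
// re-unpacked containers do not clash with the already-populated input
// locations.  Anonymous (empty) locations are left untouched.
std::string shuntToDebugBranch( const std::string& location,
                                const std::string& prefix = "/Event/DebugUnpack" ) {
  if ( location.empty() ) return location;  // anonymous location: leave as-is
  // Locations typically start with "/Event/"; insert the prefix after it.
  const std::string eventRoot = "/Event/";
  if ( location.rfind( eventRoot, 0 ) == 0 ) {
    return prefix + "/" + location.substr( eventRoot.size() );
  }
  return prefix + "/" + location;
}
```

Applied to every non-anonymous location returned by the id <-> TES mapping, this would let the UnpackDstDataBank variant re-unpack the whole bank alongside the original containers, at the cost of the extra memory and CPU already mentioned.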