In the dst file I get after successfully running in Moore the three HLT2 lines I attach, I see that DaVinci ends the event loop because it finds an incomplete event, giving the error
HltPackedDataDe... ERROR Failed to decompress HltPackedData.
HltPackedDataDe... ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm HltPackedDataDecoder failed
UnpackTracks     ERROR Cannot retrieve /Event/HLT2/pRec/Track/Best from transient store.
UnpackTracks     ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackTracks failed
EventLoopMgr     ERROR Error processing event loop.
EventLoopMgr     ERROR Terminating event processing loop due to errors
EventLoopMgr     ERROR Terminating event processing loop due to errors
This doesn't happen always, just for some xdigi files, for example root://x509up_u133448@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00146968/0000/00146968_00000290_1.xdigi
It also doesn't happen when running the lines separately in Moore, only when running the three at the same time.
Here I attach the options scripts for Moore and DaVinci and the logs from running them. We used the released Moore v53r2 and DaVinci v54r1.
Hi @pherrero, we have seen similar error messages when testing the mdf output produced after running hlt2 generic lines, which might be related:
HltPackedDataDe... FATAL Standard std::exception is caught
HltPackedDataDe... ERROR PackedTracks packing version is not supported: 0
EventLoopMgr     FATAL .executeEvent(): Standard std::exception thrown by HltPackedDataDecoder
EventLoopMgr     ERROR PackedTracks packing version is not supported: 0
EventLoopMgr     WARNING Execution of algorithm HltPackedDataDecoder failed
UnpackTracks     ERROR Cannot retrieve /Event/HLT2/pRec/Track/Best from transient store.
UnpackTracks     ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackTracks failed
UnpackPVs        ERROR Cannot retrieve /Event/HLT2/pRec/Vertex/Primary from transient store.
UnpackPVs        ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackPVs failed
UnpackChargedPr... ERROR Cannot retrieve /Event/HLT2/pRec/ProtoP/Charged from transient store.
UnpackChargedPr... ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackChargedProtos failed
UnpackNeutralPr... ERROR Cannot retrieve /Event/HLT2/pRec/ProtoP/Neutrals from transient store.
UnpackNeutralPr... ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackNeutralProtos failed
UnpackParticles... ERROR UnpackParticlesAndVertices:: Unpack('/Event/HLT2/pPhys/Relations'): missing destination '/Event/HLT2/Rec/Vertex/Primary', skip link StatusCode=FAILURE
EventLoopMgr     ERROR Error processing event loop.
EventLoopMgr     ERROR Terminating event processing loop due to errors
EventLoopMgr     ERROR Terminating event processing loop due to errors
To reproduce my error messages one just needs to copy the following scripts (reproducer.py and test_output.py) into the Moore folder and run the following reproducer.sh:
These error messages are not reproducible with an old version of the stack (e.g. from 21/10), so some MR merged between 21/10 and 10/11 (my current stack date) must have broken something...
I would need a bit more understanding of the context. Am I correct that:
you first run Moore, always on the same file, but potentially with different lines
you store the output to a DST
you run DaVinci on this DST
depending on the lines in Moore, DaVinci fails or not
the error mentions an incomplete event, meaning some pieces are missing in the DST file?
If that's all true, I agree with @sesen that functional (un)packers are a potential suspect.
But in that case the first thing to do is to compare the two cases and identify the missing piece of data. We can then check whether it's present in the file or not, and find out why it was not packed into the file, or not unpacked if it happens to be in the file.
The error posted by Paula was seen with the released Moore version, which was released before LHCb!3192 (merged) was merged in. So I would expect it is not related to that.
Yes, the test was performed on a master from 10/11 and the problem didn't exist in the released version, while for DST output it seems to be the opposite. So before closing we should test today's master with mdf output. @fsouzade is already on it.
I don't see that problem with that line anymore, but there is a similar error I still get with the current master when using the JpsiKst line I attach.
I found this issue already some time ago when using Moore v52r2, and in that case it corresponded to events that raised this error in the Moore log: HltPackedDataWriter ERROR HltPackedDataWriter:: Packed objects too long to save. Even though this error was raised, the Moore job continued and finished successfully. This error is no longer raised when running Moore with the current master, but DaVinci still fails when arriving at that event with the message
HltPackedDataDe... ERROR Failed to decompress HltPackedData.
HltPackedDataDe... ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm HltPackedDataDecoder failed
UnpackTracks     ERROR Cannot retrieve /Event/HLT2/pRec/Track/Best from transient store.
UnpackTracks     ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackTracks failed
EventLoopMgr     ERROR Error processing event loop.
EventLoopMgr     ERROR Terminating event processing loop due to errors
EventLoopMgr     ERROR Terminating event processing loop due to errors
Even with an updated stack I still see the same error messages I quoted before, some of them very similar to the ones @pherrero has. But in my case the problem seems related to the line Hlt2PhotonDecision, as these messages don't appear if I use another line for the turbo stream. Here is a simple reproducer:
My guess is that something goes wrong when serializing PackedTracks with compression, and then HltPackedDataDecoder gets the wrong packing version from the header. If I am right, this can easily be seen in debug mode for HltPackedDataWriter.
So the point is indeed that the version found by the unpacker is 0 while it's expecting 4 or 5. Is that first point correct? Do we indeed expect 4 or 5? I'll start to check the packer code, but I would not be so surprised if it had always been 0 and simply was not noticed so far; in other words, the bug was there forever but is only seen now that the code was cleaned up. Is there a way to check this hypothesis?
And I have a first potential explanation, a bug along the lines of a hidden error that only shows up now: could it be that this whole Pack/Unpack process was just void, that is, there was nothing at all to pack/unpack?
That might explain things, because the old packers were not complaining in this case and did nothing, so there was no data at all in the file. And the old unpacker was similarly doing nothing and silently pretending it had unpacked, while there was nothing to do.
The new ones are doing their job. So the new packer is creating an empty output, empty but existing, and the unpacker will expect it to exist, even if empty. So imagine we somehow still put nothing in the file in the case of empty data: the new unpacker would behave exactly as we see.
Wait a minute, I've just noticed the previous message in the log, just before the "unsupported version" one:
TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...
Well, if you get an empty bank and still try to read an object from it, there are indeed high chances things go wrong. And indeed the first byte read is the version, and it will most probably be 0. Can you also check that part?
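For illustration (a hypothetical helper, not the real HltPackedDataDecoder code): if the first byte of an empty or zeroed payload is blindly interpreted as the packing version, the reported version is 0, which matches the error above:

```cpp
// Illustration only: hypothetical version read, not the actual decoder.
#include <cstdint>
#include <iostream>
#include <vector>

// Interpret the first byte of the payload as the packing version; an empty payload gives 0.
std::uint8_t packingVersion( const std::vector<std::uint8_t>& payload ) {
  return payload.empty() ? 0 : payload.front();
}

int main() {
  std::vector<std::uint8_t> emptyBank;                // nothing was packed for this event
  std::vector<std::uint8_t> normalBank{ 5, 42, 17 };  // first byte carries the version
  std::cout << "empty bank  -> version " << int( packingVersion( emptyBank ) ) << '\n';   // prints 0
  std::cout << "normal bank -> version " << int( packingVersion( normalBank ) ) << '\n';  // prints 5
}
```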
Hi @sponce, thanks for looking into it. Regarding this message:
TCK obtained from rawbank seems to be 0 -- blindly ASSUMING that the current HltANNSvc somehow has the same configuration as when the input data was written. Proceed at your own risk, good luck...
Running the same reproducer2.sh with the Hlt2PhotonLine with an old stack (from 21/10), I still get the same message, but not the following error messages regarding PackedTracks:
HltPackedDataDe... FATAL Standard std::exception is caught
HltPackedDataDe... ERROR PackedTracks packing version is not supported: 0
EventLoopMgr     FATAL .executeEvent(): Standard std::exception thrown by HltPackedDataDecoder
EventLoopMgr     ERROR PackedTracks packing version is not supported: 0
EventLoopMgr     WARNING Execution of algorithm HltPackedDataDecoder failed
UnpackTracks     ERROR Cannot retrieve /Event/HLT2/pRec/Track/Best from transient store.
UnpackTracks     ERROR Maximum number of errors ( 'ErrorMax':1) reached.
There are other error messages which are the same, but the test finishes "successfully" in the old version. So I guess you're right that the problem could already have been there and the old packers/unpackers were just not complaining at all, making the test finish "successfully".
In any case I also ran Moore with HltPackedDataWriter in DEBUG mode as suggested by @sesen, and the logs are here:
Making a diff of the two files gives diff_reproducer2.log, which shows some differences in the number of compressed bytes, in the packed data checksum and in the length of raw banks with the same sourceID...
The message is usually a false alarm. At the moment the HltANNSvc is configured via a JSON file written when configuring the writing job and read when configuring the reading job. That's the "somehow".
Interesting. We seem to write fewer bytes, which would mean that the new packer writes less than the old one. I was actually expecting the opposite: the new one writing an empty object that the old one was not writing.
Concerning the TCK message, it's still worrying. It would be better to understand it, as it could be the source of the potentially unveiled problem (supposing it was there and unnoticed).
I'll try to reproduce and add some print statements to see what happens.
Finally found the time to check this. And found the bug :-) LHCb!3328 (merged) should fix it. Can you please try?
What happened is exactly what I supposed: we are in a completely degenerate case where packing and unpacking are called while they should not be. In such a case, we write an empty PackedTracks object to the file. Now it needs a version or we fail unpacking it, even if it's empty... And the new version did not set the version in such a degenerate case.
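As a conceptual sketch of the fix (made-up type and constant names, not the actual LHCb!3328 diff): the point is to set the packing version on the output unconditionally, even when there is nothing to pack:

```cpp
// Conceptual sketch with hypothetical types; not the real packer code.
#include <cstdint>
#include <vector>

struct Track {};
struct PackedTracks {
  std::uint8_t              version = 0;  // 0 is what the unpacker rejects as "unsupported"
  std::vector<std::uint8_t> data;
};

constexpr std::uint8_t currentPackingVersion = 5;  // assumed value, for illustration only

PackedTracks pack( const std::vector<Track>& tracks ) {
  PackedTracks out;
  out.version = currentPackingVersion;  // set the version even for an empty input,
                                        // otherwise the unpacker sees version 0 and throws
  for ( const auto& t : tracks ) {
    (void)t;  // ... serialize each track into out.data ...
  }
  return out;
}
```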
We really need to clean all this up and call what needs to be called properly; it's just a nightmare to maintain.
@sponce For the MR I am working on, I removed all the default locations for the un/packers so they will only be called when needed. It still needs some work to configure everything properly, but it's almost there.
Hi @sponce, @sesen, sorry for the late response. Now with LHCb!3328 (merged) I don't see those messages anymore, and the test is "successful" once again. Thank you! For the record, here are the logs in case anyone wants to see the number of compressed bytes, checksum and length of raw banks with the same sourceID etc. (and compare with the numbers from previous or future logs):
The original problem occurs again with the latest master: Moore runs without errors, and then DaVinci fails because of an incomplete event with
HltPackedDataDe... ERROR Failed to decompress HltPackedData.
HltPackedDataDe... ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm HltPackedDataDecoder failed
UnpackTracks     ERROR Cannot retrieve /Event/HLT2/pRec/Track/Best from transient store.
UnpackTracks     ERROR Maximum number of errors ( 'ErrorMax':1) reached.
EventLoopMgr     WARNING Execution of algorithm UnpackTracks failed
EventLoopMgr     ERROR Error processing event loop.
EventLoopMgr     ERROR Terminating event processing loop due to errors
EventLoopMgr     ERROR Terminating event processing loop due to errors
I have run Moore and DaVinci in debug mode and the logs can be found here: /afs/cern.ch/work/p/pherrero/public/logs.
To reproduce in DaVinci, all files are here: /afs/cern.ch/work/p/pherrero/public/files; for Moore, use /afs/cern.ch/work/p/pherrero/public/moore_options_bs.py with the line /afs/cern.ch/work/p/pherrero/public/Bs2JpsiPhi_loose.py
Are you sure you are on the latest LHCb? I see EventLoopMgr DEBUG Creating Top Algorithm UnpackTrackFunctional with name UnpackTracks in your log file. UnpackTrackFunctional is now just called UnpackTrack.
That's because I was using the released version of DaVinci to be able to use the TupleTools. I included the TupleTools in DaVinci from the latest master and it still fails; the new log in debug mode is again here: /afs/cern.ch/work/p/pherrero/public/logs. UnpackTrackFunctional is changed to UnpackTrack, but this was not the problem.
int data_size   = 0;
int output_size = 0;
if ( R__unzip_header( &data_size, (unsigned char*)data.data(), &output_size ) ) {
  // invalid header
  std::cout << "unzip_header_false" << std::endl;
  return false;
}
if ( static_cast<size_t>( data_size ) != data.size() ) {
  // recorded size is not the size we have been given
  std::cout << "header_size_false" << std::endl;
  return false;
}
std::cout << "header_size : data_size " << static_cast<size_t>( data_size ) << " data.size " << data.size() << std::endl;
std::cout << " output_size_1 " << output_size << std::endl;
m_buffer.resize( output_size );
int buffer_size = m_buffer.size();
R__unzip( &data_size, (unsigned char*)data.data(), &buffer_size, (unsigned char*)m_buffer.data(), &output_size );
std::cout << " output_size_2 " << output_size << " buffer size " << m_buffer.size() << std::endl;
if ( static_cast<size_t>( output_size ) != m_buffer.size() ) { std::cout << "data_size_false" << std::endl; }
// did we find exactly the amount of (uncompressed) data expected?
return static_cast<size_t>( output_size ) == m_buffer.size();
}
} // namespace LHCb::Hlt::PackedData
this is what I see for the event that makes it fail
From the (ROOT) code, it looks like the reason for the decompression failure should be given on stderr. Now I do not think stderr was captured in your log_DV.txt; can you check and tell me what you see there?
The stderr when running the job. When capturing the output, you're probably doing something like > log_DV.txt in order to capture stdout, but if you want to capture stderr as well, you need > log_DV.txt 2>&1
Weird. In the ROOT code, any return statement leading to 0 in output_size has a message to stderr. And you did activate it, as we see diffs at the top of the file. I fear I'm stuck on this; I have to reproduce it locally and attach a debugger. Let me try.
I could not reproduce it, as I cannot read the input file from root://x509up_u133448@ccxrootdlhcb.in2p3.fr//pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/MC/Upgrade/XDIGI/00146968/0000/00146968_00000290_1.xdigi. I probably do not have the proper credentials.
Oh I see. Now I'm not sure I even have a valid grid certificate! And here is what I get when trying:
> lhcb-proxy-init
[master.] Wed Nov 24 8:37:37
the directory /cvmfs/lhcb.cern.ch/lhcbdirac/versions/v10.3.2-x86_64-1637575535/etc/grid-security/vomsdir does not exist
zsh: exit 1     lhcb-proxy-init
Maybe you could copy the file to afs, with the others?
Thanks for the file. I can now successfully run your Moore reproducer. But that's the problem: successfully! I have no error with my setup. I'm running with the head of master, so maybe the problem has been solved recently?
This is a bit complicated, as DecayTreeTuple was removed from master recently. To use it, we tried to make MooreAnalysis a dependency of DaVinci. You could check out the peilian_dtt branches (same name) in both DaVinci and MooreAnalysis; then you would be able to run DaVinci with DecayTreeTuple.
With these branches, I cannot even configure DaVinci. Somehow it cannot find MooreAnalysis, although I have it locally and it compiled fine. So I'm still not able to reproduce.
Weird, I just tested and it compiled for me successfully. Do you have the whole stack? (I assume you do.)
The branch peilian_dtt in DaVinci only modifies two files, please see the diff below (maybe you could just modify your master DaVinci):
I've finally managed to compile. Not sure what the problem was... But now on to the next issue:
ToolSvc      ERROR Cannot create tool TupleToolTrackInfo (No factory found)
TuplePhi     FATAL DecayTreeTuple:: Exception throw: tool():: Could not retrieve Tool 'TupleToolTrackInfo' StatusCode=FAILURE
TuplePhi     FATAL Exception with tag= is caught
TuplePhi     ERROR TuplePhi:: tool():: Could not retrieve Tool 'TupleToolTrackInfo' StatusCode=FAILURE
EventLoopMgr FATAL .executeEvent(): Exception with tag= thrown by TuplePhi
EventLoopMgr ERROR TuplePhi:: tool():: Could not retrieve Tool 'TupleToolTrackInfo' StatusCode=FAILURE
Just comment it out in the three places it appears in the DaVinci script, lines 99, 124 and 135. It also needs to be added to master, but that's not relevant for this problem.
Thanks! That was the last issue. When trying to get Moore working I had changed the number of events so that it ran faster, and it did not include the problematic one. Pfff... I hope the debugging will be faster than reproducing it!
I did not have time to completely solve this puzzle, but I've made good progress and maybe you'll be able to give me some helpful context.
What I found is that the problem is in the packing step, as expected, but more precisely in the zipping itself. To make a long story short, we are trying to zip a piece of data that is bigger than 0xffffff bytes (that is, roughly 16.7 MB) with an algorithm (ZSTD?) which seems to only support that maximum size. The net result is that the size of the original data is truncated in the compressed stream, preventing the unzipping from working properly.
Does that ring any bell? It seems that there are at least two flaws here: we do not check the size of what we send to ROOT for compression, and ROOT does not check either! Now I'm surprised we discover this only now. Something must have changed somewhere, although I doubt it's related to the packer changes.
I'll dig further tomorrow morning, but any input is welcome.
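For what it's worth, the check on our side could be as simple as the following sketch (hypothetical helper, not the actual HltPackedDataWriter code):

```cpp
// Sketch of a possible size guard before handing data to ROOT for compression.
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// ROOT's compression block header stores the deflated and inflated sizes on 3 bytes each,
// so a single block cannot faithfully describe more than 0xffffff (~16.7 MB) of data.
constexpr std::size_t kRootZipMaxBlock = 0xffffff;

void checkCompressible( const std::vector<char>& buffer ) {
  if ( buffer.size() > kRootZipMaxBlock )
    throw std::length_error( "packed object too large for a single ROOT compression block: " +
                             std::to_string( buffer.size() ) + " bytes" );
}
```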
We did see a similar problem a long time ago, which is the case mentioned here and was reported in Upgrade HLT2 as well (unfortunately we were only told to check our selection to reduce the data size; no one looked into the real issue). In that case, Moore gave the ERROR "HltPackedDataWriter ERROR HltPackedDataWriter:: Packed objects too long to save." but didn't stop the process. Then when running DV, it gave an error about empty data. My guess was that the data was too big to be persisted due to the maximum size limitation, so it persisted nothing but didn't skip that event, and then DV broke on the empty event.
So the problem is not in ZSTD itself, but in the wrapping that ROOT has on top of it. See the code around lines 48-58 in https://github.com/root-project/root/blob/master/core/zstd/src/ZipZSTD.cxx. You'll see that ROOT adds a header of 9 bytes, namely ZS\1 as a magic number followed by 3 bytes for the deflated size and 3 bytes for the inflated size. There is no check that the original size (given as an int, so 4 bytes long) fits... And we do not check either! That's it.
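For reference, this is roughly what that header construction looks like (a simplified sketch of what ZipZSTD.cxx does, not a verbatim copy); note that only the low 24 bits of each size survive:

```cpp
// Simplified sketch of ROOT's 9-byte compression block header (ZS\1 + 3-byte sizes).
#include <array>
#include <cstdint>

std::array<unsigned char, 9> makeHeader( std::uint32_t deflatedSize, std::uint32_t inflatedSize ) {
  std::array<unsigned char, 9> h{};
  h[0] = 'Z'; h[1] = 'S'; h[2] = 1;        // "ZS\1" magic / method byte
  h[3] = deflatedSize & 0xff;              // only the low 24 bits of each size fit:
  h[4] = ( deflatedSize >> 8 ) & 0xff;     // e.g. a 17'000'000-byte payload is recorded
  h[5] = ( deflatedSize >> 16 ) & 0xff;    // as 222'784 bytes (17'000'000 & 0xffffff),
  h[6] = inflatedSize & 0xff;              // so the unzip step no longer matches the
  h[7] = ( inflatedSize >> 8 ) & 0xff;     // real buffer size
  h[8] = ( inflatedSize >> 16 ) & 0xff;
  return h;
}
```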
"Now remains the question of what we want to do for too big events"
'Too big events' will still need to be saved. No matter what the limit is, the real experiment will always find a way to exceed it. So I guess if this happens we will need to break the data up into smaller chunks that the zipper can handle, and handle each independently...
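As a rough sketch of what such chunking could look like (hypothetical interface; compressBlock stands in for the actual ROOT compression call):

```cpp
// Hypothetical chunking sketch; compressBlock() is a placeholder, not an existing function.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxBlock = 0xffffff;  // largest size a single ROOT compression block can describe

std::vector<char> compressBlock( const char* data, std::size_t size );  // placeholder declaration

std::vector<std::vector<char>> compressInChunks( const std::vector<char>& buffer ) {
  std::vector<std::vector<char>> blocks;
  for ( std::size_t offset = 0; offset < buffer.size(); offset += kMaxBlock ) {
    const std::size_t n = std::min( buffer.size() - offset, kMaxBlock );
    blocks.push_back( compressBlock( buffer.data() + offset, n ) );
  }
  return blocks;  // each block can be inflated independently and concatenated in order
}
```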
@sponce is the buffer size limit (0xffffff) a fundamental limitation of the underlying compression algorithms used (ZSTD, LZMA), or just an artificial limitation imposed by ROOT because it passes the buffer size as an int rather than something with a bigger range (say unsigned long long or std::size_t)? If the latter, then I think we should also push for a fix on the ROOT side, to remove (or extend) the limit, as having the current one forced on us when it is not a real limitation of the compression algorithm is non-ideal.
is the buffer size limit (0xffffff) a fundamental limitation of the underlying compression algorithms used (ZSTD, LZMA) or just an artificial limitation imposed by ROOT?
It's rather an artificial limitation due to ROOT. Or at least, ROOT limits it to 24 bits (so ~16 MB) while the underlying algorithm goes up to 32 bits (~4 GB), which would sort out our case. The problem is in the internal format ROOT creates in the byte stream; you can see the code here: https://github.com/root-project/root/blob/master/core/zstd/src/ZipZSTD.cxx#L50
OK, so the ROOT people do not want to change their protocol, as they want to stay forward compatible. A recipe for disaster in my opinion, as you have to deal with the accumulated set of mistakes of the past. But anyway, they won't help here.
I see two possibilities:
we give up and live with an error being correctly thrown in the future
we fix the protocol ourselves by recoding the ROOT method we are using; it's only a few lines and can be backward compatible (but not forward, that is, old code cannot read new files when big events are present, or at least not these big events) -- see the sketch below
Just tell me if it's worth investing a few hours in the second option.
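To make the second option a bit more concrete, here is one purely hypothetical shape such a replacement could take: keep the standard 9-byte ROOT header for small blocks and switch to full 4-byte sizes only for large ones, so existing files and small events remain readable by old code:

```cpp
// Purely hypothetical header layout to illustrate the second option; not an agreed format.
#include <cstdint>
#include <vector>

void append3( std::vector<unsigned char>& hdr, std::uint32_t s ) {
  for ( int shift : { 0, 8, 16 } ) hdr.push_back( static_cast<unsigned char>( ( s >> shift ) & 0xff ) );
}
void append4( std::vector<unsigned char>& hdr, std::uint32_t s ) {
  for ( int shift : { 0, 8, 16, 24 } ) hdr.push_back( static_cast<unsigned char>( ( s >> shift ) & 0xff ) );
}

void appendHeader( std::vector<unsigned char>& hdr, std::uint32_t deflated, std::uint32_t inflated ) {
  if ( deflated <= 0xffffff && inflated <= 0xffffff ) {
    // small block: the standard ROOT "ZS\1" header with 3-byte sizes, so old readers still work
    hdr.push_back( 'Z' ); hdr.push_back( 'S' ); hdr.push_back( 1 );
    append3( hdr, deflated );
    append3( hdr, inflated );
  } else {
    // big block: a different method byte and full 4-byte sizes (old code cannot read this)
    hdr.push_back( 'Z' ); hdr.push_back( 'S' ); hdr.push_back( 2 );
    append4( hdr, deflated );
    append4( hdr, inflated );
  }
}
```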
In my opinion we have already concluded that we cannot live with this limitation for PbPb data taking, even for purely reconstructed quantities (i.e. without selections and combinatorics), and therefore we need to fix it. I don't have an opinion on whether your proposed fix @sponce is viable long-term; maybe @jonrob @graven can comment.
Moore gave the ERROR "HltPackedDataWriter ERROR HltPackedDataWriter:: Packed objects too long to save." but didn't stop the process. Then when running DV, it gave an error about empty data. My guess was that the data was too big to be persisted due to the maximum size limitation, so it persisted nothing but didn't skip that event, and then DV broke on the empty event.
The above is correct -- the code in question (see here) does:
i.e. indeed, as the size becomes larger than approx. 255*64K = 16M (as SourceIDMasks::PartID = 0x00FF currently), it aborts without persisting anything. So indeed, currently this results in an output which just has no raw bank of type DstData. It would be better if, in case of error, instead of writing nothing, it explicitly wrote some 'error status' so that the downstream code could keep track of what happened 'upstream' without second-guessing -- i.e. just a trivial bank of a dedicated type flagging that it failed because it attempted to write X bytes of data...
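To make the arithmetic behind that limit explicit (a sketch; the constants are taken from the numbers quoted in this thread, not from the actual SourceIDMasks or raw-bank definitions):

```cpp
// Back-of-the-envelope version of the limit quoted above; values come from the discussion.
#include <cstddef>
#include <cstdint>
#include <iostream>

constexpr std::uint16_t kPartIDMask     = 0x00FF;            // SourceIDMasks::PartID (as quoted)
constexpr std::size_t   kMaxParts       = kPartIDMask;       // ~255 DstData bank parts
constexpr std::size_t   kMaxPartPayload = 64 * 1024;         // ~64 KiB per part (as quoted)
constexpr std::size_t   kMaxPackedBytes = kMaxParts * kMaxPartPayload;  // ~16.7 MB in total

int main() { std::cout << "max persistable packed size ~ " << kMaxPackedBytes << " bytes\n"; }
```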
Note that we could assign more bits to PartID, as there are reserved bits, but do we really want to try and write such huge events? It would be best (as mentioned above) to 'flag' these events as 'too much DstData' and then only pass them on (rate limited...) with the detector data received as input and not with the DstData added... (presumably one would be able to 'replay' the HLT on the input data and recreate the problem in order to debug it).