Draft: Fix MM pileup overlay time distribution
A MM digitization parameter setting a higher time limit in the digitization was incorrectly set too small and caused significant problems in the MM pileup and overlay time distribution of the strips.
This upper time limit would assign strip times to 400ns for all strips with t>400ns causing a massive peak in the time distribution and would allow these out-of-time strips to be used in the reconstruction. Not yet understood, but thoroughly tested, is the behaviour that this upper time limit would also create a population of strips with an erroneous time of zero.
Increasing the higher time limit by 300ns addresses all these issues and now properly returns the expected strip time distribution from MM presampling and overlay jobs. Attached are two figures comparing the problematic time distribution (peaked at 0ns and 400ns) and the time distribution post-fix.
The issue of large memory usage in the reconstruction of overlayed samples is still being investigated but this MR will fix a crucial contributor.
However, for an unknown reason, increasing the upper time window also creates FPE warnings in some MM L1 trigger jobs. I will add comments and job commands in the comments below. I have tracked down where the FPE error stems from and I can offer some possible causes. The MM digi and trigger sim experts can take a better informed look.
Merge request reports
Activity
added Digitization MuonSpectrometer Overlay urgent labels
This should be a temporary fix, because the time window has to be reduced to fit inside a 8 bunch crossing window, i.e. 200 ns. This conflicts with the calibration interface currently in development, which assume a 8BC window. We need advice on how to proceed.
Edited by Chav Chhiv ChauFor some information on the new FPE warnings in HLT MM L1 trigger simulation jobs. We have not been able to reproduce the FPE warnings when using the value of WindowUpperOffset of 500 set in this MR. The warnings only appear when we make the value larger to 1000. We do not understand this and may point to an issue in the digitization.
The HLT trigger command is taken from (ATR-25352). The first instance of the FPE warnings occurred when running on ttbar+pileup RDOs produced following this command. I have also included an overlayed muon + pileup RDO file and HTL trigger job command in this shared directory: /afs/cern.ch/user/a/alaurier/public/FPEWarning
A snippet of the FPE warning is found here:
71743 RDOtoRDOTrigger 22:11:20 FPEAuditor 13 0 WARNING FPE DIVBYZERO in [Execute] of [NSWL1Simulation] on event 47672012 1 0x7f9c3eea8d64 71744 RDOtoRDOTrigger 22:11:20 FPEAuditor 13 0 INFO FPE stacktrace 0 : 71745 RDOtoRDOTrigger 22:11:20 in function : std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_tra its<char> >&, char const*) (/cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/x86_64-centos7/include/c++/11.2.0/ostream:611) 71746 RDOtoRDOTrigger 22:11:20 included from : fpe_sig_action (/build/atnight/localbuilds/nightlies/Athena/master/athena/Control/AthenaAuditors/src/FPEAudit_linux.icc:80) 71747 RDOtoRDOTrigger 22:11:20 in library : /cvmfs/atlas-nightlies.cern.ch/repo/sw/master_Athena_x86_64-centos7-gcc11-opt/2022-04-28T2101/Athena/22.0.65/InstallArea/x86_64-cent os7-gcc11-opt/lib/libAthenaAuditors.so 71748 RDOtoRDOTrigger 22:11:20 FPEAuditor 13 0 INFO FPE stacktrace 1 : 71749 RDOtoRDOTrigger 22:11:20 in function : __restore_rt (sigaction.c:0) 71750 RDOtoRDOTrigger 22:11:20 in library : /lib64/libpthread.so.0 71751 RDOtoRDOTrigger 22:11:40 FPEAuditor 13 0 INFO FPE stacktrace 2 : 71752 RDOtoRDOTrigger 22:11:40 in function : MMT_Hit::MMT_Hit(char, hitData_entry, MuonGM::MuonDetectorManager const*, std::shared_ptr<MMT_Parameters>) (/build/atnight/localbui lds/nightlies/Athena/master/athena/Trigger/TrigT1/TrigT1NSWSimTools/src/MMT_Hit.cxx:51) 71753 RDOtoRDOTrigger 22:11:40 in library : /cvmfs/atlas-nightlies.cern.ch/repo/sw/master_Athena_x86_64-centos7-gcc11-opt/2022-04-28T2101/Athena/22.0.65/InstallArea/x86_64-cent os7-gcc11-opt/lib/libTrigT1NSWSimToolsLib.so 71754 RDOtoRDOTrigger 22:11:44 FPEAuditor 13 0 INFO FPE stacktrace 3 : 71755 RDOtoRDOTrigger 22:11:44 in function : std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (/cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/x86_64-centos7 /include/c++/11.2.0/bits/shared_ptr_base.h:704) 71756 RDOtoRDOTrigger 22:11:44 included from : std::__shared_ptr<MMT_Parameters, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (/cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/ x86_64-centos7/include/c++/11.2.0/bits/shared_ptr_base.h:1154)
The issue points to this file and line. The global Z position of a strip is 0 and division by 0 occurs, triggering the FPE warning.
The FPE warning follows several of these warnings:
71671 RDOtoRDOTrigger 22:11:14 MMLoadVariables 13 0 WARNING MicroMegas digitization: failed to associate a valid local position for (chip response) strip n. 54; associated positions will be set to 0.0.
stemming from this file and line.The local position of a MM strip is trying to be evaluated, but the query fails and then the subsequent local->global transform returns a 0 global Z coordinate, resulting in the FPE warning.
The function to get MM strip local positions should fail to retrieve the position for dead, pcb or routed strips (by definition). This may be a good starting point. To my knowledge, this strips should also never be created during the digitization so it is surprising that they now exist in the RDO. Somehow, increasing the digitization time window too much creates strips which are outside the valid range of strips on which to return the local positions.
Extra comment on May 9th 2022: In some cases in pileup premixing (and standard pileup jobs), the HITS can have kinetic energy equal to zero and these were not filtered out and caused the invalid digits and the MM digitization warnings.
I'm producing pileup samples overnight to validate if this is the source of all the MM digitization warnings and then if this also gets rid of all the FPE warnings.
Cheers, Alexandre
Hi Alexandre that's encouraging, but I would like to understand better what is going on. Are you saying that these HITS have a valid link to a particle whose kinetic energy is zero ? The only good reason I can think for that is if the particle caem to rest inside the sensitive volume. But why does the digitization create an invalid identifier in this case, and why should such energy deposit be ignored ?
cheers, Tommaso
Hi @tlari ,
Yes some HITS have zero kinetic energy from the post-step G4 values. I've identified some of these particles as protons, deuterons and argon nuclei.
The issue arises in this function, where the strips from all the hits on a certain layer are combined into 1 container: https://acode-browser1.usatlas.bnl.gov/lxr/source/athena/MuonSpectrometer/MuonDigitization/MM_Digitization/src/MM_DigitizationTool.cxx#1189
The Identifier is initialized but not assigned to anything by default and in the rare cases where a layer sees 1 HIT with strips where the KE is zero, the identifier is never assigned a value. Somehow, this null pointer returns a MDT identifier in the following logic of the DigitizationTool. I assume the rate of particles with KE=0 is much higher in pileup than hard-scatter events and why these warnings have only been noticed in the digitization of pileup.
It's simply a logic problem and the problem can be solved either by ignoring HITS with KE=0 or by initializing the identifier in the function I linked above to the Identifier of the first container. It's a question I am raising with the MM experts now.
Cheers, Alexandre
Hi @alaurier,
If SimHits caused by particles with KE=0 should be ignored in digi, then we can save some time by putting a cut in the Sensitive Detector and not writing those SimHits out in the first place maybe? https://gitlab.cern.ch/atlas/athena/-/blob/master/MuonSpectrometer/MuonG4/MuonG4SD/src/MicromegasSensitiveDetector.cxx#L49
Cheers,
John
Hi @jchapman ,
Good point. Seeing how we will now be ignoring MM SimHits with KE=0, I have made changes in !53149 (merged) so these SimHits are not written out as you suggested.
added master label
Hi @rosati I agree that of course that digitizing events passed 8BC is a waste of time and resources. However in this current case there is a unforeseen behaviour of the digitization caused by this time cut. The full (proper) solution would be to dig into the digitization to understand and fix where the issues were coming from, but for the time being simply making the time window larger in the digi works well to produce the expected results.
Of course this doesn't mean we should use a time window larger than 8BC post-digitization so we should also apply the time window in the reco (RDO->PRD converter as discussed). Once I confirm that the reco issues are identified and addressed then I will add the required timing cut in reco.
You also only include +-7 BC in the digi so I also do not understand how you can have timings that are longer than that.
Why is BX time 250 here: https://acode-browser1.usatlas.bnl.gov/lxr/source/athena/MuonSpectrometer/MuonDigitization/MM_Digitization/src/MM_ElectronicsResponseSimulation.cxx#0143
Hi @tadej There is a timing offset off 200ns introduced in the digi to emulate the peaking time of the electronics. This moves the whole distribution of the digits to the right in time by 200ns. e.g muon signals for the MM produce digits with t between 200-300ns because of the offset.
The detector will use 8BC which for the digi means t between 200-400ns. With the 200ns offset, the +-7 BC range for pileup covers the correct range of BCs (it may be a bit generous by a few BC in the + range but not by too large an amount). We of course want to consider 0-200ns to catch the pileup peaking the electornics prior to a real muon.
I will leave @pscholer answer the question in regards to what the BXtime is or why is set to 250ns.
Edited by Alexandre LaurierHi @alaurier,
I'm still not back to 100%, but the range for bunch-crossings to be pased to MM_DigitizationTool in the MM Digitization Config is -200ns to +200ns so [-8,+8] BC (i.e. including ends of the range) so where does +-7 BC come in?
@pscholer, could you comment about
BXtime
also please?Cheers,
John
Hi, I had a quick look through the code and the function where
BXtime
is used seems to be a leftover from old developments. I will talk to the other MM experts (mostly for the L1 simulation) but I would not worry about it at all. Cheers, PatrickEdited by Patrick ScholerThanks for checking @pscholer.
Hi all, I have a couple of additional comments/questions :
since the RDO are saved in a larger time window than real data, we should make sure the time selection is applied also in RDOtoRDOTrigger
those FPE have me worried, as they might indicate an underlying problem. Why would out of time hits have an invalid position ?
the shape of the pileup times is different from that you reported in the slides presented at the CP meeting last Monday for 21.3 (right plot in slide 3) which had the maximum at low values. Do you understand that ?
cheers, Tommaso
Hi @tlari ,
Yes the saved RDO time window is a point of concern we are discussing on where best to implement such time selection. I'm testing currently reco performance of applying such a cut in the RDO->PRD converters and we will properly propagate to the other crucial areas such as RDOtoRDOTrigger.
For the FPE, we agree thats it's worrisome. The simple adjustment in this MR seems to fix the incorrect timing distributions in the overlayed samples and fixes the original reconstruction issues we were seeing so it is already a very good first order solution. The new FPE warnings do seem to indicate a rarely-occuring issue in the MM digitization that needs to be understood and fixed. It could be as "simple" as the very high time limit allowing the digitization to model the very slow ionization of far away electrons and the positions are extrapolated to outside the MM or dead strips. These problematic strips should be caught in the digi but it may be fringe cases we dont test/look for yet.
In regards to the shape differences from 21.3: simply put there were a few significant improvements since that 2020 version of 21.3 . 1) the shape of the digitization output is no longer very peaked and more spread out (more in line w/ data) so you see less of the hard peaks at every BC. 2) An extra offset was added so the whole distribution is shifted to the right by 200ns. 3) Not confirmed but very likely: the signal peak in 21.3 after 100ns is much larger than it should be (32x larger to be exact). At that time, due to grid issues running presampling for enough events took too long so I would "cheat" by making hard-scatter events w/ 32 events-per-muon. This allowed us to both count the rate of background and test the reconstruction on muons mixed w/ pileup.
I'll draft up some slides and summarize the current status shortly.
Hi @alaurier, thanks for the update today. I think it should be fine to apply a time window, and probably do that already at the stage of preparing the RDO's as we said today. In this case the RDO's will be in the correct time window also in the following steps (RDO->PRD and RDO->RDO trigger). We'll probably have to check again the rates after applying this window, and make sure we are not getting too far from the expected ones. In fact I am not sure whether the digitization is now moving hits that should be in time, out of the window (in this case we would underestimate the background) or just producing hits that are really not in time ( in this case it would be no problem to remove them ). But ok I think we should first converge on a setup that is using the window of 200 ns i.e. 8 BC, with the right offset to get the signal in. thanks, cheers, Stefano
Hi everyone,
Another quick update. Looking at my logs during pileup presampling (digitization) I see several of these warnings appear: WARNING given Identifier 18445618190982578175 is not a MM Identifier, skipping
I just confirmed I see this only in presampling jobs with and without the change proposed in this MR.
Although the timing distribution seems fixed, it does seem like this MR worsens the above issue.
The warning in digitization stems from here . The digitization verifies the ID of the digit (the main strip) and returns this warning if its not a correct ID. Of course this should never happen overall... However it is problematic that it doesn't check the other strips of the digit and I'm inclined to believe this will allow invalid strips to be created (if the first one is ok).
I don't see these warnings in the overlay or regular digitization jobs, pointing to an issue in how the MM digitization handles the large time distribution of pileup hits (this goes hand-in-hand with the increased time window parameter proposed in this MR).
I see the MM digitization warnings in physics validation jobs (but not the FPE in downstream RDOtoRDOTrigger jobs). It seems to me that a separate JIRA ticket should exist for that issue, so I created one : https://its.cern.ch/jira/browse/ATLASSIM-5822
cheers, Tommaso
mentioned in merge request !53113 (merged)
mentioned in merge request !53127 (merged)
mentioned in merge request !53149 (merged)
mentioned in merge request !53151 (merged)
- Resolved by Patrick Scholer
Hi everyone,
To give some context before I close this draft MR, here is the current situation of things describing the work done since this MR was raised:
- The discovered problems in sTGC overlay distributions were caused by RDO conversion bugs and are fixed in !52717 (merged) !52873 (merged)
- The time of sTGC readout elements are trimmed in simulation to match the data time window in !53117 (merged)
- The major problem with the MM pileup/overlay strip time distribution which is presented in the first figure of this MR are fully addressed with !53151 (merged) . That MR implements a more sound and less ad-hoc approach to fixing the MM digitization issues than this draft, meaning we can ignore the changes in this MR. The MM data time window is also applied in this MR (which reverts some changes of !53117 (merged)). A series of validations plots are shown in the comments of the MR, showing the MR now produces the expected output distributions for pileup and overlay jobs.
- MM identifier warnings would show up in digitization with pileup. The issue is discussed in ATLASSIM-5822 and fixed in !53113 (merged) .
- The combination of points 3 and 4 solve the occurrences of FPE warnings in MM L1 TriggerRDOToRDO jobs. As far as I can tell with overlay samples of a few thousand events, the FPE warnings are gone or are rare enough to be missed by my tests.
- MR !53149 (merged) slightly slims the MM SimHitCollections to remove HITS which are not used during digitization.
- While validating all these points, we had identified occurrences of FPE warnings in the sTGC clusterization during reconstruction. We have identified and understand the source of these warnings and will setup a MR shortly which will be marked as urgent and need to be merged ASAP. The warnings are not new and stem from an unintended behaviour of the output from sTGC Digitization.
I will close this draft MR once all the above MRs are merged and if the experts keeping track of this draft are satisfied with the current improvements.
Tagging @pscholer @chchau @rosati @jojungge @stavrop to keep informed and to add anything I may have forgotten.
I created a MR to fix the FPE warning Alexandre mentions in (7) in his summary above: !53201 (merged)
removed urgent label