Draft: Fix MM pileup overlay time distribution

added Digitization MuonSpectrometer Overlay urgent labels

This should be a temporary fix, because the time window has to be reduced to fit inside a 8 bunch crossing window, i.e. 200 ns. This conflicts with the calibration interface currently in development, which assume a 8BC window. We need advice on how to proceed.

For some information on the new FPE warnings in HLT MM L1 trigger simulation jobs. We have not been able to reproduce the FPE warnings when using the value of WindowUpperOffset of 500 set in this MR. The warnings only appear when we make the value larger to 1000. We do not understand this and may point to an issue in the digitization.

The HLT trigger command is taken from (ATR-25352). The first instance of the FPE warnings occurred when running on ttbar+pileup RDOs produced following this command. I have also included an overlayed muon + pileup RDO file and HTL trigger job command in this shared directory: /afs/cern.ch/user/a/alaurier/public/FPEWarning

A snippet of the FPE warning is found here: 71743 RDOtoRDOTrigger 22:11:20 FPEAuditor 13 0 WARNING FPE DIVBYZERO in [Execute] of [NSWL1Simulation] on event 47672012 1 0x7f9c3eea8d64 71744 RDOtoRDOTrigger 22:11:20 FPEAuditor 13 0 INFO FPE stacktrace 0 : 71745 RDOtoRDOTrigger 22:11:20 in function : std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_tra its<char> >&, char const*) (/cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/x86_64-centos7/include/c++/11.2.0/ostream:611) 71746 RDOtoRDOTrigger 22:11:20 included from : fpe_sig_action (/build/atnight/localbuilds/nightlies/Athena/master/athena/Control/AthenaAuditors/src/FPEAudit_linux.icc:80) 71747 RDOtoRDOTrigger 22:11:20 in library : /cvmfs/atlas-nightlies.cern.ch/repo/sw/master_Athena_x86_64-centos7-gcc11-opt/2022-04-28T2101/Athena/22.0.65/InstallArea/x86_64-cent os7-gcc11-opt/lib/libAthenaAuditors.so 71748 RDOtoRDOTrigger 22:11:20 FPEAuditor 13 0 INFO FPE stacktrace 1 : 71749 RDOtoRDOTrigger 22:11:20 in function : __restore_rt (sigaction.c:0) 71750 RDOtoRDOTrigger 22:11:20 in library : /lib64/libpthread.so.0 71751 RDOtoRDOTrigger 22:11:40 FPEAuditor 13 0 INFO FPE stacktrace 2 : 71752 RDOtoRDOTrigger 22:11:40 in function : MMT_Hit::MMT_Hit(char, hitData_entry, MuonGM::MuonDetectorManager const*, std::shared_ptr<MMT_Parameters>) (/build/atnight/localbui lds/nightlies/Athena/master/athena/Trigger/TrigT1/TrigT1NSWSimTools/src/MMT_Hit.cxx:51) 71753 RDOtoRDOTrigger 22:11:40 in library : /cvmfs/atlas-nightlies.cern.ch/repo/sw/master_Athena_x86_64-centos7-gcc11-opt/2022-04-28T2101/Athena/22.0.65/InstallArea/x86_64-cent os7-gcc11-opt/lib/libTrigT1NSWSimToolsLib.so 71754 RDOtoRDOTrigger 22:11:44 FPEAuditor 13 0 INFO FPE stacktrace 3 : 71755 RDOtoRDOTrigger 22:11:44 in function : std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (/cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/x86_64-centos7 /include/c++/11.2.0/bits/shared_ptr_base.h:704) 71756 RDOtoRDOTrigger 22:11:44 included from : std::__shared_ptr<MMT_Parameters, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (/cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-8a51a/ x86_64-centos7/include/c++/11.2.0/bits/shared_ptr_base.h:1154)

The issue points to this file and line. The global Z position of a strip is 0 and division by 0 occurs, triggering the FPE warning.

The FPE warning follows several of these warnings: 71671 RDOtoRDOTrigger 22:11:14 MMLoadVariables 13 0 WARNING MicroMegas digitization: failed to associate a valid local position for (chip response) strip n. 54; associated positions will be set to 0.0. stemming from this file and line.

The local position of a MM strip is trying to be evaluated, but the query fails and then the subsequent local->global transform returns a 0 global Z coordinate, resulting in the FPE warning.

The function to get MM strip local positions should fail to retrieve the position for dead, pcb or routed strips (by definition). This may be a good starting point. To my knowledge, this strips should also never be created during the digitization so it is surprising that they now exist in the RDO. Somehow, increasing the digitization time window too much creates strips which are outside the valid range of strips on which to return the local positions.

Extra comment on May 9th 2022: In some cases in pileup premixing (and standard pileup jobs), the HITS can have kinetic energy equal to zero and these were not filtered out and caused the invalid digits and the MM digitization warnings.

I'm producing pileup samples overnight to validate if this is the source of all the MM digitization warnings and then if this also gets rid of all the FPE warnings.

Cheers, Alexandre

Hi Alexandre that's encouraging, but I would like to understand better what is going on. Are you saying that these HITS have a valid link to a particle whose kinetic energy is zero ? The only good reason I can think for that is if the particle caem to rest inside the sensitive volume. But why does the digitization create an invalid identifier in this case, and why should such energy deposit be ignored ?

cheers, Tommaso

Hi @tlari ,

Yes some HITS have zero kinetic energy from the post-step G4 values. I've identified some of these particles as protons, deuterons and argon nuclei.

The issue arises in this function, where the strips from all the hits on a certain layer are combined into 1 container: https://acode-browser1.usatlas.bnl.gov/lxr/source/athena/MuonSpectrometer/MuonDigitization/MM_Digitization/src/MM_DigitizationTool.cxx#1189

The Identifier is initialized but not assigned to anything by default and in the rare cases where a layer sees 1 HIT with strips where the KE is zero, the identifier is never assigned a value. Somehow, this null pointer returns a MDT identifier in the following logic of the DigitizationTool. I assume the rate of particles with KE=0 is much higher in pileup than hard-scatter events and why these warnings have only been noticed in the digitization of pileup.

It's simply a logic problem and the problem can be solved either by ignoring HITS with KE=0 or by initializing the identifier in the function I linked above to the Identifier of the first container. It's a question I am raising with the MM experts now.

Cheers, Alexandre

Hi @alaurier,

If SimHits caused by particles with KE=0 should be ignored in digi, then we can save some time by putting a cut in the Sensitive Detector and not writing those SimHits out in the first place maybe? https://gitlab.cern.ch/atlas/athena/-/blob/master/MuonSpectrometer/MuonG4/MuonG4SD/src/MicromegasSensitiveDetector.cxx#L49

Cheers,

John

Hi @jchapman ,

Good point. Seeing how we will now be ignoring MM SimHits with KE=0, I have made changes in !53149 (merged) so these SimHits are not written out as you suggested.

added master label

hi @alaurier

thanks for the update. I think we should keep the 8 BC window as it's expected in real data though, maybe trying to understand why we have those strips going beyond 400 in time ?

I am adding also @fgravili and @vadamico as L1 trigger experts so that they can follow.

cheers, stefano

Hi @rosati I agree that of course that digitizing events passed 8BC is a waste of time and resources. However in this current case there is a unforeseen behaviour of the digitization caused by this time cut. The full (proper) solution would be to dig into the digitization to understand and fix where the issues were coming from, but for the time being simply making the time window larger in the digi works well to produce the expected results.

Of course this doesn't mean we should use a time window larger than 8BC post-digitization so we should also apply the time window in the reco (RDO->PRD converter as discussed). Once I confirm that the reco issues are identified and addressed then I will add the required timing cut in reco.

You also only include +-7 BC in the digi so I also do not understand how you can have timings that are longer than that.

Why is BX time 250 here: https://acode-browser1.usatlas.bnl.gov/lxr/source/athena/MuonSpectrometer/MuonDigitization/MM_Digitization/src/MM_ElectronicsResponseSimulation.cxx#0143

Hi @tadej There is a timing offset off 200ns introduced in the digi to emulate the peaking time of the electronics. This moves the whole distribution of the digits to the right in time by 200ns. e.g muon signals for the MM produce digits with t between 200-300ns because of the offset.

The detector will use 8BC which for the digi means t between 200-400ns. With the 200ns offset, the +-7 BC range for pileup covers the correct range of BCs (it may be a bit generous by a few BC in the + range but not by too large an amount). We of course want to consider 0-200ns to catch the pileup peaking the electornics prior to a real muon.

I will leave @pscholer answer the question in regards to what the BXtime is or why is set to 250ns.

Hi @alaurier,

I'm still not back to 100%, but the range for bunch-crossings to be pased to MM_DigitizationTool in the MM Digitization Config is -200ns to +200ns so [-8,+8] BC (i.e. including ends of the range) so where does +-7 BC come in?

@pscholer, could you comment about BXtime also please?

Cheers,

John

Hi @jchapman ,

Sorry yes the +-7 range was referring to +-8.

Patrick is mostly offline for the rest of the week. The answer here might not be immediate.

Hi, I had a quick look through the code and the function where BXtime is used seems to be a leftover from old developments. I will talk to the other MM experts (mostly for the L1 simulation) but I would not worry about it at all. Cheers, Patrick

Thanks for checking @pscholer.

Hi all, I have a couple of additional comments/questions :

since the RDO are saved in a larger time window than real data, we should make sure the time selection is applied also in RDOtoRDOTrigger

those FPE have me worried, as they might indicate an underlying problem. Why would out of time hits have an invalid position ?

the shape of the pileup times is different from that you reported in the slides presented at the CP meeting last Monday for 21.3 (right plot in slide 3) which had the maximum at low values. Do you understand that ?

cheers, Tommaso

Hi @tlari ,

Yes the saved RDO time window is a point of concern we are discussing on where best to implement such time selection. I'm testing currently reco performance of applying such a cut in the RDO->PRD converters and we will properly propagate to the other crucial areas such as RDOtoRDOTrigger.

For the FPE, we agree thats it's worrisome. The simple adjustment in this MR seems to fix the incorrect timing distributions in the overlayed samples and fixes the original reconstruction issues we were seeing so it is already a very good first order solution. The new FPE warnings do seem to indicate a rarely-occuring issue in the MM digitization that needs to be understood and fixed. It could be as "simple" as the very high time limit allowing the digitization to model the very slow ionization of far away electrons and the positions are extrapolated to outside the MM or dead strips. These problematic strips should be caught in the digi but it may be fringe cases we dont test/look for yet.

In regards to the shape differences from 21.3: simply put there were a few significant improvements since that 2020 version of 21.3 . 1) the shape of the digitization output is no longer very peaked and more spread out (more in line w/ data) so you see less of the hard peaks at every BC. 2) An extra offset was added so the whole distribution is shifted to the right by 200ns. 3) Not confirmed but very likely: the signal peak in 21.3 after 100ns is much larger than it should be (32x larger to be exact). At that time, due to grid issues running presampling for enough events took too long so I would "cheat" by making hard-scatter events w/ 32 events-per-muon. This allowed us to both count the rate of background and test the reconstruction on muons mixed w/ pileup.

I'll draft up some slides and summarize the current status shortly.

Hi @alaurier, thanks for the update today. I think it should be fine to apply a time window, and probably do that already at the stage of preparing the RDO's as we said today. In this case the RDO's will be in the correct time window also in the following steps (RDO->PRD and RDO->RDO trigger). We'll probably have to check again the rates after applying this window, and make sure we are not getting too far from the expected ones. In fact I am not sure whether the digitization is now moving hits that should be in time, out of the window (in this case we would underestimate the background) or just producing hits that are really not in time ( in this case it would be no problem to remove them ). But ok I think we should first converge on a setup that is using the window of 200 ns i.e. 8 BC, with the right offset to get the signal in. thanks, cheers, Stefano

Hi everyone,

Another quick update. Looking at my logs during pileup presampling (digitization) I see several of these warnings appear: WARNING given Identifier 18445618190982578175 is not a MM Identifier, skipping

I just confirmed I see this only in presampling jobs with and without the change proposed in this MR.

Although the timing distribution seems fixed, it does seem like this MR worsens the above issue.

The warning in digitization stems from here . The digitization verifies the ID of the digit (the main strip) and returns this warning if its not a correct ID. Of course this should never happen overall... However it is problematic that it doesn't check the other strips of the digit and I'm inclined to believe this will allow invalid strips to be created (if the first one is ok).

I don't see these warnings in the overlay or regular digitization jobs, pointing to an issue in how the MM digitization handles the large time distribution of pileup hits (this goes hand-in-hand with the increased time window parameter proposed in this MR).

But presampling jobs are standard digi with different output names for muons so I'm a bit confused. Could it be that in case of presampling we do not have hard-scatter and digits are created differently?

I see the MM digitization warnings in physics validation jobs (but not the FPE in downstream RDOtoRDOTrigger jobs). It seems to me that a separate JIRA ticket should exist for that issue, so I created one : https://its.cern.ch/jira/browse/ATLASSIM-5822

cheers, Tommaso

mentioned in merge request !53113 (merged)

mentioned in merge request !53127 (merged)

mentioned in merge request !53149 (merged)

I've been running more tests today and the different improvements to the MM discussed in this MR and others in the thread seem to address the MM trigger related FPE warnings in the L1 trigger jobs.

mentioned in merge request !53151 (merged)

Hi everyone,

To give some context before I close this draft MR, here is the current situation of things describing the work done since this MR was raised:

The discovered problems in sTGC overlay distributions were caused by RDO conversion bugs and are fixed in !52717 (merged) !52873 (merged)
The time of sTGC readout elements are trimmed in simulation to match the data time window in !53117 (merged)
The major problem with the MM pileup/overlay strip time distribution which is presented in the first figure of this MR are fully addressed with !53151 (merged) . That MR implements a more sound and less ad-hoc approach to fixing the MM digitization issues than this draft, meaning we can ignore the changes in this MR. The MM data time window is also applied in this MR (which reverts some changes of !53117 (merged)). A series of validations plots are shown in the comments of the MR, showing the MR now produces the expected output distributions for pileup and overlay jobs.
MM identifier warnings would show up in digitization with pileup. The issue is discussed in ATLASSIM-5822 and fixed in !53113 (merged) .
The combination of points 3 and 4 solve the occurrences of FPE warnings in MM L1 TriggerRDOToRDO jobs. As far as I can tell with overlay samples of a few thousand events, the FPE warnings are gone or are rare enough to be missed by my tests.
MR !53149 (merged) slightly slims the MM SimHitCollections to remove HITS which are not used during digitization.
While validating all these points, we had identified occurrences of FPE warnings in the sTGC clusterization during reconstruction. We have identified and understand the source of these warnings and will setup a MR shortly which will be marked as urgent and need to be merged ASAP. The warnings are not new and stem from an unintended behaviour of the output from sTGC Digitization.

I will close this draft MR once all the above MRs are merged and if the experts keeping track of this draft are satisfied with the current improvements.

Tagging @pscholer @chchau @rosati @jojungge @stavrop to keep informed and to add anything I may have forgotten.

I created a MR to fix the FPE warning Alexandre mentions in (7) in his summary above: !53201 (merged)

Hi @alaurier,

can this MR be closed now that it seems all additional MRs listed in your summary have been merged? If you still need it to be open for some reason, can at least the urgent label be removed since it's a bit confusing?

Thanks, Jakub (L1 shifter)

Hi @jkremer ,

I just closed the MR and remove the urgent tag. This one fell through the cracks, thanks for the reminder.

removed urgent label

closed

Draft: Fix MM pileup overlay time distribution

Closed by Alexandre Laurier 2 years ago (Jun 9, 2022 1:08pm UTC) 2 years ago

Activity

Draft: Fix MM pileup overlay time distribution

Merge request reports

Closed by Alexandre Laurier 2 years ago (Jun 9, 2022 1:08pm UTC) 2 years ago

Activity