Add protection again nan input ranges for the RoiDescriptor
In the past algorithms have often calculated parameters incorrectly as nan, and then tried to ceate RoiDescriptors with these parameters, so the RoiDescriptor has explicit protection agains such nan input and throws an exception.
However, this check is only on the central values - the asumption being that the ranges are sensible, and so they will only be nan if the central value is nan, so they do not need to be independently checked.
However, there is a case of an attempt to create an RoiDescriptorthe with nan ranges but with sensible central values, so now we add the range checks also to the limits of the Roi.
This is in response to https://its.cern.ch/jira/browse/ATR-22017, although it will not fix the problem, since the problem is most likely in the miscalculation upstream of the ranges themselves by some algorithm, so the RoiDescriptor can not fix this, and must throw an exception itself.
The ultimate fix, must be to identify the algorithm creating the problematic parameters in the first place, otherwise the processing for this event would need to be abandoned.
Merge request reports
Activity
added Geometry Trigger master review-pending-level-1 labels
CI Result FAILURE (hash 699ab21c)Athena AthSimulation AthGeneration AnalysisBase externals cmake make required tests optional tests Full details available on this CI monitor view
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 20316]One of the "required" tests. claims that this job fails the "frozen tier0" policy ...
ERROR Py:diff-root INFO [Muon::CSCSimHitCollection_p3_CSC_Hits]: 10 leaves differ ERROR Py:diff-root INFO [Muon::MDTSimHitCollection_p3_MDT_Hits]: 10 leaves differ ERROR Py:diff-root INFO [Muon::RPCSimHitCollection_p3_RPC_Hits]: 10 leaves differ ERROR Py:diff-root INFO [Muon::TGCSimHitCollection_p4_TGC_Hits]: 10 leaves differ ERROR Your tag breaks the frozen tier0 policy in test s3505. See run_s3505/diff-root-s3505.HITS.log file for more information.
which I strongly doubt. However, since these changes are to protect against garbage input being given to the RoiDescriptor, if it was in the past, that some aspect of the muon code was producing such garbage that was not caught, then this would change the output, but this would then also point to a bigger problem of why the muon code was producing such output in the first place, so the quicker these changes can go in, the sooner it might start showing up problems that would be important for people to fix. This of course is assuming that this test is actually related to this change - it might very well have nothing to do with it, I haven't checked.
Another of the failures was a timeout, and the third wasn't due to these changes ...
AODtoDAOD 22:15:46 Py:AODSelect_setupOptions WARNING No MCEventCollection found even though the input file is simulation! AODtoDAOD 22:15:46 Py:AODSelect_setupOptions WARNING No TruthParticleContainer found even though the input file is simulation! AODtoDAOD 22:15:46 Py:Athena INFO including file "PyAnalysisCore/InitPyAnalysisCore.py" AODtoDAOD 22:15:47 input_line_397:8:19: error: redeclaration of using declaration AODtoDAOD 22:15:47 using TProfile::BufferFill; AODtoDAOD 22:15:47 ~~~~~~~~~~^
So the only potential issue would be the first, but if it is a poblem, and related to these changes, it is not something where any reference should be fixed, as that would be masking a potentially more serious problem. I suspect however, that this change may not be the cause, but if it is, then it would just be an example of this change potentially already doing it's intended job, and identifying more serious problems elsewhere, already before it is even in the release.
Cheers Mark
Edited by Mark Suttonthe failing
SimulationTier0Test_required-test
inAthSimulation
and the failingOverlayTier0Test_required-test
andRecoPHYSFormatTest_optional-test
inAthena
are part of the nightly (cf. ATLINFR-3722).added review-approved label and removed review-pending-level-1 label
Hi @sutt,
let me re-ask: Is the exception thrown by various check methods caught?
It it is only caught in by the generic exception-catching the event-loop we won't gain much by this update. The job will abort if a NAN is encountered.
- Walter
In the past I was under the impression that code to catch it was added when we added the range check in the HLT originally, although that would have been in the steering somewhere which I didn't write so know nothing more than that - I could quite easily be mistaken, but in any case, that has all changed now, and I have no idea what is done in the scheduler.
If it isn't caught, then presumably we would either need to add code to catch it, or change the type of the exception to one that is already caught. The option to not throw an exception isn't really available, since if the RoiDescriptor is given an nan then this is a failure of some upstream algorithm which would mean that the processing for that Roi can not continue. It would be better if algorithms never produced such output, but such is life.
Adding it as it is, shouldn't make too much difference, as it will either throw this exception, or a less well controlled floating point exception when it tries to use the nan, but if people don't want this check in, we can leave it out, it really doesn't make much difference to me, although personally I would leave it in, either as it is, or with some other exception type if we have one that should be used instead.
Cheers Mark
Hi Walter, Not sure who that would be - this nan could be produced anywhere, ie is it an artefact of the file reading ? Was it there in the original file ? and if so which algorithm was responsible ?
What was the original source of the file ? Was is Run 2 data from P1 ? Or a reprocessing, or MC production ? If the nan was actually produced by the trigger itself, and it was from Run 2 athena, then we would need to know the chain and the key for the RoiDescriptor to know from where it originated. If it was Run 3 athena, then the RoiDescriptor key by itself should be able to tell us. If it were produced by the trigger, I would suspect the muon reconstruction, but only because in the past it has always been the muon reconstruction that produced nans - but this could as easily mean it is less likely to be produced by them now if they have added checks of their own.
If it was actually produced in the trigger and the reading / writing is working correctly, then we would need someone who could dump the navigation so that we could know where it came from originally, before we could work out how to go about fixing it. For that, Tomasz might be the best person for Run 2 athena, or Tim for Run 3 athena.
Can you provide more details about the source of the data that is being read so we can know where to start looking ?
Cheers Mark
Edited by Mark SuttonHi @sutt ,
I just tested you change locally, and (as I feared) it makes things worse, at least from an outside perspective. The exception breaks the event loop and the job dies.
I remove the review-approved label to avoid that it gets accidentally merged.
Trying to answer your questions: The original file is data15_13TeV.00276689.physics_Main.daq.RAW._lb0349._SFO-6._0001.data, so a fairly old bytestream file. It is certainly possible that an nan was written to this file by the software running at P1 in 2015. If this is the case, please put a protection in the code that deals with it relatively silently, like writing a WARNING to the log. An exception or ERROR would translate into a job failure.
- Walter
removed review-approved label