Add protection again nan input ranges for the RoiDescriptor

unmarked as a Work In Progress

changed the description

This merge request affects 1 package:

DetectorDescription/RoiDescriptor

Adding @sutt ,@nkoehler ,@rbianchi as watchers

added Geometry Trigger master review-pending-level-1 labels

CI Result FAILURE (hash 699ab21c)

	Athena	AthSimulation	AthGeneration	AnalysisBase
externals
cmake
make
required tests
optional tests

Full details available on this CI monitor view
Athena: number of compilation errors 0, warnings 0
AthSimulation: number of compilation errors 0, warnings 0
AthGeneration: number of compilation errors 0, warnings 0
AnalysisBase: number of compilation errors 0, warnings 0
For experts only: Jenkins output [CI-MERGE-REQUEST-CC7 20316]

One of the "required" tests. claims that this job fails the "frozen tier0" policy ...

ERROR    Py:diff-root         INFO  [Muon::CSCSimHitCollection_p3_CSC_Hits]: 10 leaves differ
ERROR    Py:diff-root         INFO  [Muon::MDTSimHitCollection_p3_MDT_Hits]: 10 leaves differ
ERROR    Py:diff-root         INFO  [Muon::RPCSimHitCollection_p3_RPC_Hits]: 10 leaves differ
ERROR    Py:diff-root         INFO  [Muon::TGCSimHitCollection_p4_TGC_Hits]: 10 leaves differ
ERROR    Your tag breaks the frozen tier0 policy in test s3505. See run_s3505/diff-root-s3505.HITS.log file for more information.

which I strongly doubt. However, since these changes are to protect against garbage input being given to the RoiDescriptor, if it was in the past, that some aspect of the muon code was producing such garbage that was not caught, then this would change the output, but this would then also point to a bigger problem of why the muon code was producing such output in the first place, so the quicker these changes can go in, the sooner it might start showing up problems that would be important for people to fix. This of course is assuming that this test is actually related to this change - it might very well have nothing to do with it, I haven't checked.

Another of the failures was a timeout, and the third wasn't due to these changes ...

AODtoDAOD 22:15:46 Py:AODSelect_setupOptions WARNING No MCEventCollection found even though the input file is simulation!
AODtoDAOD 22:15:46 Py:AODSelect_setupOptions WARNING No TruthParticleContainer found even though the input file is simulation!
AODtoDAOD 22:15:46 Py:Athena            INFO including file "PyAnalysisCore/InitPyAnalysisCore.py"
AODtoDAOD 22:15:47 input_line_397:8:19: error: redeclaration of using declaration
AODtoDAOD 22:15:47   using TProfile::BufferFill;
AODtoDAOD 22:15:47         ~~~~~~~~~~^

So the only potential issue would be the first, but if it is a poblem, and related to these changes, it is not something where any reference should be fixed, as that would be masking a potentially more serious problem. I suspect however, that this change may not be the cause, but if it is, then it would just be an example of this change potentially already doing it's intended job, and identifying more serious problems elsewhere, already before it is even in the release.

Cheers Mark

changed title from Add protection again nan input rnages for the RoiDescriptor to Add protection again nan input ranges for the RoiDescriptor

the failing SimulationTier0Test_required-test in AthSimulation and the failing OverlayTier0Test_required-test and RecoPHYSFormatTest_optional-test in Athena are part of the nightly (cf. ATLINFR-3722).

Many Nico, in that case, I guess this means that this change is good to go.

Cheers Mark

I see these test failing elsewhere too. I double-checked and I agree that this is unrelated so approving.

added review-approved label and removed review-pending-level-1 label

Hi @sutt

is the exception thrown by various check methods caught anywhere?

Walter

Hi @sutt,

let me re-ask: Is the exception thrown by various check methods caught?

It it is only caught in by the generic exception-catching the event-loop we won't gain much by this update. The job will abort if a NAN is encountered.

Walter

In the past I was under the impression that code to catch it was added when we added the range check in the HLT originally, although that would have been in the steering somewhere which I didn't write so know nothing more than that - I could quite easily be mistaken, but in any case, that has all changed now, and I have no idea what is done in the scheduler.

If it isn't caught, then presumably we would either need to add code to catch it, or change the type of the exception to one that is already caught. The option to not throw an exception isn't really available, since if the RoiDescriptor is given an nan then this is a failure of some upstream algorithm which would mean that the processing for that Roi can not continue. It would be better if algorithms never produced such output, but such is life.

Adding it as it is, shouldn't make too much difference, as it will either throw this exception, or a less well controlled floating point exception when it tries to use the nan, but if people don't want this check in, we can leave it out, it really doesn't make much difference to me, although personally I would leave it in, either as it is, or with some other exception type if we have one that should be used instead.

Cheers Mark

Hi @sutt ,

I fully agree that the real bug is elsewhere. Can't you re-assign the original bug-report ATR-22017 to whoever is responsible for the produces the Nan?

Walter

Hi Walter, Not sure who that would be - this nan could be produced anywhere, ie is it an artefact of the file reading ? Was it there in the original file ? and if so which algorithm was responsible ?

What was the original source of the file ? Was is Run 2 data from P1 ? Or a reprocessing, or MC production ? If the nan was actually produced by the trigger itself, and it was from Run 2 athena, then we would need to know the chain and the key for the RoiDescriptor to know from where it originated. If it was Run 3 athena, then the RoiDescriptor key by itself should be able to tell us. If it were produced by the trigger, I would suspect the muon reconstruction, but only because in the past it has always been the muon reconstruction that produced nans - but this could as easily mean it is less likely to be produced by them now if they have added checks of their own.

If it was actually produced in the trigger and the reading / writing is working correctly, then we would need someone who could dump the navigation so that we could know where it came from originally, before we could work out how to go about fixing it. For that, Tomasz might be the best person for Run 2 athena, or Tim for Run 3 athena.

Can you provide more details about the source of the data that is being read so we can know where to start looking ?

Cheers Mark

Hi @sutt ,

I just tested you change locally, and (as I feared) it makes things worse, at least from an outside perspective. The exception breaks the event loop and the job dies.

I remove the review-approved label to avoid that it gets accidentally merged.

Trying to answer your questions: The original file is data15_13TeV.00276689.physics_Main.daq.RAW._lb0349._SFO-6._0001.data, so a fairly old bytestream file. It is certainly possible that an nan was written to this file by the software running at P1 in 2015. If this is the case, please put a protection in the code that deals with it relatively silently, like writing a WARNING to the log. An exception or ERROR would translate into a job failure.

Walter

removed review-approved label

Adding user-required so this MR is properly tracked by the system.

Add protection again nan input ranges for the RoiDescriptor

Merge request reports

Activity