There appears to be a new instability in some of the Gaudi Allen tests in Moore, where some counters in the x86_64_v3 slots are fluctuating from build to build.
I briefly took a look at this a month ago and I couldn't reproduce the fluctuations locally with any of the affected build tags when running on the same machine. Unfortunately I didn't have the time to dig deeper, but it may be interesting to check whether these fluctuations are machine dependent (if I'm not mistaken, there are different types of machines used for the tests and the job scheduling is random, right?)
The logs should tell you which machines the tests run on, I guess. You could indeed see if the fluctuations correlate with specific nodes. If that's the case, then we will need to investigate what specifically is different between the nodes in question. Can you look into this, or if not, find someone else from Allen to do it?
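In case it helps whoever picks this up, this is roughly the kind of correlation check I have in mind. It is a sketch only: the log directory layout, the "running on" wording and the counter format are assumptions about the nightly logs rather than the real format, so the regexes would need adjusting.

```python
# Sketch: check whether the fluctuating PV counter correlates with the build node.
# The directory layout, the "running on <host>" line and the counter format are
# assumptions about the nightly logs, not the real format -- adjust the regexes.
import re
from collections import defaultdict
from pathlib import Path

HOST_RE = re.compile(r"running on (\S+)")                      # hypothetical host line
FALSE_PV_RE = re.compile(r"00 all .*false (\d+) from reco\.")  # "00 all" counter line

values_per_host = defaultdict(set)
for log in Path("nightly_logs").glob("**/allen_gaudi_pv_with_mcchecking*.log"):
    text = log.read_text(errors="ignore")
    host, counter = HOST_RE.search(text), FALSE_PV_RE.search(text)
    if host and counter:
        values_per_host[host.group(1)].add(int(counter.group(1)))

for host, values in sorted(values_per_host.items()):
    status = "fluctuates" if len(values) > 1 else "stable"
    print(f"{host}: {sorted(values)} ({status})")
```

If the same node shows more than one value, the fluctuation is not purely machine dependent; if each node is internally stable but the nodes disagree, it points to hardware or instruction-set differences between them.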
By the way, even if it is just small differences due to the different hardware the tests run on, some strategy to mitigate it w.r.t. the test results needs to be found, as having the results fluctuate as they currently do is not acceptable from the RTA maintainer perspective.
Adding the WP2 coordinators (@mveghel @dovombru), as I suspect this is an HLT1 reco issue and so falls in their remit.
Can you please apply pressure here for someone to take a look? Having these tests fluctuate is a real hindrance for MR merging, not to mention the potential impact on physics, whatever the underlying reason for the problem is.
If no one has time to look at it and it is not the most critical thing right now for data taking, we can add some tolerance in the testing here, no? Of course shoving it under the carpet is not great, but it would be an intermediate solution to this hassle for the maintainers.
I don't know if a single Gaudi counter can have its tolerance changed without affecting others. The author of the algorithm with the affected counter would have to follow this up. It's not just counters: there is also printout from a monitor that would need to have its tolerance changed, for example:
-PrimaryVertexChecker_a611350f INFO 00 all : 283 from 528 ( 772-244 ) [ 53.60 %], false 96 from reco. 379 ( 283+96 ) [ 25.33 %]
+PrimaryVertexChecker_a611350f INFO 00 all : 283 from 528 ( 772-244 ) [ 53.60 %], false 95 from reco. 378 ( 283+95 ) [ 25.13 %]
 PrimaryVertexChecker_a611350f INFO 01 isolated : 172 from 300 ( 436-136 ) [ 57.33 %], false 0 from reco. 172 ( 172+0 ) [ 0.00 %]
-PrimaryVertexChecker_a611350f INFO 02 close : 111 from 228 ( 336-108 ) [ 48.68 %], false 96 from reco. 207 ( 111+96 ) [ 46.38 %]
-PrimaryVertexChecker_a611350f INFO 03 ntracks<10 : 13 from 55 ( 55-0 ) [ 23.64 %], false 96 from reco. 109 ( 13+96 ) [ 88.07 %]
+PrimaryVertexChecker_a611350f INFO 02 close : 111 from 228 ( 336-108 ) [ 48.68 %], false 95 from reco. 206 ( 111+95 ) [ 46.12 %]
+PrimaryVertexChecker_a611350f INFO 03 ntracks<10 : 13 from 55 ( 55-0 ) [ 23.64 %], false 95 from reco. 108 ( 13+95 ) [ 87.96 %]
 PrimaryVertexChecker_a611350f INFO 04 ntracks>=10 : 270 from 473 ( 473-0 ) [ 57.08 %], false 0 from reco. 270 ( 270+0 ) [ 0.00 %]
 PrimaryVertexChecker_a611350f INFO 05 z<-50.0 : 71 from 110 ( 171-61 ) [ 64.55 %], false 18 from reco. 89 ( 71+18 ) [ 20.22 %]
-PrimaryVertexChecker_a611350f INFO 06 z in (-50.0, 50.0) : 151 from 305 ( 432-127 ) [ 49.51 %], false 59 from reco. 210 ( 151+59 ) [ 28.10 %]
+PrimaryVertexChecker_a611350f INFO 06 z in (-50.0, 50.0) : 151 from 305 ( 432-127 ) [ 49.51 %], false 58 from reco. 209 ( 151+58 ) [ 27.75 %]
 PrimaryVertexChecker_a611350f INFO 07 z >=50.0 : 61 from 113 ( 169-56 ) [ 53.98 %], false 19 from reco. 80 ( 61+19 ) [ 23.75 %]
-PrimaryVertexChecker_a611350f INFO 08 decayBeauty : 2 from 4 ( 4-0 ) [ 50.00 %], false 1 from reco. 98 ( 97+1 ) [ 1.02 %]
-PrimaryVertexChecker_a611350f INFO 09 decayCharm : 50 from 79 ( 79-0 ) [ 63.29 %], false 22 from reco. 146 ( 124+22 ) [ 15.07 %]
-PrimaryVertexChecker_a611350f INFO 10 decayStrange : 283 from 525 ( 567-42 ) [ 53.90 %], false 95 from reco. 379 ( 284+95 ) [ 25.07 %]
-PrimaryVertexChecker_a611350f INFO 11 other : 0 from 3 ( 205-202 ) [ 0.00 %], false 1 from reco. 96 ( 95+1 ) [ 1.04 %]
+PrimaryVertexChecker_a611350f INFO 08 decayBeauty : 2 from 4 ( 4-0 ) [ 50.00 %], false 1 from reco. 97 ( 96+1 ) [ 1.03 %]
+PrimaryVertexChecker_a611350f INFO 09 decayCharm : 50 from 79 ( 79-0 ) [ 63.29 %], false 22 from reco. 145 ( 123+22 ) [ 15.17 %]
+PrimaryVertexChecker_a611350f INFO 10 decayStrange : 283 from 525 ( 567-42 ) [ 53.90 %], false 94 from reco. 378 ( 284+94 ) [ 24.87 %]
+PrimaryVertexChecker_a611350f INFO 11 other : 0 from 3 ( 205-202 ) [ 0.00 %], false 1 from reco. 95 ( 94+1 ) [ 1.05 %]
 PrimaryVertexChecker_a611350f INFO 12 1MCPV : 66 from 100 ( 100-0 ) [ 66.00 %], false 35 from reco. 101 ( 66+35 ) [ 34.65 %]
@@ -22,3 +22,3 @@
 PrimaryVertexChecker_a611350f INFO 14 3MCPV : 54 from 94 ( 100-6 ) [ 57.45 %], false 15 from reco. 69 ( 54+15 ) [ 21.74 %]
-PrimaryVertexChecker_a611350f INFO 15 4MCPV : 35 from 76 ( 99-23 ) [ 46.05 %], false 15 from reco. 50 ( 35+15 ) [ 30.00 %]
+PrimaryVertexChecker_a611350f INFO 15 4MCPV : 35 from 76 ( 99-23 ) [ 46.05 %], false 14 from reco. 49 ( 35+14 ) [ 28.57 %]
 PrimaryVertexChecker_a611350f INFO 16 5MCPV : 30 from 61 ( 90-29 ) [ 49.18 %], false 3 from reco. 33 ( 30+3 ) [ 9.09 %]
@@ -46,9 +46,9 @@
 PrimaryVertexChecker_a611350f INFO 3_res_ntracks(10,30) : x: +0.092, y: +0.087, z: +0.251
-PrimaryVertexChecker_a611350f INFO 4_res_ntracks>30 : x: +0.089, y: +0.090, z: +0.232
+PrimaryVertexChecker_a611350f INFO 4_res_ntracks>30 : x: +0.088, y: +0.090, z: +0.233
 PrimaryVertexChecker_a611350f INFO 5_res_z<-50 : x: +0.092, y: +0.088, z: +0.221
-PrimaryVertexChecker_a611350f INFO 6_res_z(-50,50) : x: +0.093, y: +0.090, z: +0.252
-PrimaryVertexChecker_a611350f INFO 7_res_z>50 : x: +0.084, y: +0.088, z: +0.203
+PrimaryVertexChecker_a611350f INFO 6_res_z(-50,50) : x: +0.092, y: +0.090, z: +0.253
+PrimaryVertexChecker_a611350f INFO 7_res_z>50 : x: +0.085, y: +0.089, z: +0.203
 PrimaryVertexChecker_a611350f INFO
-PrimaryVertexChecker_a611350f INFO 1_pull_width_all : x: +2.639, y: +2.786, z: +2.472
-PrimaryVertexChecker_a611350f INFO 1_pull_mean_all : x: +0.107, y: -0.002, z: +0.975
+PrimaryVertexChecker_a611350f INFO 1_pull_width_all : x: +2.631, y: +2.794, z: +2.473
+PrimaryVertexChecker_a611350f INFO 1_pull_mean_all : x: +0.096, y: -0.031, z: +0.975
Again, whoever is responsible for PrimaryVertexChecker would have to look into what is feasible here.
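Just to illustrate what "adding tolerance" would actually amount to: something would have to mask the fluctuating numbers before the stdout/reference comparison. A minimal standalone sketch follows, which is not the actual Moore/GaudiTesting validator API, and the regex only covers the false-PV counter lines shown above:

```python
# Standalone sketch of masking the fluctuating PrimaryVertexChecker numbers before
# the stdout/reference comparison; not the actual Moore/GaudiTesting validator API.
import re

# Matches the "false N from reco. M ( X+N ) [ P %]" tail of the counter lines above.
FLUCTUATING = re.compile(
    r"^(PrimaryVertexChecker_\w+ +INFO .*?false) +\d+ from reco\. +\d+ "
    r"\( *\d+\+\d+ *\) \[ *[\d.]+ %\]$"
)

def mask_counters(stdout: str) -> str:
    """Replace the fluctuating false-PV numbers with a placeholder before diffing."""
    return "\n".join(FLUCTUATING.sub(r"\1 <masked>", line) for line in stdout.splitlines())
```

Whether such a mask can be hooked into the existing test validator without hiding real regressions is exactly what would need checking.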
I know it might seem petty to complain so much about a small diff like this, but until someone has done the RTA maintainer shift it's difficult to appreciate how much randomly changing tests like this interfere with the process. Firstly, every time it happens in a CI test you have to go in and double-check that it is indeed this issue and not something new. Then, when making the reference updates, these diffs can cause some havoc there as well.
There is currently an awful lot of MRs scheduled in the milestone for the June TS release, so @msaur, and the maintainers after him, are going to have their work cut out to get through it all, so anything that smooths the process is a real benefit.
I've traced back the nightlies of the last two weeks for x86_64_v3-el9-gcc13+detdesc-opt+g. In 15 nightly builds, 8 had passing tests, 5 had the failures described above and 2 did not run for 2024-patches (which is the branch I checked).
In the 5 failures, allen_gaudi_velo_with_mcchecking always fails in addition to allen_gaudi_pv_with_mcchecking. The velo tracking failure is always caused by a different hit efficiency, and most of the time it also finds a different number of tracks, which are ghosts. In the PV finding, the mean and width of the pull distribution are different. We first have to understand the issue in the velo tracking, which is most likely causing the differences in the PV finding. The forward and seed_and_match tests also fail, because of the PV counters changing.
This is probably linked to #607 (closed), so I will follow up there.
Allen!1483 (merged) changed the way online counters are registered in Allen, making them available also in Allen-via-Moore (and all the corresponding tests). That means the counters may already have been fluctuating before that MR, but they were simply not being printed by the test, so the fluctuations would have been hidden until now.
@msaur if the fluctuations are making maintenance too difficult (especially in this heavy period), perhaps we could consider skipping comparisons for the affected counters (and efficiency results in some cases) for now? @dovombru wdyt?
TBH ignoring the counters again just sounds to me like a recipe for sweeping the issue under the carpet and forgetting about it. @msaur should indeed comment, but personally I would prefer to see some sort of a more detailed investigation before resorting to that.
I would agree with @jonrob that ignoring this problem (exclusions) is surely not the way to go. The exclusion list is already rather long, and too often no one follows up on the various issues once a warning/error is on the exclusion list.
Some solution is needed, and given the available resources it will hardly come anytime soon. But the pattern in which this is happening is quite well described, so I think it could be somewhat bearable for now, as the majority of the MRs aiming for the June TS should be selection-related MRs, which in general should not require any reference update (maybe I am too naive at this point).
If that turns out not to work, then I would call for some solution.
From the technical point of view, my understanding is that if a reference update is needed and these failing tests would be included, then the relevant reference files should be dropped from the update, i.e. those references should not be changed.
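Concretely, that would just mean reverting the affected reference files before committing the update; a rough sketch of what I mean (the file paths below are illustrative, not the actual locations):

```python
# Illustration only: keep the fluctuating Allen tests out of a reference update
# by restoring their .ref files before committing. Paths are hypothetical examples.
import subprocess

fluctuating_refs = [
    "Hlt/RecoConf/tests/refs/allen_gaudi_pv_with_mcchecking.ref",
    "Hlt/RecoConf/tests/refs/allen_gaudi_velo_with_mcchecking.ref",
]

# Discard any local changes to these references so the update MR does not touch them.
subprocess.run(["git", "checkout", "--", *fluctuating_refs], check=True)
```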
RecoConf_allen_gaudi_seed_and_match_with_ut_with_mcchecking is now fluctuating as well (seen for example in lhcb-2024-patches-mr), which is probably expected as it is very similar to the other fluctuating tests. To be tested together with Allen!1678 (merged).