S28r2 cannot give consistent counters locally (i.e. running the test twice produces different counters). The differences are small and do not affect the rates etc.
It may be worth updating this issue with the results of the rate tests that were reported by @atully at PAC today. At the same meeting, it was decided to proceed with marking the stripping24e2-28r2-patches branch obsolete, since this issue seems unrelated to the move to 2018-patches.
Here, I’ll break down the reproducibility issues seen with the StrippingBu2LLK line. These issues show up in the CounterMismatches sections of some nightly tests (for example, see test_stripping34r0p1_collision18_reco18 in test 1287) in Stripping34r0p1, though they likely affect other campaigns too.
The problem shows up in the B+ -> J/psi K*+, K*+ -> KS0 pi+ sector, where a slightly different number of K*+ candidates, and therefore a slightly different number of B+ candidates, are created from run to run.
Two sources of Ks0 are used: StdVeryLooseKsLL (the usual Ks0 container) and StdKs2PiPiLL (a container of Ks0 from Brunel). Both are passed (see lines 476-480) to the _makeKstarPlus function. Inside this function (see line 1028 onwards), a MergedSelection combines the two Ks0 sources, with Unique=True. The Unique=True flag activates FilterUnique, whose purpose is to reject cases where two Ks0 particles share exactly the same daughters. The FilterUnique code orders the Ks0 by memory address and keeps whichever of the duplicates comes last in memory; since memory addresses change from run to run, so does the survivor.
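The selection logic described above can be sketched in Python (a toy model for illustration only: `id()` stands in for the C++ pointer, and the candidate dictionaries and container names are made up; this is not the actual FilterUnique implementation):

```python
# Toy sketch of the FilterUnique behavior described above: candidates
# sharing the same daughter set are deduplicated, with ties broken by
# memory address, so the last one in address order wins.

def filter_unique(candidates):
    """Keep one candidate per daughter set; ties broken by memory address."""
    by_daughters = {}
    for cand in sorted(candidates, key=id):  # pointer ordering: run-dependent!
        key = frozenset(cand["daughters"])
        by_daughters[key] = cand  # the candidate at the highest address wins
    return list(by_daughters.values())

# Two Ks0 built from the same two pions but by different reconstructions:
ks_a = {"origin": "StdVeryLooseKsLL", "daughters": ("pi+#1", "pi-#2"), "vchi2": 0.323}
ks_b = {"origin": "StdKs2PiPiLL",     "daughters": ("pi+#1", "pi-#2"), "vchi2": 0.321}

survivors = filter_unique([ks_a, ks_b])
print(len(survivors))  # 1 -- the duplicate is always removed...
# ...but WHICH copy survives depends on id(ks_a) vs id(ks_b), i.e. on
# where the runtime happened to allocate them, which varies run to run.
```

The counter of Ks0 candidates is therefore stable; only the identity (and properties) of the survivor is not.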
Normally this would not be a problem, but it looks like slightly different reconstruction algorithms run to fill the Brunel container than the normal one. I verified this by printing out the vertex chi2 of the Ks0 candidates: when there are duplicates, they have slightly different properties, and FilterUnique randomly keeps one of them. Thus we always have the same number of Ks0 passing, but their properties differ slightly. This translates into a slightly different number of K*+ candidates passing the MotherCut in the CombineParticles inside _makeKstarPlus, and then into a slightly different number of B's.
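A minimal numerical illustration of that last step (the cut value 0.322 is made up purely to sit between the two chi2 values seen in the logs; the real MotherCut is on the K*+ combination):

```python
# Same Ks0 count either way, but a threshold-style cut can accept one
# near-duplicate and reject the other, changing downstream yields.

MOTHER_CUT = 0.322  # hypothetical threshold between the duplicates' chi2

def passes_mother_cut(vchi2):
    return vchi2 < MOTHER_CUT

run1_survivor = 0.321  # this run, FilterUnique kept the 0.321 copy
run2_survivor = 0.323  # next run, it kept the 0.323 copy instead

print(passes_mother_cut(run1_survivor))  # True
print(passes_mother_cut(run2_survivor))  # False -> one fewer K*+ candidate
```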
Below are snippets of logs (full logs stdout_1 and stdout_2 attached) from two consecutive runs of the stripping line on the same input data. I modified the FilterUnique code to print out the addresses and some other properties of the candidates in its input, good and filtered containers; the removal of duplicates happens between the good and filtered containers. We see a case of duplicate Ks0 with the same daughters (having keys 4 and 24): one has a vertex chi2 of 0.323, the other 0.321, and FilterUnique randomly saved a different one in each run. Note that in the snippets below the container is erroneously labeled MergedKstarsPlusLLForBu2LLK when in fact it contains Ks0.
Now, this just shows up for B+ -> J/psi K*+, K*+ -> Ks0 pi+ here, but in principle it affects all cases where Brunel candidates and normal candidates are combined with FilterUnique.
Maybe we can get @ibelyaev and @pkoppenb involved, since this is now going beyond stripping. Guys, could you have a look and maybe share your thoughts here?
Good catch!!! So we have to find another, deterministic way of filtering unique. Would it introduce an unacceptable bias if we always selected the StdVeryLooseKsLL option in the case where the candidates are otherwise identical?
They are almost identical; if they were fully identical, the random selection wouldn't matter. It's a kind of multiple-candidate selection, but a good one, as we can emulate it on signal (the annoying ones are those that have interference between signal and background). This selection has an efficiency on signal, which can be measured in MC. Giving priority to StdVeryLooseKsLL would also have an efficiency (likely almost the same) that can likewise be measured on signal MC. So no problem.
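For concreteness, the priority-based alternative could look something like this sketch (the function name and candidate layout are hypothetical, not the real FilterUnique API; the container paths are the TES locations of the two sources):

```python
# Deterministic dedup sketch: when two candidates share the same daughters,
# always keep the one from the preferred container rather than whichever
# happens to sit later in memory.

PREFERRED = "Phys/StdVeryLooseKsLL/Particles"

def filter_unique_deterministic(candidates):
    best = {}
    for cand in candidates:
        key = frozenset(cand["daughters"])
        prev = best.get(key)
        # Replace a previously kept duplicate only if the new one is preferred.
        if prev is None or (cand["origin"] == PREFERRED and prev["origin"] != PREFERRED):
            best[key] = cand
    return list(best.values())

ks_std    = {"daughters": ("pi+#1", "pi-#2"), "origin": PREFERRED}
ks_brunel = {"daughters": ("pi+#1", "pi-#2"), "origin": "Phys/StdKs2PiPiLL/Particles"}

# Same survivor regardless of input order or allocation addresses:
print(filter_unique_deterministic([ks_std, ks_brunel])[0]["origin"])
print(filter_unique_deterministic([ks_brunel, ks_std])[0]["origin"])
```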
Does the filter algorithm allow giving priority to a given container? Or does that need some C++ hacking?
Yes, in the sense that it's as random as selecting on pointer: it is just the order in which the particles are created and entered into the original container. But it needs to be checked whether the key is really unique: since this LHCb::Particle::ConstVector contains Particles coming from two different KeyedContainers, there may be duplicates (the key is set when a Particle is added to a KeyedContainer).
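The possible key collision can be sketched as follows (a toy model, assuming each KeyedContainer assigns keys independently starting from 0 when a particle is added, as described above):

```python
# Toy model of per-container key assignment: two containers each hand out
# keys from their own counter, so particles from different containers can
# carry the same key.

class KeyedContainer:
    def __init__(self, name):
        self.name = name
        self._next_key = 0
    def add(self, particle):
        particle["key"] = self._next_key  # key assigned on insertion
        particle["container"] = self.name
        self._next_key += 1

std_ks    = KeyedContainer("Phys/StdVeryLooseKsLL/Particles")
brunel_ks = KeyedContainer("Phys/StdKs2PiPiLL/Particles")

a, b = {}, {}
std_ks.add(a)
brunel_ks.add(b)

print(a["key"], b["key"])  # 0 0 -> duplicate keys across containers
# A deterministic ordering would need (container, key), not the key alone.
```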
This is all a really impressive catch @avenkate !
Is it simple enough to check that the lines involved in the failing ref matches all make use of this FilterUnique method? And is it only a problem when FilterUnique is checking against Brunel containers, or any time the method is used? (Are there other cases where it's checking against two different types of containers?)
Hopefully they are all related to this, and we don't have to search for a second, unrelated issue.
Awesome job!
I have only scratched the surface with the QEE lines (i.e. the StdJets related failures) but I believe it is unrelated to FilterUnique.
The usage of FilterUnique by itself is not enough to cause these problems. You would need two containers with overlapping candidates that have slightly different reconstruction.
As for other places where FilterUnique might be used in Stripping, a quick search for Unique=True and Unique = True yields StrippingInclusiveDoubleD as the only other instance. But I would need to spend more time looking at it to know if it is problematic.
Hello @avenkate, as I'm back from hols and the campaign is going full steam, I thought I would ask for a little update on this important matter. We should really try hard to have this saga finalised before the full production starts after the summer, right? Thanks.
Hi @erodrigu, are you referring to the reproducibility issues in general, or to FilterUnique specifically?
For the former, I am making progress with the StdJets related issues. The difference in the number of warnings arises because a different number of particles get filled into the Phys/PFParticles TES location. This TES location gets filled in stages (in Phys/JetAccessories/ParticleFlow.cpp), with loops over charged particles, VELO particles, merged pi0s, resolved pi0s, photons, HCAL clusters and neutral-recovery particles. I have narrowed down the culprits to the last three, i.e. photons, HCAL clusters and neutral recovery: the number of particles added in these categories is what varies from run to run. I have dived deeper into photons, in the function treatPhotons. The reproducibility issues with photons arise at line 989,
```cpp
if ( ! hypoClusters[0] || m_PFCaloClusters[hypoClusters[0].target()->seed().all()]->status() != LHCb::PFCaloCluster::Available ) continue;
```
because the status of the cluster flips between 0 (Available) and 1 (AvailableForNeutralRecovery), for reasons I am yet to understand. Then there are still the HCAL clusters and neutral recovery, which I haven't gotten to yet. As you can see, this beast is more complicated than the FilterUnique case, and it also operates more sinisterly: I see cases where there is one less particle from, say, photons and one more from the HCAL clusters, so the total matches the ref, but something has still gone wrong underneath.
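That compensating shift can be illustrated with a toy example (the numbers are made up; only the pattern matters), showing why a check on the total alone cannot catch it:

```python
# One fewer photon, one more HCAL cluster: the total agrees with the
# reference even though the per-category content differs.

ref  = {"photons": 10, "hcal_clusters": 5, "neutral_recovery": 3}
run2 = {"photons": 9,  "hcal_clusters": 6, "neutral_recovery": 3}

print(sum(ref.values()) == sum(run2.values()))  # True: totals match
print(ref == run2)                              # False: categories shifted
```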
With regard to FilterUnique specifically, it is on my agenda to write some fix, but that is all.
Currently, I am trying to find ways to reduce the bandwidth of the stripping campaign, and this is taking most of my time.
Good morning @avenkate, thanks a lot for your detailed reply.
I was indeed wondering about the 2 aspects :-). Personally I reckon that having part of the issues fixed and merged is better than waiting for the full thing, which is anything but a trivial task. As such, I would prioritise at this point the fix to FilterUnique, ensuring a test is in place and checking the lines that use it. Then I would move on to the remainder, which is even trickier. For the latter, it would be useful to list here, for bookkeeping, all lines that can be affected by their usage of StdJets (maybe this is already somewhere above in the long thread, I did not check).