S28r2 cannot give consistent counters locally (i.e. running the test twice produces different counters). The differences are small and do not affect the rates etc.
It may be worth updating this issue with the results of the rate tests that were reported by @atully at PAC today. At the same meeting, it was decided to proceed with marking the stripping24e2-28r2-patches branch obsolete, since this issue seems unrelated to the move to 2018-patches.
Here, I’ll break down the reproducibility issues seen with the StrippingBu2LLK line. These issues show up in the CounterMismatches sections of some nightly tests (for example, see test_stripping34r0p1_collision18_reco18 in test 1287) in Stripping34r0p1, though they likely affect other campaigns too.
The problem shows up in the B+ -> J/psi K*+, K*+ -> KS0 pi+ sector, where a slightly different number of K*+ candidates, and therefore a slightly different number of B+ candidates, are created from run to run.
Two sources of Ks0 are used: StdVeryLooseKsLL (the usual Ks0 container) and StdKs2PiPiLL (a container of Ks0 from Brunel). Both are passed (see lines 476-480) to the _makeKstarPlus function. Inside this function (see line 1028 onwards), a MergedSelection combines the two Ks0 sources, with Unique=True. The Unique=True flag activates FilterUnique, whose purpose is to reject cases where two Ks0 particles share exactly the same daughters. The FilterUnique code orders the Ks0 by memory address and keeps whichever of the duplicates comes last in memory; since memory addresses change from run to run, so does the survivor.
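The selection logic described above can be sketched in Python (a toy model for illustration only: `id()` stands in for the C++ pointer, and the candidate dictionaries and container names are made up; this is not the actual FilterUnique implementation):

```python
# Toy sketch of the FilterUnique behavior described above: candidates
# sharing the same daughter set are deduplicated, with ties broken by
# memory address, so the last one in address order wins.

def filter_unique(candidates):
    """Keep one candidate per daughter set; ties broken by memory address."""
    by_daughters = {}
    for cand in sorted(candidates, key=id):  # pointer ordering: run-dependent!
        key = frozenset(cand["daughters"])
        by_daughters[key] = cand  # the candidate at the highest address wins
    return list(by_daughters.values())

# Two Ks0 built from the same two pions but by different reconstructions:
ks_a = {"origin": "StdVeryLooseKsLL", "daughters": ("pi+#1", "pi-#2"), "vchi2": 0.323}
ks_b = {"origin": "StdKs2PiPiLL",     "daughters": ("pi+#1", "pi-#2"), "vchi2": 0.321}

survivors = filter_unique([ks_a, ks_b])
print(len(survivors))  # 1 -- the duplicate is always removed...
# ...but WHICH copy survives depends on id(ks_a) vs id(ks_b), i.e. on
# where the runtime happened to allocate them, which varies run to run.
```

The counter of Ks0 candidates is therefore stable; only the identity (and properties) of the survivor is not.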
Normally this would not be a problem, but it looks like slightly different reconstruction algorithms run to fill the Brunel container than the normal one. I verified this by printing out the vertex chi2 of the Ks0 candidates: when there are duplicates, they have slightly different properties, and FilterUnique randomly keeps one of them. Thus we always have the same number of Ks0 passing, but their properties differ slightly. This translates into a slightly different number of K*+ candidates passing the MotherCut in the CombineParticles inside _makeKstarPlus, and then into a slightly different number of B's.
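A minimal numerical illustration of that last step (the cut value 0.322 is made up purely to sit between the two chi2 values seen in the logs; the real MotherCut is on the K*+ combination):

```python
# Same Ks0 count either way, but a threshold-style cut can accept one
# near-duplicate and reject the other, changing downstream yields.

MOTHER_CUT = 0.322  # hypothetical threshold between the duplicates' chi2

def passes_mother_cut(vchi2):
    return vchi2 < MOTHER_CUT

run1_survivor = 0.321  # this run, FilterUnique kept the 0.321 copy
run2_survivor = 0.323  # next run, it kept the 0.323 copy instead

print(passes_mother_cut(run1_survivor))  # True
print(passes_mother_cut(run2_survivor))  # False -> one fewer K*+ candidate
```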
Below are snippets of logs (full logs stdout_1 and stdout_2 attached) from two consecutive runs of the stripping line on the same input data. I modified the FilterUnique code to print out the addresses and some other properties of the candidates in its input, good and filtered containers; the removal of duplicates happens between the good and filtered containers. We see a case of duplicate Ks0 with the same daughters (having keys 4 and 24): one has a vertex chi2 of 0.323, the other 0.321, and FilterUnique randomly saved a different one in each run. Note that in the snippets below the container is erroneously labeled MergedKstarsPlusLLForBu2LLK when in fact it contains Ks0.
Now, this just shows up for B+ -> J/psi K*+, K*+ -> Ks0 pi+ here, but in principle it affects all cases where Brunel candidates and normal candidates are combined with FilterUnique.
Maybe we can get @ibelyaev and @pkoppenb involved, since this is now going beyond stripping. Guys, could you have a look and maybe share your thoughts here?
Good catch!!! So we have to find another, deterministic way of filtering unique. Would it introduce an unacceptable bias if we always selected the StdVeryLooseKsLL option in the case where the candidates are otherwise identical?
They are almost identical; if they were fully identical, the random selection wouldn't matter. It's a kind of multiple-candidate selection, but a good one, as we can emulate it on signal (the annoying ones are those that have interference between signal and background). This selection has an efficiency on signal, which can be measured in MC. Giving priority to StdVeryLooseKsLL would also have an efficiency (likely almost the same) that can likewise be measured on signal MC. So no problem.
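For concreteness, the priority-based alternative could look something like this sketch (the function name and candidate layout are hypothetical, not the real FilterUnique API; the container paths are the TES locations of the two sources):

```python
# Deterministic dedup sketch: when two candidates share the same daughters,
# always keep the one from the preferred container rather than whichever
# happens to sit later in memory.

PREFERRED = "Phys/StdVeryLooseKsLL/Particles"

def filter_unique_deterministic(candidates):
    best = {}
    for cand in candidates:
        key = frozenset(cand["daughters"])
        prev = best.get(key)
        # Replace a previously kept duplicate only if the new one is preferred.
        if prev is None or (cand["origin"] == PREFERRED and prev["origin"] != PREFERRED):
            best[key] = cand
    return list(best.values())

ks_std    = {"daughters": ("pi+#1", "pi-#2"), "origin": PREFERRED}
ks_brunel = {"daughters": ("pi+#1", "pi-#2"), "origin": "Phys/StdKs2PiPiLL/Particles"}

# Same survivor regardless of input order or allocation addresses:
print(filter_unique_deterministic([ks_std, ks_brunel])[0]["origin"])
print(filter_unique_deterministic([ks_brunel, ks_std])[0]["origin"])
```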
Does the filter algorithm allow giving priority to a given container? Or does that need some C++ hacking?
Yes, in the sense that it's as random as selecting on pointer: it is just the order in which the particles are created and entered into the original container. But it needs to be checked whether the key is really unique: since this LHCb::Particle::ConstVector contains Particles coming from two different KeyedContainers, there may be duplicates (the key is set when a Particle is added to a KeyedContainer).
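The possible key collision can be sketched as follows (a toy model, assuming each KeyedContainer assigns keys independently starting from 0 when a particle is added, as described above):

```python
# Toy model of per-container key assignment: two containers each hand out
# keys from their own counter, so particles from different containers can
# carry the same key.

class KeyedContainer:
    def __init__(self, name):
        self.name = name
        self._next_key = 0
    def add(self, particle):
        particle["key"] = self._next_key  # key assigned on insertion
        particle["container"] = self.name
        self._next_key += 1

std_ks    = KeyedContainer("Phys/StdVeryLooseKsLL/Particles")
brunel_ks = KeyedContainer("Phys/StdKs2PiPiLL/Particles")

a, b = {}, {}
std_ks.add(a)
brunel_ks.add(b)

print(a["key"], b["key"])  # 0 0 -> duplicate keys across containers
# A deterministic ordering would need (container, key), not the key alone.
```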
This is all a really impressive catch @avenkate !
Is it simple enough to check that the lines involved in the failing ref matches all make use of this FilterUnique method? And is it only a problem when FilterUnique is checking against Brunel containers, or any time the method is used? (Are there other cases where it's checking against two different types of containers?)
Hopefully they are all related to this, and we don't have to search for a second, unrelated issue.
Awesome job!
I have only scratched the surface with the QEE lines (i.e. the StdJets related failures) but I believe it is unrelated to FilterUnique.
The usage of FilterUnique by itself is not enough to cause these problems. You would need two containers with overlapping candidates that have slightly different reconstruction.
As for other places where FilterUnique might be used in Stripping, a quick search for Unique=True and Unique = True yields StrippingInclusiveDoubleD as the only other instance. But I would need to spend more time looking at it to know if it is problematic.
Hello @avenkate, as I'm back from hols and the campaign is going full steam, I thought I would ask for a little update on this important matter. We should really try hard to have this saga finalised before the full production starts after the summer, right? Thanks.
Hi @erodrigu, are you referring to the reproducibility issues in general, or to FilterUnique specifically?
For the former, I am making progress with the StdJets related issues. The difference in the number of warnings arises because a different number of particles get filled into the Phys/PFParticles TES location. This TES location gets filled in stages (in Phys/JetAccessories/ParticleFlow.cpp), with loops over charged particles, VELO particles, merged pi0s, resolved pi0s, photons, HCAL clusters and neutral-recovery particles. I have narrowed down the culprits to the last three, i.e. photons, HCAL clusters and neutral recovery: the number of particles added in these categories is what varies from run to run. I have dived deeper into photons, in the function treatPhotons. The reproducibility issues with photons arise at line 989,
```cpp
if ( ! hypoClusters[0] || m_PFCaloClusters[hypoClusters[0].target()->seed().all()]->status() != LHCb::PFCaloCluster::Available ) continue;
```
because the status of the cluster flips between 0 (Available) and 1 (AvailableForNeutralRecovery), for reasons I am yet to understand. Then there are still the HCAL clusters and neutral recovery, which I haven't gotten to yet. As you can see, this beast is more complicated than the FilterUnique case, and it also operates more sinisterly: I see cases where there is one less particle from, say, photons and one more from the HCAL clusters, so the total matches the ref, but something has still gone wrong underneath.
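That compensating shift can be illustrated with a toy example (the numbers are made up; only the pattern matters), showing why a check on the total alone cannot catch it:

```python
# One fewer photon, one more HCAL cluster: the total agrees with the
# reference even though the per-category content differs.

ref  = {"photons": 10, "hcal_clusters": 5, "neutral_recovery": 3}
run2 = {"photons": 9,  "hcal_clusters": 6, "neutral_recovery": 3}

print(sum(ref.values()) == sum(run2.values()))  # True: totals match
print(ref == run2)                              # False: categories shifted
```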
With regard to FilterUnique specifically, it is on my agenda to write some fix, but that is all.
Currently, I am trying to find ways to reduce the bandwidth of the stripping campaign, and this is taking most of my time.
Good morning @avenkate, thanks a lot for your detailed reply.
I was indeed wondering about the 2 aspects :-). Personally I reckon that having part of the issues fixed and merged is better than waiting for the full thing, which is anything but a trivial task. As such, I would prioritise at this point the fix to FilterUnique, ensuring a test is in place and checking the lines that use it. Then I would move on to the remainder, which is even trickier. For the latter, it would be useful to list here, for bookkeeping, all lines that can be affected by their usage of StdJets (maybe this is already somewhere above in the long thread, I did not check).