This was there before !685 (merged). I believe the error was introduced with !561 (merged), where other data was introduced as part of the test (in that MR's pipeline tab, the penultimate test shows it, for instance: https://gitlab.cern.ch/lhcb/Allen/-/jobs/15914972). I suggest updating the description of the issue.
Since then, the physics efficiency tests have been failing intermittently, both for the basic pipeline and the full pipeline. I thought an issue already existed, but to my surprise that is not the case.
One thing I noticed looking at different pipelines is that we have two effects. As you say, we sometimes see that 0.01% difference in the hit efficiency in either the basic or the full pipeline. But now I also see a difference in Hlt1TwoTrackMVA_Restricted, Hlt1TwoTrackMVA_Non_Restricted and Inclusive (see https://gitlab.cern.ch/lhcb/Allen/-/jobs/17508160 or https://gitlab.cern.ch/lhcb/Allen/-/jobs/17446629). Does this second problem also come from !561 (merged)? Or does it actually come from !685 (merged)?
Which means that !685 (merged) did not introduce this issue. Also please note that !666 (merged) definitely did not introduce the issue either, considering the nature of that MR, so it must have existed before. The hit purity issue is also observed in:
The hit purity issue is observed more often, although this might be because it's seen in the basic pipeline whereas the other is only observed in the full pipeline, which is triggered less frequently. If we are lucky, both come from a single issue.
The differences in Hlt1TwoTrackMVA efficiency are huge and I'm having trouble finding pipelines that pass this test, so this looks to me like the complex sequence validation reference files are just incorrect. Maybe they weren't updated as part of the change from a cut-based selection to the NN-based selection? Or maybe they were updated, but then were somehow reverted during the confusion around the CI pipeline? @dovombru @nnolte does this sound possible?
I did update the refs at the time, but I remember that some numbers changed vastly after another MR went in; I don't quite remember which one that was. (It used to have a higher efficiency than the catboost variant.)
Anyway, I can test this and create an MR with an update. @dcampora the reference file in my local copy of !672 (merged) has numbers that agree with the failing MRs.
One possible explanation for the hit purity counters being wrong is how they are calculated: by re-weighting the previous result and adding the new track each time. This could lead to different results when the track order differs. !710 (merged) improves the stability of this calculation.
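To illustrate the order dependence, here is a minimal Python sketch. It is not the Allen code and not necessarily what !710 does: the names `running_mean`, `exact_mean`, `to_purities` and the track values are all made up for the example. It compares an incremental, re-weighted average with one that sums exact fractions and rounds only once at the end.

```python
from fractions import Fraction


def running_mean(purities):
    """Incremental update: re-weight the previous mean, then add the new track."""
    mean = 0.0
    for n, purity in enumerate(purities, start=1):
        # Every step rounds, so the accumulated error depends on the track order.
        mean = mean * (n - 1) / n + purity / n
    return mean


def exact_mean(tracks):
    """Order-independent alternative: sum exact fractions, round once at the end."""
    total = sum(Fraction(matched, hits) for matched, hits in tracks)
    return float(total / len(tracks))


def to_purities(tracks):
    """Per-track hit purity as a float: matched hits / hits on the track."""
    return [matched / hits for matched, hits in tracks]


# Hypothetical tracks as (hits matched to the MC particle, hits on the track).
tracks = [(3, 3), (5, 6), (7, 9), (4, 4), (10, 11), (2, 3)] * 500
reordered = sorted(tracks)  # same tracks, different order

# The two running means may differ in the last digits; the exact means cannot.
print(running_mean(to_purities(tracks)) - running_mean(to_purities(reordered)))
print(exact_mean(tracks) - exact_mean(reordered))
```

Any difference in the running mean is tiny per event, but it could be enough to flip the last printed digit of a counter, which would match a 0.01% change in the validation output.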
@thboettc from the first attempts it looks like !710 (merged) may solve the hit purity issue. I just updated the reference files there (this includes an update to the output of the complex sequence). Feel free to take over this MR in case it is useful.