The hlt2_protoparticles_baseline test has been failing since May 13 (lhcb-master/1681). The incomplete reference update seems to have been absorbed in the meantime by other reference updates.
However, in today's lhcb-master/1691, the -dbg test fails whereas the -opt test passes. This inconsistency must have started some time ago, and could also explain why reference updates for this test keep going bad (forcing an update in that particular case does not help).
Also, in today's build and the last one (May 21), the test took 18k seconds in -dbg; in the build before that it took 10 minutes. The same is true in lhcb-head, so it is probably a real issue.
I think (part of) the problem is that it simply generates too much text output. In my browser I see:
I changed the formatting of the output so that it is easier to spot values in the sea of text, but maybe one should just reduce the number of events it runs over.
It seems that validation against the reference is extremely slow, and to be more precise, difflib.ndiff is. It is fast when there are no differences but slow otherwise. There are a few issues to address:
- Make the diff faster and/or simply give up if the input is too big.
- Reduce the number of lines printed (fewer events, and/or fewer lines per event).
- Do not allow a validator to run for that long: gaudi/Gaudi#222
- Understand why the test on -dbg produces different numbers.
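A minimal sketch of the "give up if the input is too big" idea (the `safe_ndiff` wrapper and the `MAX_DIFF_LINES` threshold are hypothetical, not existing Moore code):

```python
import difflib

# Hypothetical guard: fall back to a one-line summary instead of
# letting difflib.ndiff churn for hours on huge outputs.
MAX_DIFF_LINES = 500  # assumed threshold; tune to taste

def safe_ndiff(expected, actual):
    """Run ndiff, but give up with a summary line on huge inputs."""
    if max(len(expected), len(actual)) > MAX_DIFF_LINES:
        return [
            "(diff skipped: %d vs %d lines exceeds the %d-line limit)"
            % (len(expected), len(actual), MAX_DIFF_LINES)
        ]
    return list(difflib.ndiff(expected, actual))
```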
I was discussing this problem with @sponce a few days ago and kind of came to the same conclusion: the diff takes too long, mostly because we have HUGE stdout.
We could make the diff faster by passing None as the charjunk argument to difflib.ndiff, but that will only help up to a point.
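For reference, the charjunk tweak mentioned above is just a keyword argument; by default ndiff uses `IS_CHARACTER_JUNK` during the intra-line comparison, and passing `None` disables that heuristic (whether this helps enough will depend on the input):

```python
import difflib

a = ["value = 1.2345\n"]
b = ["value = 1.2346\n"]

# Default: charjunk=difflib.IS_CHARACTER_JUNK (treats blanks/tabs as
# junk in the intra-line matching).
default_diff = list(difflib.ndiff(a, b))

# With charjunk=None the per-character junk test is skipped.
no_junk_diff = list(difflib.ndiff(a, b, charjunk=None))
```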
If we really want to compare structured information between the test job and the reference, we should stop using stdout, dump the data in some more efficient format, and compare that to what we expect.
From a WP2 point of view, I think comparing fewer events (2 or 5) should be OK.
I mean, at the moment we cannot do a reasonable comparison at all, so anything is an improvement.
I've run into this myself with some Rich tests recently. I am not convinced we can just blame huge output, though. Sure, for some tests the output is perhaps too verbose, but for the tests I had issues with it was not that big. Also, just running diff at the command line on the two logs, after the fact, was more or less instantaneous. So why is the diff so very, very slow when run inside the tests?
This Python bug report (from 2009) gives a nice overview of the issues: https://bugs.python.org/issue6931#msg223459. I think the only logical conclusion to draw from it is "don't use difflib.ndiff for anything beyond a few hundred lines".
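For comparison, `difflib.unified_diff` does only line-level matching (no intra-line "fancy replace" pass like ndiff), so it tends to scale much better on large inputs; a quick sketch:

```python
import difflib

# Synthetic example: two 1000-line logs differing in one line.
expected = ["event %d : x = %d\n" % (i, i) for i in range(1000)]
actual = list(expected)
actual[500] = "event 500 : x = 999\n"

# unified_diff compares whole lines only, which is far cheaper than
# ndiff's character-level refinement of changed lines.
diff = list(difflib.unified_diff(expected, actual,
                                 fromfile="ref", tofile="new"))
```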
I merged a mitigation: !1577 (merged). As written there, the GroupMessages preprocessor led to ndiff trying to compare very long lines, which is apparently a problem for it.
This only solves the long runtime. The underlying issue remains: -dbg builds produce different output compared to -opt builds. See output_diff
It only solves it for the one Moore test here, not for all tests, such as the ones in Rec I have problems with. As @raaij says, the real solution is to migrate away from using difflib.ndiff at all.
I know I'm repeating myself, but comparing stdout is a temporary workaround waiting to have proper tests.
If you have to dump a couple of MB of numbers to compare them with the expected values, you should not (ab)use the stdout comparison machinery; go for JSON, CSV, or ROOT. If you dump JSON you can read it back easily in Python and use existing Python tools to get a nice comparison between dictionaries, like https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertEqual.
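A minimal sketch of that JSON-based approach (the counter names and the `CompareCounters` test case are made up for illustration):

```python
import json
import unittest

# Pretend the reference and the test job's dump are both JSON objects
# of counters; names here are illustrative only.
reference = {"nTracks": 42, "nPVs": 3}
blob = json.dumps({"nTracks": 42, "nPVs": 3})  # what the job would write

class CompareCounters(unittest.TestCase):
    def test_counters(self):
        # assertEqual on dicts reports a readable per-key diff on failure,
        # instead of a slow line-by-line text diff of stdout.
        self.assertEqual(json.loads(blob), reference)
```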
Everyone is right here; the "real solution" is many-fold.
@jonrob, could you please give more details about the issue you saw? Where can we see it and how can we reproduce it? You could also open a new issue with the failing build label if appropriate.
(the mitigation I merged reduces test runtimes by 5 hours, something I would consider rather urgent)
I've been seeing it with the RichDetector tests in LHCb, in particular the instantiate test. If anything changes, the diffs take ages to compute, and my workaround has been to locally delete the ref before rerunning the test, basically creating it from scratch.
Yes, the ref here is ~1 MB in size, but I still maintain diffing it shouldn't take as long as it does. In this case I've crafted the printout this test makes specifically to work with this test, so while yes, I am relying on stdout diffing, IMHO it's not a bad approach to take for this test.