The hlt2_protoparticles_baseline test has been failing since May 13 (lhcb-master/1681). The incomplete reference update seems to have been absorbed in the meantime by other reference updates.
However, in today's lhcb-master/1691, the -dbg test fails whereas the -opt test passes. This inconsistency must have started some time ago, and could also explain why reference updates for this test keep going bad (forcing an update in that particular case does not help).
Also, in today's build and the last one (May 21), the test took 18k seconds in -dbg; in the build before that it took 10 minutes. The same is true in lhcb-head, so it is probably a real issue.
I think (part of) the problem is that it simply generates too much text output. In my browser I see:
I changed the formatting of the output so that it is easier to spot values in the sea of text, but maybe one should just reduce the number of events it runs over.
It seems that validation against the reference is extremely slow, and to be more precise, difflib.ndiff is. It is fast when there are no differences but slow otherwise. There are a few issues to address:
- Make the diff faster and/or simply give up if the input is too big.
- Reduce the number of lines printed (fewer events, and/or fewer lines per event).
- Do not allow a validator to run for that long: gaudi/Gaudi#222
- Understand why the test on -dbg produces different numbers.
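A minimal sketch of the "give up if the input is too big" idea (the `safe_ndiff` wrapper and the `MAX_DIFF_LINES` threshold are hypothetical, not existing Moore code):

```python
import difflib

# Hypothetical guard: fall back to a one-line summary instead of
# letting difflib.ndiff churn for hours on huge outputs.
MAX_DIFF_LINES = 500  # assumed threshold; tune to taste

def safe_ndiff(expected, actual):
    """Run ndiff, but give up with a summary line on huge inputs."""
    if max(len(expected), len(actual)) > MAX_DIFF_LINES:
        return [
            "(diff skipped: %d vs %d lines exceeds the %d-line limit)"
            % (len(expected), len(actual), MAX_DIFF_LINES)
        ]
    return list(difflib.ndiff(expected, actual))
```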
I was discussing this problem with @sponce a few days ago and kind of came to the same conclusion: the diff takes too long, mostly because we have HUGE stdout.
We could make the diff faster by passing None as the charjunk argument to difflib.ndiff, but that will only help up to a point.
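For reference, the charjunk tweak mentioned above is just a keyword argument; by default ndiff uses `IS_CHARACTER_JUNK` during the intra-line comparison, and passing `None` disables that heuristic (whether this helps enough will depend on the input):

```python
import difflib

a = ["value = 1.2345\n"]
b = ["value = 1.2346\n"]

# Default: charjunk=difflib.IS_CHARACTER_JUNK (treats blanks/tabs as
# junk in the intra-line matching).
default_diff = list(difflib.ndiff(a, b))

# With charjunk=None the per-character junk test is skipped.
no_junk_diff = list(difflib.ndiff(a, b, charjunk=None))
```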
If we really want to compare structured information between the test job and the reference, we should stop using stdout, dump the data in some more efficient format, and compare that to what we expect.
From a WP2 point of view, I think comparing fewer events (2 or 5) should be OK.
I mean, at the moment we cannot do a reasonable comparison at all, so anything is an improvement.
I've run into this myself with some Rich tests recently. I am not convinced we can just blame huge output, though. Sure, for some tests the output is perhaps too verbose, but for the tests I had issues with it was not that big. Also, just running diff at the command line on the two logs, after the fact, was more or less instantaneous. So why is the diff so very, very slow when run inside the tests?
This Python bug report (from 2009) gives a nice overview of the issues: https://bugs.python.org/issue6931#msg223459. I think the only logical conclusion to draw from it is "don't use difflib.ndiff for anything beyond a few hundred lines".
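For comparison, `difflib.unified_diff` does only line-level matching (no intra-line "fancy replace" pass like ndiff), so it tends to scale much better on large inputs; a quick sketch:

```python
import difflib

# Synthetic example: two 1000-line logs differing in one line.
expected = ["event %d : x = %d\n" % (i, i) for i in range(1000)]
actual = list(expected)
actual[500] = "event 500 : x = 999\n"

# unified_diff compares whole lines only, which is far cheaper than
# ndiff's character-level refinement of changed lines.
diff = list(difflib.unified_diff(expected, actual,
                                 fromfile="ref", tofile="new"))
```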
I merged a mitigation: !1577 (merged). As written there, the GroupMessages preprocessor led to ndiff trying to compare very long lines, which is apparently a problem for it.
This only solves the long runtime. The underlying issue remains: -dbg builds produce different output compared to -opt builds. See output_diff
It only solves it for the one Moore test here, not for all tests, such as the ones in Rec I have problems with. As @raaij says, the real solution is to migrate away from using difflib.ndiff at all.
I know I'm repeating myself, but comparing stdout is a temporary workaround waiting to have proper tests.
If you have to dump a couple of MB of numbers to compare them with the expected values, you should not (ab)use the stdout comparison machinery; go for JSON, CSV, or ROOT. If you dump JSON you can read it back easily in Python and use existing Python tools to get a nice comparison between dictionaries, like https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertEqual.
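A minimal sketch of that JSON-based approach (the counter names and the `CompareCounters` test case are made up for illustration):

```python
import json
import unittest

# Pretend the reference and the test job's dump are both JSON objects
# of counters; names here are illustrative only.
reference = {"nTracks": 42, "nPVs": 3}
blob = json.dumps({"nTracks": 42, "nPVs": 3})  # what the job would write

class CompareCounters(unittest.TestCase):
    def test_counters(self):
        # assertEqual on dicts reports a readable per-key diff on failure,
        # instead of a slow line-by-line text diff of stdout.
        self.assertEqual(json.loads(blob), reference)
```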
Everyone is right here; the "real solution" is many-fold.
@jonrob, could you please give more details about the issue you saw? Where can we see it and how can we reproduce it? You could also open a new issue with the failing build label if appropriate.
(the mitigation I merged reduces test runtimes by 5 hours, something I would consider rather urgent)
I've been seeing it with the RichDetector tests in LHCb, in particular the instantiate test. If anything changes, the diffs take ages to compute, and my workaround has been to locally delete the ref before rerunning the test, basically creating it from scratch.
Yes, the ref here is ~1 MB in size, but I still maintain diffing it shouldn't take as long as it does. In this case I've crafted the printout this test makes specifically to work with this test, so while yes, I am relying on stdout diffing, IMHO it's not a bad approach to take for this test.