Fix NaN ratios in MooreOnline throughput tests
As I commented in lhcb/Moore#786 (closed), if there is more than one match I think just removing the check and picking one of them (the first?) is far from ideal. Better to arrange for the regex to only match the correct one.
The 2 lines look like this:
08:50:10.011 emulator.py:439 INFO Average total throughput: Evts/s = 167.2
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
They are from two successive processes that run the same application, but in any case the order never changes (the first one is the throughput run, the second the profile run).
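As a minimal sketch of why the multiple matches matter, assuming the LHCbPR handler does something findall-like over the combined log (the pattern and variable names are illustrative, not the real handler code):

```python
import re

# Combined log containing both runs (throughput first, then profile).
LOG = """\
08:50:10.011 emulator.py:439 INFO Average total throughput: Evts/s = 167.2
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
"""

PATTERN = re.compile(r"Average total throughput: Evts/s = ([\d.]+)")

matches = PATTERN.findall(LOG)
# The existing check insists on exactly one match, so with two runs it fails.
# Dropping the check would silently take matches[0] (the throughput run),
# which happens to be correct today but is not guaranteed to stay correct.
if len(matches) != 1:
    raise ValueError(f"expected exactly one throughput line, found {len(matches)}")
throughput = float(matches[0])
```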
Edited by Arthur Marius Hennequin

Does the second run, the profile one, need to report the throughput via the INFO message
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
??
If not, then for me the better fix here is to turn off this printout for that run of the task: either globally, i.e. limit everything to WARNING or above, or, if other INFO messages are needed, turn it off just for the component in question.
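If the message turns out to come from a plain Python logger rather than a Gaudi component, the per-component option could be a one-liner along these lines (the logger name is a guess):

```python
import logging

# Raise the threshold only for the emulator's logger, so its INFO messages
# (including the throughput printout) are dropped while everything else is kept.
logging.getLogger("MooreScripts.testbench.emulator").setLevel(logging.WARNING)
```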
Let's test with this fix and re-discuss if we see failures in other tests. There is no point in trying to fix what is not proven to be broken. Anyway, this only concerns the 3 throughput messages that get reported when we run the ci-test; 2 out of 3 are currently broken, so this is unlikely to make things worse.
My primary objection is that by removing the check you are introducing the potential for unknown issues in the future. You cannot say for sure that at some point some other change will again result in multiple matches, but this time with the first one not being the correct one to take. No one can guarantee that will never happen. If you remove the check, the code will silently carry on reporting the first match. That is the reason I object to removing a check which is doing precisely what it was designed to do.
In this case, the correct fix, for me, is to arrange for there to be only one match, by preventing the runs where the throughput values are not needed from printing them in the first place.
Edited by Christopher Rob Jones

Here is where the throughput is printed: https://gitlab.cern.ch/lhcb/MooreOnline/-/blob/master/MooreScripts/python/MooreScripts/testbench/emulator.py?ref_type=heads#L436
Here is where the 2 jobs are run with the testbench: https://gitlab.cern.ch/lhcb-datapkg/PRConfig/-/blob/master/scripts/benchmark-scripts/MooreOnline_hlt2_pp_2024_data.sh
We could remove the print if the --use-perf-control flag is active, but this sounds to me like a more invasive change that could impact more tests... I don't know enough about this system to make a decision; I'll wait for input from @raaij @gunther or @rmatev

Does that line not imply that, if the handler only looks in ThroughputTest.log, it won't find anything from the profile run? Isn't that the intention (the comment seems to imply this)?

OK, I see now the problem. I guess the handler does not get these individual logs, but somehow has to parse a combined log which has them all munged together.
It still seems to me that just arranging for the perf run of the task to not print
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
is the best way to go here.
Edited by Christopher Rob Jones

What output precisely is needed from the perf profile run of the task? If all that is needed from it is the perf results, and no 'Gaudi' messages are needed, then we could simply use
to set the global Gaudi output level to 4 (WARNING) for the profile run.
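As a minimal sketch of that idea, assuming the profile run is configured through standard Gaudi Python configurables (the exact snippet referred to above is not shown in this thread):

```python
from Gaudi.Configuration import ApplicationMgr, WARNING

# Level 4 (WARNING): suppress INFO and below for every Gaudi component globally.
ApplicationMgr().OutputLevel = WARNING
```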
That assumes of course the INFO above is coming from a Gaudi component, which I am not 100% sure on.
The INFO is coming from https://gitlab.cern.ch/lhcb/MooreOnline/-/blob/master/MooreScripts/python/MooreScripts/testbench/emulator.py?ref_type=heads#L439 and indeed, I don't think the perf task needs this.
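As a rough sketch of that, assuming the emulator knows whether it is running the perf/profile job (the names below are guesses based on the --use-perf-control flag mentioned earlier, not the actual emulator.py code):

```python
import logging

log = logging.getLogger(__name__)


def report_throughput(events_per_second: float, use_perf_control: bool) -> None:
    """Print the average throughput, except for the perf/profile run."""
    # Hypothetical guard: skip the INFO printout when this invocation is the
    # profile job, so only the throughput run produces the line the handler parses.
    if use_perf_control:
        return
    log.info("Average total throughput: Evts/s = %.1f", events_per_second)
```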
@ahennequ Maybe this option could be used (or extended) to do what we want here.
Currently the script sets this to 300 for the perf run. Does this need to be set at all for the perf run? If we set it to zero (or less), does that disable measuring throughput completely without affecting what the perf run needs?
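A minimal sketch of the "zero disables it" idea, assuming the value is a measurement time in seconds (all names here are invented for illustration):

```python
def maybe_report_throughput(events_processed: int, measure_for: float) -> None:
    """Hypothetical helper: a non-positive setting disables the throughput
    measurement and its INFO printout entirely, e.g. for the perf run."""
    if measure_for <= 0:
        return
    rate = events_processed / measure_for
    print(f"Average total throughput: Evts/s = {rate:.1f}")
```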
Edited by Christopher Rob Jones

I've opened lhcb/MooreOnline!465 (merged) as an alternative fix for this issue. You can choose to merge this one or the other.
mentioned in issue lhcb/Moore#786 (closed)
For the HLT1 GPU throughput test the issue is that the upload of the log files, including the one for the reference run, fails due to a problem with credentials. @joel has been looking at it.
mentioned in merge request lhcb/MooreOnline!465 (merged)
mentioned in merge request lhcb/Allen!1679 (merged)