Fix NaN ratios in MooreOnline throughput tests
As I commented in lhcb/Moore#786 (closed), if there is more than one match I think just removing the check and picking one of them (the first?) is far from ideal. Better to arrange for the regex to only match the correct one.
The 2 lines look like this:
08:50:10.011 emulator.py:439 INFO Average total throughput: Evts/s = 167.2
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
They are from two successive processes that run the same application, but in any case the order never changes (the first one is the throughput run, the second the profile run).
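As a minimal sketch of why the multiple matches matter, assuming the LHCbPR handler does something findall-like over the combined log (the pattern and variable names are illustrative, not the real handler code):

```python
import re

# Combined log containing both runs (throughput first, then profile).
LOG = """\
08:50:10.011 emulator.py:439 INFO Average total throughput: Evts/s = 167.2
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
"""

PATTERN = re.compile(r"Average total throughput: Evts/s = ([\d.]+)")

matches = PATTERN.findall(LOG)
# The existing check insists on exactly one match, so with two runs it fails.
# Dropping the check would silently take matches[0] (the throughput run),
# which happens to be correct today but is not guaranteed to stay correct.
if len(matches) != 1:
    raise ValueError(f"expected exactly one throughput line, found {len(matches)}")
throughput = float(matches[0])
```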
Edited by Arthur Marius Hennequin

Does the second run, the profile one, need to report the throughput via the INFO message
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
??
If not, then for me the better fix here is to turn off this printout for that run of the task: either globally, i.e. limit everything to WARNING or above, or, if other INFO messages are needed, turn it off just for the component in question.
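If the message turns out to come from a plain Python logger rather than a Gaudi component, the per-component option could be a one-liner along these lines (the logger name is a guess):

```python
import logging

# Raise the threshold only for the emulator's logger, so its INFO messages
# (including the throughput printout) are dropped while everything else is kept.
logging.getLogger("MooreScripts.testbench.emulator").setLevel(logging.WARNING)
```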
Let's test with this fix and re-discuss if we see failures in other tests. There is no point in trying to fix what is not proven to be broken. Anyway, this only concerns the 3 throughput messages that get reported when we run the ci-test; 2 out of 3 are currently broken, so this is unlikely to make things worse.
My primary objection is that by removing the check you are introducing the potential for unknown issues in the future. You cannot say for sure that at some point some other change will again result in multiple matches, but this time with the first one not being the correct one to take. No one can guarantee that will never happen. If you remove the check, the code will silently carry on reporting the first match. That is the reason I object to removing a check which is doing precisely what it was designed to do.
In this case, the correct fix, for me, is to arrange for there to be only one match, by preventing the runs where the throughput values are not needed from printing them in the first place.
Edited by Christopher Rob Jones

Here is where the throughput is printed: https://gitlab.cern.ch/lhcb/MooreOnline/-/blob/master/MooreScripts/python/MooreScripts/testbench/emulator.py?ref_type=heads#L436
Here is where the 2 jobs are run with the testbench: https://gitlab.cern.ch/lhcb-datapkg/PRConfig/-/blob/master/scripts/benchmark-scripts/MooreOnline_hlt2_pp_2024_data.sh
We could remove the print if the --use-perf-control flag is active, but this sounds to me like a more invasive change that could impact more tests... I don't know enough about this system to make a decision; I'll wait for input from @raaij @gunther or @rmatev

Does that line not imply that, if the handler only looks in ThroughputTest.log, it won't find anything from the profile run? Isn't that the intention (the comment seems to imply this)?

OK, I see now the problem. I guess the handler does not get these individual logs, but somehow has to parse a combined log which has them all munged together.
It still seems to me that just arranging for the perf run of the task to not print
09:01:15.007 emulator.py:439 INFO Average total throughput: Evts/s = 97.1
is the best way to go here.
Edited by Christopher Rob Jones

What output precisely is needed from the perf profile run of the task? If all that is needed from it is the perf results, and no 'Gaudi' messages are needed, then we could simply use
to set the global Gaudi output level to 4 (WARNING) for the profile run.
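As a minimal sketch of that idea, assuming the profile run is configured through standard Gaudi Python configurables (the exact snippet referred to above is not shown in this thread):

```python
from Gaudi.Configuration import ApplicationMgr, WARNING

# Level 4 (WARNING): suppress INFO and below for every Gaudi component globally.
ApplicationMgr().OutputLevel = WARNING
```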
That assumes of course the INFO above is coming from a Gaudi component, which I am not 100% sure on.
The INFO is coming from https://gitlab.cern.ch/lhcb/MooreOnline/-/blob/master/MooreScripts/python/MooreScripts/testbench/emulator.py?ref_type=heads#L439 and indeed, I don't think the perf task needs this.
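As a rough sketch of that, assuming the emulator knows whether it is running the perf/profile job (the names below are guesses based on the --use-perf-control flag mentioned earlier, not the actual emulator.py code):

```python
import logging

log = logging.getLogger(__name__)


def report_throughput(events_per_second: float, use_perf_control: bool) -> None:
    """Print the average throughput, except for the perf/profile run."""
    # Hypothetical guard: skip the INFO printout when this invocation is the
    # profile job, so only the throughput run produces the line the handler parses.
    if use_perf_control:
        return
    log.info("Average total throughput: Evts/s = %.1f", events_per_second)
```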
@ahennequ Maybe this option could be used (or extended) to do what we want here.
Currently the script sets this to 300 for the perf run. Does this need to be set at all for the perf run? If we set it to zero (or less), does that disable measuring throughput completely without affecting what the perf run needs?
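A minimal sketch of the "zero disables it" idea, assuming the value is a measurement time in seconds (all names here are invented for illustration):

```python
def maybe_report_throughput(events_processed: int, measure_for: float) -> None:
    """Hypothetical helper: a non-positive setting disables the throughput
    measurement and its INFO printout entirely, e.g. for the perf run."""
    if measure_for <= 0:
        return
    rate = events_processed / measure_for
    print(f"Average total throughput: Evts/s = {rate:.1f}")
```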
Edited by Christopher Rob Jones

I've opened lhcb/MooreOnline!465 (merged) as an alternative fix for this issue. You can choose to merge this one or the other.
mentioned in issue lhcb/Moore#786 (closed)
For the HLT1 GPU throughput test the issue is that the upload of the log files, including the one for the reference run, fails due to a problem with credentials. @joel has been looking at it.
mentioned in merge request lhcb/MooreOnline!465 (merged)
mentioned in merge request lhcb/Allen!1679 (merged)